
Improvization of Malayalam Speech Output in Espeak Text-To-Speech Synthesizer

The document discusses improving the Malayalam speech output of the eSpeak text-to-speech synthesizer. Issues with the current version include incorrect pronunciation of some words and numbers. The authors framed letter-to-sound rules for Malayalam and modified vowel durations to improve rhythm. An evaluation found their modified version produced more intelligible speech than the original eSpeak 1.47.10 version.

Uploaded by

deepapgopinath

Improvization of Malayalam speech output in eSpeak text-to-speech synthesizer

Neethu S. Nair
College of Engineering Trivandrum
Thiruvananthapuram-695016
[email protected]

Deepa P. Gopinath
College of Engineering Trivandrum
Thiruvananthapuram-695016
[email protected]

Abstract

A text-to-speech (TTS) synthesis system aims to generate speech corresponding to a given input text. The greatest challenge in a TTS system is the generation of natural-sounding speech. Among the existing TTS systems, eSpeak is an open source software synthesizer that supports many foreign as well as Indian languages, including Malayalam. It is popular among visually impaired people for screen reading because of its low resource requirements and quick response. The Malayalam speech synthesized by the current version of eSpeak (1.47.10) lacks intelligibility. In this paper we therefore attempt to improve the Malayalam speech output of eSpeak. Letter-to-sound rules were framed and implemented. Problems with the pronunciation of numbers have been rectified. The rhythm of speech has also been improved to some extent by modifying the vowel durations. An evaluation test was conducted to compare the speech outputs of the existing and modified versions of the eSpeak TTS synthesizer. Considerable improvement was observed compared to version 1.47.10.

1. Introduction

Speech is the primary means of communication and interaction between people. Automatic generation of speech from text, referred to as text-to-speech (TTS) synthesis, has been gaining significant interest in commercial applications such as talking aids for vocally handicapped people, training and educational aids, and reading aids for visually handicapped people. Recent progress in speech synthesis has produced synthesizers with high intelligibility for some major languages like English, but sound quality and naturalness remain a major problem. Synthesis of natural-sounding speech depends on how well the duration and intonation patterns are imposed on the synthesized speech. Durational variation is incorporated in text-to-speech synthesis systems using duration models, which predict the duration of individual segments by considering various factors affecting the duration of the speech units. Duration models help add naturalness to the synthesized speech.

Duration models are generally of two types: rule-based and statistical. Rule-based methods require an intensive analysis of segment durations for framing rules and are preferred for small databases. Klatt developed the first rule-based model in 1976 [2]. Statistical models make use of a large amount of recorded speech data to train models such as Sum of Products (SOP) models [9], neural network models [8] and Classification and Regression Trees (CART) [7]. The duration patterns of each language are different, and even within one language the patterns vary for different styles, which means that textual information alone is not enough to produce natural-sounding speech [4]. Duration models are incorporated in a TTS system to improve the naturalness of speech.

A TTS system has two major parts: a natural language processing (NLP) module, which reads the input text and translates it into a phonetic language that specifies exactly how each word is to be pronounced (the letter-to-sound module), and a digital signal processing (DSP) module, which converts the phonetic language into spoken speech. For natural-sounding speech synthesis, it is desirable that the text processing component produces an appropriate sequence of phonemic units corresponding to an arbitrary input text.

Malayalam text-to-speech systems available today generate poor-quality voice. In the last few years, efforts towards the development of a TTS system for Malayalam have been increasing. These efforts are fragmented, with many different agencies working toward this goal using different sets of technologies. One major challenge in achieving this goal has been the lack of linguistic details in a usable and documented form. In addition, various analyses have to be done on actual human voice to find out the parameters required to synthesize voice.

eSpeak [3] is one of the TTS systems that synthesizes many foreign as well as Indian languages, including Malayalam. There are several issues with the Malayalam language output in the current version (eSpeak 1.47.10). This paper aims at improving the Malayalam speech output of

eSpeak.

The rest of the paper is organized as follows. Section 2 discusses some of the existing Malayalam TTS systems. Section 3 outlines the issues in the current eSpeak 1.47.10. Section 4 describes the methodology. Section 5 discusses the results. Section 6 evaluates the results using the Mean Opinion Score. The status of the modified eSpeak is discussed in Section 7.

2. Related Works

The literature shows that the following Malayalam speech engines are available, each with some advantages and limitations.

• Swaram: A joint project of Kerala State IT Mission and the Society for Promotion of Alternative Computing and Employment (SPACE), designed by INSIGHT. It can be used for listening to any written work in Malayalam. Any type of file that supports the Unicode format can be used with this software. Its advantages are that native speakers are involved in the development and that it is small (3-4 MB).

• ML-TTS: It works with both Windows and Linux. ML-TTS was developed through the efforts of IIT Madras, IIT Hyderabad, C-DAC Trivandrum and Mumbai. It has the advantages of intelligibility and the involvement of native speakers in development. Its limitations include a larger size (2 GB), less mobility, slow operation and no speed control.

• Dhvani: Developed by the Simputer Trust, headed by Dr. Ramesh Hariharan, at the Indian Institute of Science, Bangalore, in 2000.

• eSpeak: Developed by Jonathan Duddington. It was originally known as Speak and was written for Acorn/RISC OS computers starting in 1995. eSpeak is speech synthesizer software for various foreign as well as Indian languages, including Malayalam. It uses the formant synthesis method, which allows many languages to be provided in a small size of just 2 MB. The speech is clear and can be used at high speeds, but is not as natural or smooth as that of larger synthesizers based on human speech recordings. Its limitations are that native speakers are not involved in development and that the Malayalam phonemes used at present are not clear enough for the spoken text to be fully comprehended.

It is observed that the development of TTS for Indian languages is a difficult task, especially for Malayalam, in which the same letters are pronounced in multiple ways. The success of a Malayalam TTS depends not only on addressing this issue but also on incorporating the regional variation in speaking [?]. Not much work on this has been done in eSpeak. Contributors for Latin have tried to improve their language in eSpeak by defining their own phoneme set and dictionary [11]. Letter-to-sound rules have been framed for many languages such as Latin [11], Urdu [5] and Bengali [1], but this has not yet been done for Malayalam.

3. Issues in eSpeak 1.47.10

Several issues were identified in the Malayalam language output of the current version.

• Malayalam contains many words that are pronounced differently from how they are written. Depending on whether a character is the starting, middle or ending letter of a word, it gives different sounds. The pronunciation of certain words in the current version is not correct. Letter-to-sound rules have already been framed and implemented for many languages such as Sinhala, Latin, Arabic, Urdu and Bengali, but not so far for Malayalam.

• The pronunciation of numbers is incorrect. Numbers above one hundred were pronounced with /onnynu:ti/ instead of /orynu:ti/. For example, 1121 was pronounced /a:yiratti onnynu:ti irupatti onny/ instead of /a:yiratti orynu:ti irupatti onny/. The number 500 was pronounced /anju:Ry/ instead of /anû:Ry/. There were problems with 3000, 5000, 8000, 9000, 10000, 13000, 15000, 18000 and 19000. One lakh was pronounced /onny laksham/ instead of /ory laksham/.

• The durations of speech sounds are not appropriate. Appropriate durational variations are required in order to achieve natural-sounding speech.

4. Methodology

As already mentioned, eSpeak is an open source software synthesizer for various foreign as well as Indian languages, including Malayalam. In eSpeak the specific features of a language are captured in easy-to-understand text files, so development is easier since no time is needed to become familiar with the source code. Each language is provided with its own language module, whose file names start with the language name. The developer of eSpeak, Jonathan Duddington, has made provision for native speakers to modify the language modules.

The language module consists of a voice file, a pronunciation dictionary and a phoneme source file. A voice file

specifies a language along with various attributes that affect the characteristics of the voice and how the language is spoken. The phoneme file contains phoneme definitions for the vowels and consonants that the language uses. Some of the phonemes are derived from Hindi. All phonemes are represented in ASCII characters using the Kirshenbaum scheme [10]. A phoneme definition consists of the type (vowel, nasal, etc.), the length or duration in milliseconds, and the formant frequencies. For phonemes that are difficult to synthesize, prerecorded WAV files are used. Once the language's phonemes have been defined, pronunciation dictionary data can be produced in order to translate the input text into phonemes. This consists of two source files: language rules (the spelling-to-phoneme rules) and language list (an exceptions list and attributes of certain words). Since the aim was to improve eSpeak TTS for Malayalam, we were concerned with the Malayalam language module, which consists of the ml_rules, ml_list and ph_malayalam files described in the eSpeak documentation [3].

The letter-to-sound (LTS) rules were framed and implemented, the pronunciation of numbers was corrected, and duration modelling was done to improve the naturalness of the output speech of eSpeak.

1. Framing and implementation of LTS rules
A set of 7 rules was formulated based on the knowledge obtained from Malayalam language experts and computational linguists. These rules were implemented in the ml_rules file and the Malayalam phoneme source file in the eSpeak directory. Rewriting the rules stored in the ml_rules file allows the user to define pronunciations for a single character or a group of characters based on a certain context. The rules are organized in groups and written in a particular syntax, as shown below:

    .group P
    P (_ Ja
    P (w ja
    P Je
    P (B J

where B is a combining vowel sign and '_' denotes that the position of that particular consonant or vowel is at the end of the word. Apart from the capital letter B and '_', there are more such special symbols, which are described in detail in the dictionary file section of the eSpeak documentation [3]. The newly formulated rules were added to the existing rules file.

2. Correcting the pronunciation of numbers
In English, number pronunciation has a regular pattern, but in Malayalam there are several variations. Issues regarding the pronunciation of numbers were discussed in Section 3. They were corrected by adding exceptions to the ml_list file.

3. Duration Modelling
Vowel durations in Malayalam were analyzed using a database created by IIIT Hyderabad [6]. The database consists of around 1000 sentences taken from Malayalam Wikipedia, out of which duration data from about 50 sentences were taken, and the durations of each vowel, based on its position in a word, were analyzed by plotting histograms. The durational variations were incorporated along with the phoneme definitions in the ph_malayalam file.

5. Results and Discussion

1. Letter-to-Sound Rules in Malayalam
Rules were formulated based on the knowledge obtained from Malayalam language experts and computational linguists. Here we mainly describe the formulation of pronunciation rules, which can be broadly classified into rules for consonants and rules for vowels.

• Case /a/
If /a/ is not in the word-final syllable and the succeeding consonant is palatal or alveolar (eg: /aTayum/), or if the preceding consonant is a voiced stop (/ga/, /ja/, /da/, /Da/, /ba/) or /ya/, /ra/, /Ra/, /la/ (eg: /balaM/, /jalaja/), then replace /a/ with the sound /e/.
• Case /u/
/u/ is rounded only in the initial syllable and the final syllable (ammu: rounded; /karuNa/: unrounded). In the middle of a word it is rounded only if the vowel in the preceding syllable is also /u/ (eg: /uDuppy/). If /u/ is not word initial or word final and the preceding vowel (the vowel in the preceding syllable) is not /u/, then replace /u/ with the raised and retracted form of schwa, /y/.
• Case /i/
In the middle of a word, /i/ is changed to a form of schwa and its duration is very short.
• Case /h/
When in a consonant cluster with a nasal, /h/ is not pronounced; instead the nasal is geminated, i.e. replace /h/ with that nasal (eg: /brahmam/, /chihnam/).

• Case /y/
When preceded by a consonant and followed by /a/, /ya/ changes into an opened-up /e/ (eg: /vyasanam/).
• Case /nda/
Post-nasal stops are converted to nasals. In /nandi/, /da/ is replaced with /na/, i.e. it is pronounced /nanni/.
• Case anuswaram
If the anuswaram is not word initial or word final and the succeeding phoneme belongs to the /ka/ vargam, then replace it with the corresponding nasal (eg: /bhaNgi/, /saNgiidam/).

The newly formulated rules were added to the existing rules file (ml_rules). For example, the last rule described above is written as:

    w ) K (B Ni

in the .group K. Here /ga/ was replaced by the corresponding nasal sound.

2. Pronunciation of Numbers
The pronunciation of numbers was corrected by adding exceptions to the ml_list file. The output sounded much better than that of the current version (eSpeak 1.47.10).

3. Vowel Durations
It was observed that vowels at the start of a word have a duration of approximately 240 ms, vowels in the second syllable of a word have a duration of 90 ms, and vowels at word endings have a duration of around 350 ms. The remaining vowels in a word have a duration of approximately 70-80 ms. These durational variations were incorporated along with the phoneme definitions in the ph_malayalam file. Varying the vowel durations in this manner improved the rhythm of speech to some extent. An example of the definition of phoneme /a/ after incorporating the durational variations is shown below.

    phoneme a
      vowel starttype #a endtype #a
      length 70
      IF thisPh(isWordStart) THEN
        length 240
        FMT(vowel/a#_4)
      ELIF thisPh(isStressed) THEN
        length 150
        FMT(vowel/a#_3)
      ELIF thisPh(isWordEnd) THEN
        length 350
        FMT(vowel/a#_3)
      ELSE
        IF thisPh(isSecondVowel) THEN
          length 90
          FMT(vowel/a#_3)
        ENDIF
      ENDIF
      FMT(vowel/a#_3)
    endphoneme

6. Evaluation

Evaluation was done by playing the speech output (in WAV format) of eSpeak 1.47.10 and of the modified version to about 15 listeners. The output WAV files included 14 numbers, 8 words and 5 sentences. Listeners were asked to give scores (between 0 and 5) according to the perceived quality. Finally, the mean opinion score (MOS) was calculated for each of the numbers, words and sentences and tabulated, as shown in Tables 1, 2 and 3.

1. Evaluation of Numbers
First the numbers were evaluated. A set of 14 numbers, as specified in Table 1, was taken. The speech outputs of eSpeak 1.47.10 and the modified version were played to the listeners. The mean opinion score was calculated and tabulated as shown in Table 1.

Sl No   Numbers     MOS (1.47.10)   MOS (modified)
1       1121        1.4             3.7
2       3000        1.57            3.87
3       5000        1.47            3.77
4       8000        1.53            3.97
5       10000       1.53            4.06
6       13000       1.5             3.86
7       13345       1.87            3.96
8       15000       1.33            3.9
9       15789       1.73            3.93
10      18000       1.33            3.6
11      19000       1.4             3.87
12      165234      1.67            3.83
13      1234567     1.97            3.97
14      21635567    1.86            3.7
Table 1. MOS of number outputs of eSpeak 1.47.10 and modified version

Figure 1. Comparison of number outputs of eSpeak 1.47.10 and modified version

2. Evaluation of Words
A set of 8 words, as specified in Table 2, was taken for evaluation. The speech outputs of eSpeak 1.47.10 and the modified version were played to the listeners. The mean opinion score was calculated and tabulated as shown in Table 2.

Sl No   Words       MOS (1.47.10)   MOS (modified)
1       Jalam       2.02            3.93
2       Nanni       1.23            3.83
3       Brammam     1.03            3.567
4       Chinnam     1.06            3.93
5       Bhangi      1.0             3.86
6       Uduppu      1.8             2.9
7       Vyasanam    1.63            2.8
8       Vananira    0.866           1.7
Table 2. MOS of word outputs of eSpeak 1.47.10 and modified version

Figure 2. Comparison of word outputs of eSpeak 1.47.10 and modified version

3. Evaluation of Sentences
A set of 5 sentences, as specified in Table 3, was taken for evaluation. The speech outputs of eSpeak 1.47.10 and the modified version were played to the listeners. The mean opinion score was calculated and tabulated as shown in Table 3.

Sl No   Sentences                                       MOS (1.47.10)   MOS (modified)
1       Jalajaye kanan nalla bhangi und                 1.97            4.03
2       Malayalam keralathile audhyogika bhashayanu     2.13            3.43
3       Nammalk orumich sramich nokam                   1.97            3.13
4       Nammal cheyyunat valare nalla karyamanu         2.3             3.5
5       Ellavarkum nanni ariyikunnu                     2.1             3.9
Table 3. MOS of sentence outputs of eSpeak 1.47.10 and modified version

Figure 3. Comparison of sentence outputs of eSpeak 1.47.10 and modified version

Comparing the output speech of eSpeak 1.47.10 and the modified version, the MOS of the modified version is higher for every number, word and sentence tested, which can be observed from the tabulations and the comparison plots.

7. Conclusion

In this paper we discussed improving the Malayalam speech output of the eSpeak text-to-speech synthesizer. Letter-to-sound rules were formulated and implemented. Several exceptions were added to the list file in order to improve the pronunciation of numbers. The corrected exception list file (ml_list) was sent to Jonathan Duddington, the developer of the eSpeak synthesizer, and is being incorporated in the latest released version (eSpeak 1.47.11.c). We were also able to improve the rhythm of speech to some extent by appropriate modifications to the vowel durations. Even though problems still exist with the pronunciation of the letter /r./ and with the intonation of speech, after the modifications described above the speech output sounded better than that of the previous version (eSpeak 1.47.10). An evaluation was also done to compare the speech outputs of the current (1.47.10) and modified versions, and it was observed that the output of the modified version sounded much better and more natural than that of the current version. Further improvement can be made by defining a new phoneme set specifically for Malayalam and by modifying the intonation of speech.

8. Acknowledgement

References
[1] F. Alam, P. K. Nath, and D. M. Khan. Text to speech for Bangla language using Festival.
[2] D. H. Klatt. Linguistic uses of segmental duration in English: acoustic and perceptual evidence. Journal of Acoustic Society of America, 59.
[3] eSpeak open source software speech synthesizer. http://espeak.sourceforge.net.
[4] D. P. Gopinath and A. S. Nair. Duration analysis and modelling for text to speech synthesis system in Malayalam. Ph.D. thesis, University of Kerala, February 2009.
[5] S. Hussain. Letter-to-sound conversion for Urdu text-to-speech system.
[6] V. K. S. R. Kishore Prahallad, E. Naresh Kumar and A. W. Black. The IIIT-H Indic speech databases. In Proceedings of Interspeech 2012, Portland, Oregon, USA, 2012.
[7] N. S. Krishna and H. A. Murthy. Duration modelling of Indian languages Hindi and Telugu. Journal of Computer Speech and Language, 21:282–295, 2007.
[8] K. Rao and B. Yegnanarayana. Modeling syllable duration in Indian languages using neural networks. Journal of Computer Speech and Language, 21:282–295, 2007.
[9] J. P. H. Santen. Assignment on segmental duration in text to speech synthesis. Journal of Computer Speech and Language, 8:95–128, 1994.
[10] Kirshenbaum scheme. http://www.kirshenbaum.net/ipa/ascii-ipa.pdf.
[11] J.-W. van and L. M. Tromp. Latin text-to-speech. 2007.
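
As an illustration of how two of the letter-to-sound rules from Section 5 behave, the following Python sketch applies the /h/-before-nasal rule and the post-nasal stop rule to plain romanized strings. It is not eSpeak rule syntax and not the authors' implementation: the function names are hypothetical, only /m/ and /n/ are treated as nasals, and only the /nd/ cluster is handled.

```python
import re


def geminate_nasal_after_h(word: str) -> str:
    """Case /h/: drop /h/ before a nasal and double that nasal.

    Assumption: only 'm' and 'n' are treated as nasals in this
    simplified romanization.
    """
    return re.sub(r"h([mn])", r"\1\1", word)


def assimilate_post_nasal_stop(word: str) -> str:
    """Case /nda/: a /d/ that follows /n/ is pronounced as /n/."""
    return word.replace("nd", "nn")


def apply_rules(word: str) -> str:
    """Apply both illustrative rules in a fixed order."""
    return assimilate_post_nasal_stop(geminate_nasal_after_h(word))


if __name__ == "__main__":
    for w in ("brahmam", "chihnam", "nandi"):
        print(w, "->", apply_rules(w))
```

For the paper's own examples, the sketch maps brahmam, chihnam and nandi to brammam, chinnam and nanni, the forms listed among the evaluated words in Table 2.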

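
The position-dependent vowel lengths reported in Section 5 (about 240 ms at word start, 90 ms for the second vowel, 350 ms at word end, 70-80 ms elsewhere) can be summarized as a small decision function. The sketch below mirrors the branch order of the phoneme /a/ listing, including its 150 ms stressed branch. The class and function names are hypothetical, and eSpeak evaluates these conditions in its own phoneme engine rather than in Python.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VowelContext:
    """Flags loosely modelled on the thisPh(...) tests in the listing."""
    is_word_start: bool = False
    is_stressed: bool = False
    is_word_end: bool = False
    is_second_vowel: bool = False


def vowel_length_ms(ctx: VowelContext) -> int:
    """Return the vowel length in milliseconds for a given context.

    The branch order follows the phoneme /a/ definition: word start,
    then stressed, then word end, then second vowel, then the default.
    """
    if ctx.is_word_start:
        return 240
    if ctx.is_stressed:
        return 150
    if ctx.is_word_end:
        return 350
    if ctx.is_second_vowel:
        return 90
    return 70  # default length; the paper reports 70-80 ms


if __name__ == "__main__":
    print(vowel_length_ms(VowelContext(is_word_end=True)))
```

Because the branches are checked in order, a vowel that is both word-initial and stressed gets the word-start length in this sketch, as it would in the listing.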

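
The mean opinion score used in Section 6 is the arithmetic mean of the listeners' scores for one stimulus. A minimal Python sketch is given below. The function name and the range check are assumptions made for the example, the paper does not publish per-listener scores, and the two lists simply copy the per-number MOS values from Table 1 so that the columns can be averaged.

```python
from statistics import fmean


def mean_opinion_score(scores):
    """Average listener scores for one stimulus (scale 0 to 5)."""
    if not scores:
        raise ValueError("at least one score is required")
    if any(s < 0 or s > 5 for s in scores):
        raise ValueError("scores must lie between 0 and 5")
    return fmean(scores)


# Per-number MOS values copied from Table 1.
TABLE1_ORIGINAL = [1.4, 1.57, 1.47, 1.53, 1.53, 1.5, 1.87,
                   1.33, 1.73, 1.33, 1.4, 1.67, 1.97, 1.86]
TABLE1_MODIFIED = [3.7, 3.87, 3.77, 3.97, 4.06, 3.86, 3.96,
                   3.9, 3.93, 3.6, 3.87, 3.83, 3.97, 3.7]

if __name__ == "__main__":
    print("eSpeak 1.47.10:", round(fmean(TABLE1_ORIGINAL), 2))
    print("modified:      ", round(fmean(TABLE1_MODIFIED), 2))
```

Averaging a column of per-item MOS values, as done here, gives a single summary figure for each version; it is an additional summary and not a statistic reported in the paper.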