Unit Selection Based Text-to-Speech Synthesizer For Tigrinya Language
Abstract
This paper describes the development of the first unit selection based Text-to-Speech (TTS) system for Tigrinya using the Festival framework, together with practical applications of it. We describe the construction of the unit database and the implementation of the natural language processing modules. A unit selection based approach generates speech by selecting suitable units from a speech corpus and concatenating them. In this approach, a set of features is defined to describe both the speech units in the corpus and the expected units in the utterance to be synthesized. The main task in this work was the development of a concatenative unit selection voice that uses the phone as its basic unit. We used a speech corpus of 4 hours, 38 minutes and 29 seconds, labelled at the phoneme level.
We describe the implementation and evaluation of a grapheme-to-phoneme (G2P) conversion model for a Tigrinya TTS system. Letter-to-sound conversion for Tigrinya is, for most Tigrinya letters, a simple one-to-one mapping between orthography and phonemic transcription. An automatic clustering technique is used to cluster units based on their phonetic and prosodic context. After constructing the inventory of phonetic, prosodic and acoustic features for each phone, we adopted the Festival speech synthesis system so that the synthesizer could use the cluster unit selection algorithm. At synthesis time, units are selected from the clusters so as to minimize acoustically defined target and join costs.
The system is evaluated for naturalness and intelligibility of speech using the mean opinion score (MOS), one of the most popular testing techniques in speech synthesis. The test results indicate that almost all of the words and sentences are recognizable: on average, 97.1% of the sentences are correctly recognized by the listeners. The naturalness of the synthesized speech suggests that the proposed approach is appropriate.
Keywords: Text-to-Speech Synthesis; Concatenative Speech Synthesis; Unit selection based speech synthesis;
Syllable based concatenation; Consonants and Vowels
voice building process. Section 5 shows the results of perceptual testing, and finally the conclusion and recommendations are given in Sections 6 and 7.
2. The Tigrinya Language
The script of Tigrinya is phonetic in nature. It has 39 consonants and 7 vowels [5, 9]. The orthographic representation of the language is organized into orders. Each of the 39 consonants has seven orders (derivatives). Six of the orders are CV combinations, while the remaining one is the consonant itself. The way Tigrinya orthographic characters are written is very similar to the way they are spoken; in other words, Tigrinya is a phonetic language. The mapping between the written form and the spoken form is one to one, except for the epenthetic vowel. Characters representing the same consonant followed by different vowels are similar in shape, for example the characters representing /he/, /hu/, /hi/, /ha/, /hie/, /h/ and /ho/.
The total number of orthographic symbols of the language exceeds 273. Like other languages, Tigrinya also has its own typical phonological and morphological features. Among these, we found gemination of consonants and the use of the automatic epenthetic vowel to be critical for naturalness in Tigrinya speech synthesis. The Tigrinya language has a special property in its spoken form (a CV or CVC sequence in the acoustic form of the orthographic representation).
2.1 Phonology of Tigrinya
Phonology is the study of the distribution and patterning of speech sounds in a language and of the tacit rules governing pronunciation [4]. In phonology, the phoneme is the fundamental unit that describes how speech conveys linguistic meaning. The phoneme represents a class of sounds that convey the same meaning. The meaning of a word is dependent on the phonemes that it contains [4].
Tigrinya has a fairly typical set of phonemes for an Ethiopian Semitic language, that is, a set of ejective consonants and the usual seven-vowel system. Unlike many of the modern Ethiopian Semitic languages, Tigrinya has preserved the two pharyngeal consonants which were apparently part of the ancient Ge'ez language and which, along with [x'], a velar or uvular ejective fricative, make it easy to distinguish spoken Tigrinya from related languages such as Amharic [9]. These are exceptional characteristics relative to Amharic, besides their accent, manner and place of articulation.
Figure 1: Tigrinya Syllabic Structure
A syllable in Tigrinya is made up of only /cv/ (a consonant + a vowel) or /cvc/ (a consonant + a vowel + a consonant). The vowel is the syllabic nucleus, while the first and the last consonants of the syllable are the onset and the coda respectively [9]. A native Tigrinya speaker, for example, can divide the words (sabara) and (biili) into three and two syllables respectively. For example, /qaatala/ (Figure 2) is one word which has three syllables in it. Some syllables of Tigrinya have a nucleus (or peak) and an onset, while other syllables have a coda in addition to an onset and a peak. Observe the following:
Figure 2: [tree diagram of the word q a t a l a, with the tiers Word, RH, and O N C above the segments]
Moreover, each onset and coda position is occupied by a consonant, whereas a nucleus position is occupied by a vowel.
2.2 Gemination
Longer duration of identical segments, adjacent consonants or vowels that are the same can form
HiLCoE Journal of Computer Science and Technology, Vol. 1, No. 1 15
gemination. In Tigrinya, a sequence of vowels is not permissible. Whenever sequences of vowels occur, either one of the vowels must be deleted or epenthetic segments are inserted between the vowels. However, we do find geminated Tigrinya segments, with the exception of laryngeals and pharyngeals, which may be geminated in only very limited environments, as indicated in [9].
Consonant gemination may bring meaning differences in words. Compare /zawara/ 'he got roaming' and /zawwara/ 'he drove', or /halifu/ 'he passed' and /hallifu/ 'he excelled'. In each pair, a geminated or ungeminated medial consonant brings a difference in meaning.
2.3 Insertions
Insertion is one way of arriving at a well-formed or acceptable assignment of syllable structure. The syllable structure of Tigrinya is either /cv/ or /cvc/. Insertion, unlike deletion, is the appearance of new elements in a formerly unoccupied position. The epenthetic (inserted) segments may appear word initially, word medially or word finally. There are several Tigrinya epenthetic segments, vowels and consonants, in different positions. Observe the following:
asraha 'he made others to work'
Awassaxom 'he made them to add'
The morpheme /a/ is added to the root consonants /srh/:
zii + asriih-a → zasriiha
zii + awassaxa → zawassaxa
3. Literature Review
In [7], the Indian Natural Language Processing Lab, Centre for Development of Advanced Computing (CDAC), uses multiform speech units to develop speech synthesis. It primarily uses syllables and phonemes. The speech corpus contains the most frequent words and initials. They have been segmented and labelled into different speech units as required for the development of a Hindi speech synthesis system. This research does not indicate whether any kind of implementation has been made. Additionally, it is not known to have been tested or proven in any related manner.
Alam et al. [8] proposed a TTS system that creates the voice data for Festival and additionally extends Festival, through its embedded Scheme scripting interface, to incorporate Bangla language support. The researchers' TTS implementation used two different kinds of concatenative methods supported in Festival: unit selection and multisyn unit selection [8].
In their future work, the researchers indicated that a number of tasks remain in order to develop a complete TTS system for the Bangla language, including the following: document analysis, text analysis, phonetic analysis, developing a pronunciation lexicon with a large number of entries, adding lexicon entries automatically instead of manually, finding LTS or Grapheme-to-Phoneme (G2P) rules (so that the system can handle unknown words), prosody analysis, and waveform synthesis by the diphone technique. In conclusion, the researchers observed that unit selection and multisyn unit selection have a drawback because they require a large speech corpus.
Eker [6] carried out research which exploits the Turkish language structure and tried to implement a system that takes a text as its input. It assumes that the text consists of words and processes it word by word [6]. When a word is obtained from the text, it is passed to a unit that processes the word as text and produces the corresponding speech. This part separates the word into diphones; using a diphone database, it gets the speech file corresponding to each diphone and its pitch value. Finally, it concatenates the previously recorded speech segments using the PSOLA algorithm to produce sound. As future work, the researcher recommended that the first thing to be done is to complete the diphone database and run more experiments on words. The produced output is acceptable for short sentences, but it requires much time for long sentences. Therefore, in order to have a real-time reading system, the system should be faster. In conclusion, this means that the method used in this paper is an applicable one which with some effort on completing and preparing a
better diphone database will result in a system that will produce more understandable output for all Turkish words.
We have observed that few research attempts have been made on local languages. One of the few attempts was made by Sebsibe H/Mariam et al. in [13]. Their focus was on the issues that need to be considered in developing a concatenative speech synthesizer for the Amharic language. The complexity of the syllable structure of the language, the phonetic nature of the language, and the result of the perceptual test of the synthesizer are discussed. The researchers explored the nature of the Amharic script representation of the phone set, Amharic syllable structure and syllabification rules, and showed the voice building process. Having noted that the quality of the speech synthesizer for Amharic was not high, they recommended future work on improving the quality. They suggested that this can be done by:
1) Proper selection of unit. Since the language is phonetic, the syllable as a basic unit may outperform the phone as a basic unit.
2) Optimal selection of corpus, which proportionally covers all basic units and variations, will give better quality.
Based on the review made so far and to the knowledge of the researchers, none of the works so far has tried to design a grapheme to phoneme converter or a letter to sound speech synthesizer for Tigrinya. None of them shows a prototype for natural sounding Tigrinya speech that accepts normalized Tigrinya text and generates prosodic features (i.e., intonation, stress) using a syllable based approach. The main focus in this work is to find the proper quality speech corpus, which matches the quality of synthetic speech from synthesizers including linguistic tasks, and to develop naturally sounding text to speech for the Tigrinya language.
4. Methodology
After an extensive literature review regarding concatenative speech synthesis methods, unit selection concatenative synthesis is found to be the most popular method of performing speech synthesis recently. It differs from older types of synthesis by generally sounding more natural and spontaneous than formant synthesis or diphone based concatenative synthesis. Unit selection synthesis has been reported to score higher than other methods in listener ratings of quality, but it involves the tedious recording of many hours of speech by a single speaker.
In this research we explored the nature of the Tigrinya script representation of the phone set, the letter to sound rules, and the Tigrinya syllable structure and syllabification rules that show the voice building process. To do this research we used the following:
- A transliteration scheme, based on the orthographic ordering of the script and the acoustic similarity of the letters, was defined using ASCII characters.
- In Festvox, the phone set of the language is described with the corresponding features like voicing, tongue position, tongue height, place of articulation, and manner.
- Experiments on the phonology of Tigrinya words and the phone set; we tried to cover all phonemes and defined a transliteration scheme using ASCII characters.
5. Design and Integration of Tigrinya Unit Selection into the Festival Framework
The speech inventory is divided into clusters, where each cluster holds units of the same phone class based on their phonetic and prosodic context. An outline of the steps to build a unit selection synthesizer is given below. A more detailed description of the same is available in [10, 13].
- Design speech and text corpus
- Creating LTS rules and phone set
- Building utterance structures
- Generating speech unit clusters
- Building the unit synthesizer
What has been done in each step to build a unit selection voice for Tigrinya is explained below.
Tigrinya proverbs, articles, newspapers, magazine and bible sentences were collected from different sources as primary data. We selected a native
speaker of the language and recorded this male speaker in a quiet environment using Praat. We used WaveSurfer for manual labelling of the recorded voices. In this paper, we built a corpus of around 13171 words. The script of this speech corpus is selected from a large text corpus (around 84000 characters). The corpus is designed to cover the frequently used syllables and contexts as much as possible.
The input to the TTS system is the transliteration of a text in Tigrinya. The pronunciation generation module generates the sequence of basic units using a lexicon of units and letter-to-sound rules. The lexicon is a list of all speech units (monosyllables, bi-syllables and tri-syllables) present in the waveform repository. The letter-to-sound rules are framed in such a way that each word is split into its largest constituent syllable units. As the pronunciation of most of the words in Tigrinya can be predicted from their orthography, these rules suffice to generate correct pronunciations. The unit selection algorithm generates a target specification for the speech units that have been identified and picks the best sequence of speech units, the one that minimizes both the target cost and the join cost. The waveforms of these speech units are then concatenated to produce synthetic speech.
5.1 System Architecture
As the system architecture in Figure 3 shows, the synthesizer has text analysis and speech synthesis parts. The text analysis part uses the grapheme to phoneme converter to match each word to its pronunciation, whereas the synthesis part selects the best sequence of units for the target specification produced at the end of text analysis, and finally generates the speech from the speech parameters.
- Defining the phone-set of the language
- Tokenization and text normalization
- Incorporation of letter-to-sound rules
- Incorporation of syllabification rules
- Assignment of stress patterns to the syllables in the word
- Assignment of duration to phones
- Generation of f0 contour
Once a speech repository is in place, the repository is integrated with the Festival framework.
An outline of the steps to build a unit selection synthesizer is given below. A more detailed description of the same is available in [1, 2, 6, 10].
Figure 3: The system architecture of the Tigrinya speech synthesizer using cluster unit selection
5.2 Description of the Implementation Design
After we collected the speech and text corpus, the next step was to check the recorded utterances against the transcription text in order to design the prompt list in Festival format and correct the labels manually. Appropriate modifications were made to get them ready to be used in the voice building process. By doing so the speech and text corpora were built.
5.2.1 Labelling the Utterance
Labelling is the process of giving a label to each speech signal in the utterance; its output is the labelled utterance. Unit selection synthesizers are highly sensitive to the accuracy of labelling. Bad labels will adversely affect the quality of synthesis in a number of ways [2, 13]. The phone label itself can be incorrect, potentially causing the wrong word to be said, or said with an undesired accent. Although manual labelling is time consuming and laborious, as part of our efforts to improve speech synthesis we labelled the speech database thoroughly using a tool called WaveSurfer.
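To make the selection step of this section more concrete, the search for the unit sequence with the smallest combined target and join cost can be sketched as a small dynamic-programming (Viterbi-style) search. This is only an illustrative sketch and not Festival's cluster unit selection code: the function name `select_units`, the tuple layout of a unit, and the pitch-based cost functions are assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

U = TypeVar("U")  # a candidate unit; its representation is left open


def select_units(
    candidates: Sequence[Sequence[U]],
    target_cost: Callable[[int, U], float],
    join_cost: Callable[[U, U], float],
) -> Tuple[float, List[U]]:
    """Return (total cost, chosen units) for the cheapest unit sequence.

    candidates[i] holds the candidate units for target position i.
    The total cost is the sum of all target costs plus the join costs
    between consecutive chosen units.
    """
    if not candidates or any(len(c) == 0 for c in candidates):
        raise ValueError("every target position needs at least one candidate")

    # cost[j]: cheapest total cost of any path ending in candidate j
    cost = [target_cost(0, u) for u in candidates[0]]
    backpointers: List[List[int]] = []

    for i in range(1, len(candidates)):
        prev_units = candidates[i - 1]
        new_cost, pointers = [], []
        for u in candidates[i]:
            # Best predecessor of u: accumulated cost plus join cost.
            k = min(range(len(prev_units)),
                    key=lambda p: cost[p] + join_cost(prev_units[p], u))
            new_cost.append(cost[k] + join_cost(prev_units[k], u)
                            + target_cost(i, u))
            pointers.append(k)
        cost = new_cost
        backpointers.append(pointers)

    # Trace back from the cheapest final candidate.
    j = min(range(len(cost)), key=lambda p: cost[p])
    total = cost[j]
    path = [j]
    for pointers in reversed(backpointers):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return total, [candidates[i][j] for i, j in enumerate(path)]


# Toy data (hypothetical): a unit is (name, pitch at start, pitch at end) in Hz.
candidates = [
    [("q_1", 100, 110), ("q_2", 120, 125)],
    [("a_1", 112, 115), ("a_2", 140, 150)],
    [("t_1", 150, 150), ("t_2", 116, 118)],
]
target_pitch = [105, 115, 118]  # assumed target specification


def target_cost(i, unit):
    # Distance between the unit's mean pitch and the target pitch.
    return abs((unit[1] + unit[2]) / 2 - target_pitch[i])


def join_cost(left, right):
    # Pitch discontinuity at the concatenation point.
    return abs(left[2] - right[1])


total, chosen = select_units(candidates, target_cost, join_cost)
print(total, [name for name, _, _ in chosen])  # 5.5 ['q_1', 'a_1', 't_2']
```

In a real voice the costs would be computed from the phonetic, prosodic and acoustic features stored for each unit rather than from pitch alone; the toy costs above only show how the two cost terms interact during the search.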
5.2.2 Creating letter-to-sound rules and phone-set
A comprehensive set of letter-to-sound rules was created to syllabify the input text into syllable-like units. These rules are framed in such a way that each word is split into its largest constituent syllable unit. The phone set, which is a list of the basic sound units for Tigrinya that the synthesizer supports, was created by enumerating all the speech units identified in the syllabification process.
5.2.3 Incorporation of Tigrinya Phone set and Grapheme to Phoneme Converter
The phone-set definition is the first text analysis module, in which every phoneme of the alphabet is classified according to phone features like consonant voicing and vowel height. The second text analysis module is the lexicon module.
The Tigrinya phone set is incorporated in Festival together with its characterizing features. Each phone has eight features that describe how the vocal organs behave when the sound is uttered. These features are vowel/consonant identification, consonant voicing, place of articulation, consonant type, vowel length, vowel height, vowel frontness, and lip rounding.
The grapheme to phoneme converter is used to convert an orthographic text into its corresponding phonetic representation. After incorporation of the Tigrinya phone set and the Tigrinya grapheme to phoneme converter into Festival, it provides the label files for each sentence in the prompt list. We made manual label corrections using the automatically generated labels along with the corresponding wave files.
We implemented the grapheme to phoneme conversion architecture by modifying the syllabification algorithm proposed in [11]. A C# based syllabification program with a graphical interface was implemented and then modified into a C++ command line based G2P system, as required by the Festival tools.
5.2.4 Building utterance structures for the database
The utterance structure holds all the relevant phonetic and prosodic information related to a speech unit. The phonetic information in an utterance structure describes the position of the speech unit in the word in which it appears and the units adjacent to it. The prosodic information describes the duration and pitch of the unit. Festival provides relevant scripts for building utterance structures for each speech unit.
5.2.5 Generating speech unit clusters
The process includes building coefficients for acoustic distances (MFCC, F0 and energy coefficients), creating distance tables for each class of units based on acoustic distances, and generating features for building CART trees.
5.2.6 Building the unit synthesizer
Using the letter-to-sound rules, phone set and clusters of each speech unit built in the previous steps, Festival generates, using appropriate scripts, the files that are needed along with the core Festival speech synthesizer to build a unit selection synthesizer for Tigrinya.
The rules of the language for epenthetic vowel insertion, gemination and syllabification affect speech synthesis: the grapheme to phoneme converter uses them to determine the pronunciation of a given Tigrinya word from its spelling.
The epenthetic vowel insertion and gemination rules for Tigrinya are adopted from [11] and modified as follows:
1. Accept input words and scan from left to right.
2. If a consonant cluster occurs in word-initial position, insert an epenthetic vowel between the consonants.
   Exception: if the first phoneme is a consonant and the next consonant is glide /w/, pharyngeal /h/ or plain /x/ (rule #1).
3. If three consonants appear in sequence word medially or word finally, insert an
epenthetic vowel before the third consonant (rule #2).
   Exception: if the sonority of the middle consonant is greater than that of the others, insert the epenthetic vowel after the first consonant.
4. If a consonant cluster contains a geminate and a singleton in sequence, insert an epenthetic vowel after the geminated consonant (rule #3).
5. If a consonant cluster contains a singleton and a geminate in sequence, insert an epenthetic vowel after the singleton consonant (rule #4).
6. If a consonant cluster contains two different geminates in sequence, insert an epenthetic vowel between the two geminate consonants (rule #5).
7. If the sonority of the final consonant is greater than that of the preceding consonants, the epenthetic vowel is inserted between the final consonant clusters (rule #6).
8. If a consonant cluster occurs in word-final position, insert the epenthetic vowel /i/.
9. Repeat steps 2 to 7 until all the phonemes in the phoneme list are parsed.
6. Perceptual Evaluation and Experimental Results
Perceptual evaluation is essential to determine the quality of synthesized speech [13, 14]. The perceptual evaluation in this paper investigates the naturalness and intelligibility of the Tigrinya TTS. In this research work the mean opinion score (MOS) is used to test the output of the synthesized speech. MOS is an evaluation technique where evaluators indicate their assessments on a scale ranging from bad (1) to excellent (5). The average of the scores given is then taken as the performance of the system [6, 12, 14].
As stated in the above section, the impact on perception is assessed by the average of the scores given by native speakers at the end of their perceptual judgment of the synthetic speech produced by the synthesizer. The perceptual tests were carried out to evaluate the extent of naturalness and intelligibility of the synthetic speech generated by the speech synthesizer.
6.1 Data Preparation and Prototype Testing
We conducted perceptual tests with 6 native speakers of Tigrinya: 2 females and 4 males. All subjects are between 30 and 60 years old. Each subject listened to all of the 6 sentences, of various lengths, selected from the data set used in the voice construction and gave his/her rating for the naturalness and intelligibility of the speech. They evaluated the quality of the speech output by giving a measure of quality.
Based on the results, we can conclude that proper selection of units by the TTS plays a great role in the perceived naturalness and intelligibility of synthetic speech.
The results show that, regarding the question of whether the voice is good to listen to or not, 38.8% considered the voice very good, 58.3% thought that the voice was good and 2.7% considered the voice unnatural. From the result it is clear that about 97.1% of the listeners found it to be acceptable, and none of the listeners found it to be excellent, fair or very poor. In general, the output was acceptable to most of the listeners. The previous work on unit selection for the Bangla language in the Festival framework [8] reported an average score of 90.1% at sentence level; compared with that work, our result is an improvement of about 7 percentage points [8].
6.2 Summary
Even though the experiment was conducted on a small scale, the results obtained are promising. They appear to indicate that, with an ever increasing size of speech database, the unit selection synthesizer would be able to produce natural speech with high flexibility and intelligibility. However, not every feature of the Tigrinya language was considered. This paper has achieved promising results by defining:
- A transliteration scheme to work with Tigrinya scripts
- An incorporated phone set, syllabification rules, and letter to sound rules
MSc Thesis, Faculty of Informatics, Addis Ababa University, Ethiopia, 2011.
[12] Sebsibe H/Mariam, S P Kishore, Alan W Black, Rohit Kumar, and Rajeev Sangal, Unit Selection Voice for Amharic using Festvox, 5th ISCA Speech Synthesis Workshop, Pittsburgh, pp. 103-107, 2005.
[13] Yonas Demeke, Duration modeling of phonemes for Amharic text to speech system, MSc Thesis, Faculty of Informatics, Addis Ababa University, Ethiopia, 2011.
[14] Hyunsong Chung, Duration Models and the Perceptual Evaluation of Spoken Korean, Proceedings of ISCA Archive, France, 2002.
[15] Alan W Black and Kevin A. Lenzo, Building Synthetic Voices, For FestVox 2.1 Edition, 2007.