
CHAPTER-3

THEORY OF TTS

Contents

3.1 Introduction
3.2 Sound elements for speech synthesis
    3.2.1 Classification of speech
    3.2.2 Elements of a language
3.3 Methods and approaches to speech synthesis
3.4 Language study
    3.4.1 The consonants
    3.4.2 Vowels
    3.4.3 Consonant conjuncts
3.5 Present scenario of TTS systems
    3.5.1 DECtalk
    3.5.2 Bell Labs text-to-speech
    3.5.3 Laureate
    3.5.4 SoftVoice
    3.5.5 CNET PSOLA
    3.5.6 ORATOR
    3.5.7 Eurovocs
    3.5.8 Lernout & Hauspie's
    3.5.9 Apple PlainTalk
    3.5.10 Silpa


CHAPTER-3

THEORY OF TTS

3.1 INTRODUCTION:

 Speech synthesis is the artificial production of human speech, usually by means of computers. Speech synthesis systems are often called text-to-speech (TTS) systems in reference to their ability to convert text into speech. The most important qualities of a speech synthesis system are naturalness and intelligibility: naturalness describes how closely the output sounds like human speech, and intelligibility is the ease with which the output is understood.

 The ideal speech synthesizer is both natural and intelligible, and speech synthesis systems usually try to maximize both characteristics. The two primary technologies for generating synthetic speech waveforms are concatenative synthesis and formant synthesis. Each technology has its own strengths and weaknesses, and the intended use of a synthesis system typically determines which approach is used. Figure 3.1 shows the speech synthesis technologies and their sub-types.

Fig. 3.1: Speech synthesis technologies. The main technologies are concatenative synthesis, formant synthesis, articulatory synthesis and HMM-based synthesis; concatenative synthesis is further divided into word synthesis, syllabic synthesis, hybrid synthesis and unit selection synthesis.


Different methods used to produce synthesized speech can be classified into three
main groups:
i) Articulatory synthesis: This method attempts to model the human speech production system by controlling the speech articulators (e.g. jaw, tongue, lips). Articulatory synthesis is based on physical models of the human speech production system. Due to the limited knowledge of the complex human articulatory organs, articulatory synthesis has not yet led to high-quality speech synthesis.

ii) Formant synthesis: This method models the pole frequencies of the speech signal, or the transfer function of the vocal tract, based on the source-filter model. Formant speech synthesis is based on rules which describe the resonant frequencies of the vocal tract. Because it is difficult to estimate the vocal tract model and source parameters accurately, formant-synthesized speech tends to sound unnatural.

iii) Concatenative synthesis: This method uses prerecorded samples of different lengths derived from natural speech. Stored basic speech units (segments) are concatenated into word sequences according to a pronunciation dictionary, and special signal processing techniques are used to smooth the unit transitions and to model the intonation. This method needs a large database and produces speech distortion at the concatenation points. However, concatenative synthesis produces more natural speech output than the other methods, and the distortion at the concatenation joints can be reduced with spectral distortion reduction methods [2].
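The smoothing of unit transitions mentioned above can be illustrated with a short crossfade at each joint. The following is a minimal Python sketch, not the system described in this chapter; the unit waveforms, sampling rate and fade length are hypothetical stand-ins.

import numpy as np

def crossfade_concatenate(units, fs=16000, fade_ms=10):
    """Concatenate speech units, smoothing each joint with a linear crossfade.

    units: list of 1-D numpy arrays (prerecorded unit waveforms).
    fade_ms: overlap length at each joint, in milliseconds.
    """
    fade = int(fs * fade_ms / 1000)
    out = units[0].astype(float)
    ramp = np.linspace(0.0, 1.0, fade)
    for unit in units[1:]:
        unit = unit.astype(float)
        # Overlap the tail of the output with the head of the next unit.
        out[-fade:] = out[-fade:] * (1.0 - ramp) + unit[:fade] * ramp
        out = np.concatenate([out, unit[fade:]])
    return out

# Toy demo: three sine-burst "units" standing in for recorded segments.
fs = 16000
t = np.arange(int(0.2 * fs)) / fs
units = [np.sin(2 * np.pi * f * t) for f in (220.0, 330.0, 440.0)]
speech = crossfade_concatenate(units, fs=fs)
print(speech.shape)  # number of samples in the joined waveform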

 Speech synthesis is used in spoken dialog systems, applications for blind and visually-impaired persons, telecommunication applications, eyes-free and hands-free applications, speech enhancement, train/flight announcement systems, speaker recognition and validation, text-to-speech systems, blind source separation, voice activity detection and voiced/unvoiced decoding.


 The simple speech interface, which plays an important role in human-computer interaction, answers a considerable social need. The main directions of speech interaction are recognition and synthesis technologies: speech recognition is regarded as the input from human to machine, while, in contrast, speech synthesis is the machine's output technology.

 Most of the information in the digital world is accessible only to the few who can read or understand a particular language. Language technologies can provide solutions in the form of natural interfaces, so that digital content can reach the masses and facilitate the exchange of information among people speaking different languages.

 These technologies play a crucial role in multi-lingual societies such as India, which has about 1652 dialects/native languages. While Hindi, written in the Devanagari script, is the official language, the other 17 languages recognized by the constitution of India are: 1) Assamese 2) Tamil 3) Malayalam 4) Gujarati 5) Telugu 6) Oriya 7) Urdu 8) Bengali 9) Sanskrit 10) Kashmiri 11) Sindhi 12) Punjabi 13) Konkani 14) Marathi 15) Manipuri 16) Kannada and 17) Nepali.

 Seamless integration of speech recognition, machine translation and speech synthesis systems could facilitate the exchange of information between two people speaking two different languages. The basic units of the writing system in Indian languages are characters, which are an orthographic representation of the speech sounds.

 A character in an Indian language script is close to a syllable and can typically be of the following form: C, V, CV, VC, CCV or CVC, where C is a consonant and V is a vowel. All Indian language scripts have a common phonetic base. A universal phone-set consists of about 35 consonants and about 18 vowels.
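To make the syllable forms concrete, the following minimal Python sketch maps a romanized syllable to its C/V pattern. The Latin-letter vowel set is an illustrative assumption; a real system would operate on the script's own characters rather than a romanization.

# Classify a romanized syllable into the C/V patterns listed above.
VOWELS = set("aeiou")

def cv_pattern(syllable):
    return "".join("V" if ch in VOWELS else "C" for ch in syllable.lower())

for s in ["a", "ka", "ak", "kra", "kat"]:
    print(s, "->", cv_pattern(s))
# a -> V, ka -> CV, ak -> VC, kra -> CCV, kat -> CVC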


 The scripts of Indian languages are phonetic in nature: there is a more or less one-to-one correspondence between what is written and what is spoken. However, in Hindi and Marathi the inherent vowel (short /a/) associated with a consonant is, depending on the context, not pronounced. This is referred to as inherent vowel suppression (IVS) or schwa deletion [3].
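As a rough illustration of schwa deletion, the toy sketch below drops only the word-final inherent vowel from a romanized word. Real IVS rules are context dependent and considerably richer than this [3]; the function and examples here are illustrative assumptions.

def suppress_final_schwa(word):
    """Toy inherent vowel suppression: delete a word-final 'a' after a
    consonant, the most common case, while keeping it in short words."""
    vowels = set("aeiou")
    if len(word) > 2 and word.endswith("a") and word[-2] not in vowels:
        return word[:-1]
    return word

print(suppress_final_schwa("kamala"))  # kamal (as pronounced)
print(suppress_final_schwa("na"))      # na (schwa kept)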

 The main goal in processing the speech signal is to obtain a more convenient or more useful representation of the information it contains. Time-domain processing methods deal directly with the waveform of the speech signal, which can be represented in terms of time-domain measurements such as the average zero-crossing rate, energy and the autocorrelation function.

 These representations make digital processing simpler. In particular, the amplitude of unvoiced segments is generally much lower than the amplitude of voiced segments. Thus simple time-domain processing techniques are capable of providing useful representations of such signal features as intensity, excitation mode, pitch and possibly even vocal tract parameters such as formant frequencies.
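The time-domain measurements mentioned above (energy, zero-crossing rate and autocorrelation) are straightforward to compute per frame. The following Python sketch uses a hypothetical frame size and toy signals; it only illustrates that a low-frequency voiced-like tone yields high energy and low ZCR while a noise burst yields the opposite.

import numpy as np

def frame_features(x, fs, frame_ms=25):
    """Short-time energy, zero-crossing rate and autocorrelation per frame."""
    n = int(fs * frame_ms / 1000)
    feats = []
    for start in range(0, len(x) - n + 1, n):
        frame = x[start:start + n].astype(float)
        energy = np.sum(frame ** 2)
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)
        # Autocorrelation, normalised so that lag 0 equals 1.
        ac = np.correlate(frame, frame, mode="full")[n - 1:]
        ac = ac / (ac[0] + 1e-12)
        feats.append((energy, zcr, ac))
    return feats

fs = 16000
t = np.arange(fs) / fs
voiced = np.sin(2 * np.pi * 120 * t)    # 120 Hz: a typical male pitch
unvoiced = 0.1 * np.random.randn(fs)    # low-energy noise stand-in
for name, sig in [("voiced", voiced), ("unvoiced", unvoiced)]:
    e, z, _ = frame_features(sig, fs)[0]
    print(f"{name}: energy={e:.2f} zcr={z:.3f}")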

 State-of-the-art speech synthesis systems demand high overall quality. However, synthesized speech still lacks naturalness. To achieve more natural-sounding speech synthesis, not only is the construction of a rich database important, but precise alignment is also vital. This research work increases the naturalness of concatenative TTS systems. To improve the overall performance of the TTS system, this work focuses on:

1) More natural output in the current Marathi TTS through context-based speech synthesis using IMF (initial, middle, final) syllable positions.
2) A hybrid model of synthesis (words and syllables are used in database preparation) to improve the overall performance of the TTS system.


3) Linguist guidance for contextual analysis and database preparation. Optimization of the database and consideration of proper contexts resulted in improved natural speech synthesis output.
4) Position-based syllabification with the help of neural and non-neural syllable formation approaches.
5) Spectral mismatch calculation and reduction for improving the joint speech quality of the hybrid TTS system (a sketch of one such joint-mismatch measure follows this list).
6) Paragraphs based on contextual analysis and language rules, which serve as good testing models for this Marathi TTS system.
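As a sketch of the joint-mismatch measure referred to in point 5, one simple possibility is the distance between log-magnitude spectra on either side of a concatenation joint. This is an illustrative measure, not necessarily the one used in this work; the frame length and window are assumptions.

import numpy as np

def joint_spectral_mismatch(left_unit, right_unit, fs=16000, frame_ms=25):
    """Euclidean distance between the log-magnitude spectra of the last
    frame of the left unit and the first frame of the right unit.
    A large value suggests an audible discontinuity at the joint."""
    n = int(fs * frame_ms / 1000)
    a = left_unit[-n:].astype(float) * np.hanning(n)
    b = right_unit[:n].astype(float) * np.hanning(n)
    spec_a = np.log(np.abs(np.fft.rfft(a)) + 1e-12)
    spec_b = np.log(np.abs(np.fft.rfft(b)) + 1e-12)
    return float(np.linalg.norm(spec_a - spec_b))

# Toy check: a joint between similar tones scores lower than a joint
# between a tone and noise.
fs = 16000
t = np.arange(int(0.1 * fs)) / fs
tone = np.sin(2 * np.pi * 200 * t)
noise = np.random.randn(t.size)
print(joint_spectral_mismatch(tone, tone, fs))   # small
print(joint_spectral_mismatch(tone, noise, fs))  # large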

3.2 SOUND ELEMENTS FOR SPEECH SYNTHESIS:

Existing speech synthesis systems use different sound elements. The most common
are:
 phones
 diphones
 phone clusters
 syllables

1) The phone is the smallest sound element; it cannot be segmented further. It represents a typical kind of sound or sound nuance. Sounds (phones) which are phonetically similar belong to the same phoneme.

2) A diphone begins in the second half of a phone (stationary region) and ends in the first half of the next phone (stationary region); thus a diphone always contains a sound transition. Diphones are more suitable sound elements for speech synthesis: compared with phones, segmentation is simpler, the time duration of diphones is longer and the segment boundaries are easier to detect.
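The midpoint-to-midpoint definition of a diphone can be made concrete with a few lines of Python; the phone labels and times below are hypothetical.

# Given phone labels with start and end times (in seconds), each diphone
# runs from the midpoint (stationary region) of one phone to the midpoint
# of the next, so it always contains the transition between them.
phones = [("sil", 0.00, 0.10), ("k", 0.10, 0.18), ("a", 0.18, 0.35), ("t", 0.35, 0.45)]

def diphones(phone_list):
    spans = []
    for (p1, s1, e1), (p2, s2, e2) in zip(phone_list, phone_list[1:]):
        mid1 = (s1 + e1) / 2
        mid2 = (s2 + e2) / 2
        spans.append((f"{p1}-{p2}", mid1, mid2))
    return spans

for name, start, end in diphones(phones):
    print(f"{name}: {start:.3f}-{end:.3f} s")
# sil-k, k-a, a-t, each spanning one phone transition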

3) Phone clusters are sequences of vowels or consonants. According to the position of the sound sequence, phone clusters are split into initial, medial and final clusters. Medial clusters can very often be divided into initial and final clusters.

4) A syllable is a basic phonetic-phonological unit of a word. It consists of a syllable start, syllable core and syllable end. Syllables are not influenced by neighboring sound elements, and the segmentation of syllables is relatively easy. A disadvantage for synthesis purposes is that if only syllables are used, a huge number of syllables is required. Hence the present Marathi TTS system uses a hybrid approach (most common words plus syllables) [4].

5) A synthesis-by-concatenation text-to-speech (TTS) synthesizer must be capable of automatically producing speech. It uses the most commonly used words in the audio database. The audio database consists of already recorded words, and the textual database consists of details of these recorded words.

6) The syllable database is in textual form. Using these syllables, a large number of new words can be synthesized by concatenation. These syllables and new words are not stored in audio form but are synthesized as required at runtime. An audio database needs more memory than a textual database. Hybrid concatenative synthesis, using the most commonly used words plus syllables, reduces memory usage and increases naturalness [5].
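A minimal sketch of this hybrid lookup is shown below: words present in the audio database are played directly, and anything else is broken into syllables for runtime concatenation. The databases and the toy CV syllabifier are hypothetical stand-ins for the real Marathi resources.

AUDIO_WORDS = {"namaskar", "pustak"}   # most common recorded words
SYLLABLES = {"ka", "ma", "la"}         # textual syllable inventory

def plan_units(word, syllabify):
    if word in AUDIO_WORDS:
        return [("word", word)]        # play the stored recording
    return [("syllable", s) for s in syllabify(word)]

def naive_syllabify(word):
    # Toy CV syllabifier: pair each consonant with the following vowel.
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and word[i + 1] in "aeiou":
            out.append(word[i:i + 2]); i += 2
        else:
            out.append(word[i]); i += 1
    return out

print(plan_units("namaskar", naive_syllabify))  # stored word
print(plan_units("kamala", naive_syllabify))    # ka + ma + la at runtime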

3.2.1 Classification of speech:


As per the mode of excitation, speech sounds can be divided into three broad classes: voiced, unvoiced and plosive sounds.

a) Voiced sounds
1) Voiced sounds have a source due to periodic glottal excitation, which can be approximated by an impulse train in the time domain and by harmonics in the frequency domain.
2) Voiced sounds are generated in the throat.
3) They are produced when air from the lungs is forced through the vocal cords.
4) The vocal cords vibrate periodically and generate pulses called glottal pulses.
5) Characterized by:
   i) High energy levels
   ii) Very distinct resonant and formant frequencies
6) The rate at which the vocal cords vibrate determines the pitch.
7) Glottal pulses pass through the vocal tract, where some frequencies resonate.
8) The rate at which the vocal cords vibrate depends on the air pressure in the lungs and the tension in the vocal cords, both of which the speaker can control to vary the pitch of the sound being produced.
9) The range of pitch for an adult male is from about 50 Hz to about 250 Hz, with an average value of about 120 Hz. For an adult female the upper limit of the range is much higher, perhaps as high as 500 Hz (a pitch-estimation sketch follows this list).
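Since voiced excitation is roughly periodic, pitch can be estimated from the autocorrelation peak within the plausible lag range. The following Python sketch assumes a clean single frame; fmin and fmax simply bracket the adult pitch ranges quoted above.

import numpy as np

def estimate_pitch(frame, fs, fmin=50.0, fmax=500.0):
    """Estimate the pitch of a voiced frame from its autocorrelation peak,
    searching only lags corresponding to fmin..fmax."""
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

fs = 16000
t = np.arange(int(0.04 * fs)) / fs        # one 40 ms frame
frame = np.sin(2 * np.pi * 120 * t)       # 120 Hz: average male pitch
print(round(estimate_pitch(frame, fs)))   # ~120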

b) Unvoiced sounds
1) Unvoiced sounds are non-periodic in nature. Examples of sub-types of unvoiced sounds are given below:
   i) Fricatives: Fricatives are consonants produced by forcing air through a narrow channel made by placing two articulators close together. An example of a fricative is the "th" in "thin".
   ii) Plosives: In phonetics, a plosive consonant, also known as an oral stop, is a consonant made by blocking part of the mouth so that no air can pass through; the pressure builds up behind the blockage, and the sound is created when the air is allowed to pass through again. An example of a plosive is the "t" in "top".
   iii) Whispered: A whispered sound is produced with soft, hushed sounds, using the breath, lips, etc., but with no vibration of the vocal cords. An example of a whispered sound is the "h" in "he".
2) Produced by turbulent air flow through the vocal tract.
3) Unvoiced sounds are produced in the mouth.
4) The vocal cords are open.
5) Pitch information is unimportant.
6) Characterized by:
   i) Higher frequencies than voiced sounds
   ii) Lower energy than voiced sounds
7) Can be modeled as a random sequence.
8) Sounds such as "s" are unvoiced sounds.
9) In the production of unvoiced sounds the vocal cords do not vibrate.

c) Plosive sounds
Plosive sounds, for example the /p/ (puh) sound at the beginning of the word 'pin' or the /d/ (duh) sound at the beginning of 'daf', are produced by creating yet another type of excitation. For this class of sound, the vocal tract is closed at some point, the air pressure is allowed to build up and is then suddenly released. The rapid release of this pressure provides a transient excitation of the vocal tract. The transient excitation may occur with or without vocal cord vibration, producing voiced (such as 'daf') or unvoiced (such as 'pin') plosive sounds.

3.2.2 Elements of a language:


1) Phoneme: Phonemes differentiate words of a language.
2) Syllables: One or more phonemes are used to form syllables.
3) Words: One or more syllables are combined to form words.
4) Phrases, sentences: One or more words are combined to form phrases or
sentences.
5) Linguistics: The study of the arrangement of speech sounds according to the rules of a language.

3.3 METHODS AND APPROACHES TO SPEECH SYNTHESIS:

Synthesized speech can be produced by several different methods, all of which have some benefits and deficiencies. A detailed description of the different methods of speech synthesis is given in the following paragraphs.

1) Articulatory synthesis:
 Articulatory synthesis refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes occurring there. The first articulatory synthesizer regularly used for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer and Paul Mermelstein.

 This synthesizer, known as ASY, was based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker and colleagues.

 Articulatory synthesis tries to model the human vocal organs as accurately as possible, so it is potentially the most satisfying method for producing high-quality synthetic speech. On the other hand, it is also one of the most difficult methods to implement, and its computational load is considerably high.

 The first articulatory model was based on a table of vocal tract area functions from larynx to lips for each phonetic segment. For rule-based synthesis, the articulatory control parameters may include, for example, lip aperture, lip protrusion, tongue tip height, tongue tip position, tongue height, tongue position and velic aperture.

 Until recently, articulatory synthesis models had not been incorporated into commercial speech synthesis systems. A notable exception is the NeXT-based system originally developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary, where much of the original research was conducted.

 Following the demise of the various incarnations of NeXT (started by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was published under the GNU General Public License, with work continuing as gnuspeech. The system, first tested in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts.

 When speaking, the vocal tract muscles cause the articulators to move and change the shape of the vocal tract, which produces different sounds. The data for articulatory models is usually derived from X-ray analysis of natural speech. However, this data is usually 2-D, whereas the real vocal tract is naturally 3-D, so rule-based articulatory synthesis is very difficult to optimize owing to the unavailability of sufficient data on the motions of the articulators during speech.

 An advantage of articulatory synthesis is that the vocal tract models allow accurate modeling of transients due to abrupt area changes, whereas formant synthesis models only spectral behavior.

2) Formant synthesis:
 Formant synthesis does not use human speech samples at runtime. Instead, the
synthesized speech output is created using an acoustic model. Parameters such
as fundamental frequency, voicing and noise levels are varied over time to
create a waveform of artificial speech. This method is sometimes called rule-
based synthesis.

 Many systems based on formant synthesis technology generate artificial, robotic-sounding speech that would never be mistaken for human speech. However, maximum naturalness is not always the goal of a speech synthesis system. Formant synthesis systems have advantages over concatenative systems: formant-synthesized speech can be reliably intelligible, even at very high speeds, avoiding the acoustic glitches that commonly plague concatenative systems [6].

 High-speed synthesized speech is used by the visually impaired to quickly navigate computers using a screen reader. Formant synthesizers need less memory than concatenative systems because they do not store a database of speech samples. They can therefore be used in embedded systems, where memory and microprocessor power are especially limited.

 Examples of non-real-time but highly accurate intonation control in formant synthesis include the work done in the late 1970s for the Texas Instruments toy 'Speak & Spell' and in the early 1980s Sega arcade machines. Creating proper intonation for these projects was painstaking, and the results have yet to be matched by real-time text-to-speech interfaces. Probably the most widely used synthesis method during the last few decades has been formant synthesis, which is based on the source-filter model of speech.

 There are two basic structures in general, parallel and cascade, but for better
performance some kind of combination of these is usually used. At least three
formants are generally required to produce intelligible speech and up to five
formants to produce high quality speech.

 Each formant is usually modeled with a two-pole resonator, which enables both the formant frequency (pole-pair frequency) and its bandwidth to be specified (Donovan 1996). Rule-based formant synthesis uses a set of rules to determine the parameters necessary to synthesize a desired utterance with a formant synthesizer [7].

 A cascade formant synthesizer consists of band-pass resonators connected in series, where the output of each formant resonator is applied to the input of the following one. The cascade structure needs only the formant frequencies as control information. Its main advantage is that the relative formant amplitudes for vowels do not need individual controls. The cascade structure has been found better for non-nasal voiced sounds because it needs less control information than the parallel structure.

 A parallel formant synthesizer consists of resonators connected in parallel; sometimes extra resonators for nasals are used. The excitation signal is applied to all formants simultaneously and their outputs are summed. Adjacent outputs of formant resonators must be summed in opposite phase to avoid unwanted zeros or anti-resonances in the frequency response (O'Shaughnessy 1987). The parallel structure enables the bandwidth and gain of each formant to be controlled individually and thus needs more control information. It has been found to be better for nasals, fricatives and stop-consonants, but some vowels cannot be modeled as well with a parallel formant synthesizer as with the cascade one.
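The two-pole resonator and the cascade structure described above can be sketched in a few lines of Python (using scipy for the filtering). The pole radius follows the usual bandwidth relation r = exp(-pi*B/fs); the formant values below are illustrative textbook values for /a/, not parameters from any particular synthesizer.

import numpy as np
from scipy.signal import lfilter

def two_pole_resonator(freq, bw, fs):
    """Digital two-pole resonator for one formant: freq is the formant
    (pole-pair) frequency and bw its bandwidth, both in Hz."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a = [1.0, -2 * r * np.cos(theta), r * r]   # denominator (pole pair)
    b = [1.0 - r]                              # rough gain normalisation
    return b, a

def cascade_synthesize(source, formants, fs):
    """Pass the excitation through the formant resonators in series,
    as in a cascade formant synthesizer."""
    y = source
    for freq, bw in formants:
        b, a = two_pole_resonator(freq, bw, fs)
        y = lfilter(b, a, y)
    return y

# Toy vowel: an impulse train (~120 Hz voiced source) through three
# formants roughly appropriate for /a/.
fs = 16000
source = np.zeros(fs // 2)
source[:: fs // 120] = 1.0                     # glottal-pulse stand-in
vowel = cascade_synthesize(source, [(730, 60), (1090, 80), (2440, 120)], fs)
print(vowel.shape)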

3) Concatenative synthesis:
 Concatenative synthesis is based on the concatenation (stringing together) of segments of recorded speech. Generally, concatenative synthesis produces the most natural-sounding synthesized speech. However, differences between natural variations in speech and the nature of the automated techniques for segmenting the waveforms sometimes result in audible glitches in the output. With voice conversion techniques, concatenative synthesis can produce different voices, and with database reduction mechanisms the storage problem is controlled.

 Connecting prerecorded natural utterances is probably the easiest way to produce intelligible and natural-sounding synthetic speech. However, concatenative synthesizers are usually limited to one speaker and one voice: as preparation of the database is time consuming, usually only one speaker records it, and hence the speech output is of one voice. Because this approach concatenates natural speech segments, storing these segments needs more memory than other approaches to speech synthesis.

 One of the most important aspects of concatenative synthesis is finding the correct unit length. The selection is usually a trade-off between longer and shorter units. With longer units, high naturalness, fewer concatenation points and good control of coarticulation are achieved, but the number of required units and the memory needed increase. With shorter units, less memory is needed, but the sample collection and labeling procedure becomes more difficult and complex. The units used in present systems are usually words, syllables, demi-syllables, phonemes, diphones and sometimes even triphones [8].

 Concatenative synthesis can be divided into three types depending on the concatenation unit used:
1. Word synthesis: prerecorded words are concatenated to produce the output speech.
2. Syllabic synthesis: syllables are concatenated to form words, and the output speech is produced.
3. Hybrid syllabic synthesis: both words and syllables are used to produce the output speech.

The hybrid approach is followed in this research work.

4) HMM-based synthesis:

 HMM-based synthesis, also called statistical parametric synthesis, is a synthesis method based on hidden Markov models. In this system, the frequency spectrum (vocal tract), fundamental frequency (vocal source) and duration (prosody) of speech are modeled simultaneously by HMMs. Speech waveforms are generated from the HMMs themselves based on the maximum likelihood criterion.

 An HMM-based TTS system consists of two stages, the training stage and the synthesis stage. In the training stage, phoneme HMMs are trained using a speech database. Spectrum and f0 are modeled by multi-stream HMMs in which the output distributions for the spectral and f0 parts are modeled using continuous probability distributions and multi-space probability distributions (MSD) respectively.

 To model variations of spectrum and f0, phonetic, prosodic and linguistic contextual factors such as phoneme identity, stress-related factors and location factors are taken into account. Then a decision-tree-based context clustering technique is applied separately to the spectral and f0 parts of the context-dependent phoneme HMMs. Finally, state durations are modeled by multi-dimensional Gaussian distributions, and the state clustering technique is applied to the duration models.


 In the synthesis stage, an arbitrarily given text is first transformed into a context-dependent phoneme label sequence. According to the label sequence, a sentence HMM is constructed by concatenating context-dependent phoneme HMMs. Phoneme durations are determined using the state duration distributions. Then the spectral and f0 parameter sequences are obtained from the sentence HMM based on the ML criterion. Finally, speech is synthesized from the generated mel-cepstral and f0 parameter sequences using the MLSA filter.
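As a toy view of the synthesis stage: with Gaussian output distributions and no dynamic (delta) features, maximum-likelihood parameter generation reduces to emitting each state's mean for its duration, which is why real systems add delta constraints to obtain smooth trajectories. The means and durations below are invented for illustration.

import numpy as np

# Hypothetical per-state output means and duration-Gaussian means.
state_means = [np.array([1.0, 5.0]), np.array([2.0, 4.0]), np.array([0.5, 6.0])]
duration_means = [8, 12, 6]

def generate_parameters(means, durations):
    frames = []
    for mu, d in zip(means, durations):
        frames.extend([mu] * d)    # each state emits its mean for d frames
    return np.vstack(frames)       # (total frames, parameter dimension)

traj = generate_parameters(state_means, duration_means)
print(traj.shape)                  # (26, 2): a piecewise-constant trajectory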

Advantages of HMM synthesis
1) HMM synthesis provides a means to automatically train the specification-to-parameter module, thus bypassing the problems associated with hand-written rules.
2) The trained models can produce high-quality synthesis and have the advantages of being compact and amenable to modification for voice transformation and other purposes.

Disadvantages of HMM synthesis
1) The speech has to be generated by a parametric model, so no matter how naturally the models generate parameters, the final quality is very much dependent on the parameter-to-speech technique used.
2) Even with the dynamic constraints, the models generate somewhat 'safe' observations and fail to generate some of the more interesting and delicate phenomena in speech.

5) Sine wave synthesis:


 Sine wave synthesis is based on the well-known assumption that the speech signal can be represented as a sum of sine waves with time-varying amplitudes and frequencies. It is a technique for synthesizing speech by replacing the formants (the main bands of energy) with pure tone whistles.
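A minimal Python sketch of this idea follows: each formant is replaced by one sinusoid whose frequency and amplitude follow a frame-rate track. The tracks here are invented glides, not measurements from real speech.

import numpy as np

def sine_wave_speech(formant_tracks, amp_tracks, fs):
    """Replace each formant with a pure tone: sum sinusoids whose
    frequency and amplitude vary frame by frame along the given tracks."""
    n_frames = len(formant_tracks[0])
    hop = fs // 100                    # 10 ms frames
    out = np.zeros(n_frames * hop)
    for freqs, amps in zip(formant_tracks, amp_tracks):
        # Expand frame-rate tracks to sample rate, then integrate the
        # instantaneous frequency to get a smoothly varying phase.
        f = np.repeat(freqs, hop)
        a = np.repeat(amps, hop)
        phase = 2 * np.pi * np.cumsum(f) / fs
        out += a * np.sin(phase)
    return out

# Toy example: three "formants" gliding over 50 frames (0.5 s).
fs = 16000
frames = 50
f1 = np.linspace(700, 300, frames)
f2 = np.linspace(1100, 2200, frames)
f3 = np.full(frames, 2500.0)
amps = [np.full(frames, g) for g in (1.0, 0.5, 0.25)]
y = sine_wave_speech([f1, f2, f3], amps, fs)
print(y.shape)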

 The first sine wave synthesis program (SWS) for the automatic creation of stimuli for perceptual experiments was developed by Philip Rubin at Haskins Laboratories in the 1970s. This program was subsequently used by Robert Remez, Philip Rubin, David Pisoni and other colleagues to show that listeners can perceive continuous speech without traditional speech cues. This work paved the way for the view of speech as a dynamic pattern of trajectories through articulatory-acoustic space.

3.4 LANGUAGE STUDY:

 There are 15 officially recognized Indian scripts. These scripts are broadly divided into two categories, namely Brahmi scripts and Perso-Arabic scripts. The Brahmi scripts include Devanagari, Punjabi, Gujarati, Oriya, Bengali, Assamese, Telugu, Kannada, Malayalam and Tamil. Hindi, Marathi and Sanskrit use the Devanagari script.

 The spelling of languages written in Devanagari is partly phonetic, in the sense that a word written in it can only be pronounced one way, but not all possible pronunciations can be written perfectly. Devanagari has 34 consonants (vyanjan) and 12 vowels (svar). A syllable (akshar) is formed by the combination of zero or one consonant and one vowel. All the scripts mentioned above are written in a nonlinear fashion. The division between consonant and vowel applies to all the Indian scripts.

 Marathi is an Indo-European language of the Indo-Aryan branch, spoken by about 96.75 million people, mainly in Maharashtra and its neighboring states; it is also spoken in Israel and Mauritius. Marathi is thought to be a descendant of Maharashtri, one of the Prakrit languages, which developed from Sanskrit. The vowel signs attached to a consonant are not written in a single direction: they can be placed on the top or the bottom of the consonant, as well as to its left or right. Examples of these vowel signs (matras) attached to a consonant are shown in section 3.4.2.

 Since all Indian scripts share the same origin, they share a common phonetic structure. In all of the scripts, the basic consonants, vowels and their phonetic representation are the same. Typically the alphabet is divided into the following categories:

3.4.1 The consonants:


 A consonant is a sound in spoken language (or a letter of the alphabet denoting such a sound) that has no sounding voice (vocal sound) of its own, but must rely on a nearby vowel with which it can sound (sonant). Each consonant has the specialty of carrying an inherent vowel, generally the short vowel 'a'. So the consonant 'va' represents not just the bare consonant /v/ but /v/ together with the inherent /a/.

 There are 34 consonants in Marathi, as shown below. However, in the presence of a dependent vowel, the inherent vowel associated with a consonant is overridden by the dependent vowel. Consonants may also be rendered as half forms, which are presentation forms used to depict the initial consonant in a cluster. These half forms do not have an inherent vowel.

 Some Marathi characters have alternate presentation forms whose choice depends on neighboring consonants. This variability is especially notable for 'ra', which has numerous written forms; it is represented by different signs in words such as prapta, kruti, darja and kharya. The consonants of the Marathi language are shown in figure 3.2.

Fig. 3.2: Consonants in Marathi language

3.4.2 Vowels:
 A vowel is a sound in spoken language that has a sounding voice (vocal sound) of its own; it is produced by a comparatively open configuration of the vocal tract. Unlike a consonant (a non-vowel), a vowel can be sounded on its own. A single vowel sound forms the basis of a syllable. The vowels of the Marathi language are shown in figure 3.3.

Fig. 3.3: Vowels in Marathi language

There are two main representations for vowels.

1. Independent vowel
The writing system tends to treat independent vowels as orthographic CV syllables in which the consonant is null. These vowels are written as full letterforms, placed either at the beginning of a word or after a consonant, and each of them is pronounced separately. The independent vowels are used to write words that start with a vowel, e.g. antar.


The twelve independent vowels, in romanized form, are: Aa, Aaa, Ei, Eee, U, Uoo, Ea, Aai, O, Au, Aam and Aha.

2. Dependent vowel
The dependent vowels serve as the common manner of writing non-inherent vowels. They do not stand alone; rather, they are depicted in combination with a base letterform. The explicit appearance of a dependent vowel in a syllable overrides the inherent vowel of a single consonant. Marathi has a collection of non-spacing dependent vowel signs that may occur above or below a consonant, as well as spacing dependent vowel signs that may occur to the left or right of a consonant. In Marathi there is only one spacing dependent vowel that occurs to the left of the consonant: the short 'i' sign.

Usage of the dependent vowels with the consonant 't', in romanized form: ta, taa, ti, tee, tu, too, te, tai, to, tau, tam, taha.

3. Halant
A halant sign, known as virama or the vowel omission sign, serves to cancel the inherent vowel of the consonant to which it is applied. Such a consonant is known as a dead consonant. The halant is bound to a dead consonant as a combining mark.

e.g. ta (consonant) + halant = t (dead consonant)

3.4.3 Consonant conjuncts:


Consonant conjuncts serve as an orthographic abbreviation of two or more adjacent letterforms. This abbreviation takes place only in the context of a consonant cluster. An orthographic consonant cluster is defined as a sequence of characters that represents one or more dead consonants followed by a normal live consonant or an independent vowel.

e.g. t (dead consonant) + ya = tya
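The same conjunct formation can be reproduced with Unicode code points, where the halant is DEVANAGARI SIGN VIRAMA (U+094D). This is a general Unicode illustration, independent of the legacy font used in this thesis.

TA = "\u0924"       # DEVANAGARI LETTER TA
YA = "\u092F"       # DEVANAGARI LETTER YA
VIRAMA = "\u094D"   # DEVANAGARI SIGN VIRAMA (halant)

dead_ta = TA + VIRAMA       # ta + halant -> dead consonant t
conjunct = dead_ta + YA     # t + ya -> tya (rendered as a conjunct)
print(dead_ta, conjunct)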


3.5 PRESENT SCENARIO OF TTS SYSTEMS:

 The first commercial speech synthesis systems were mostly hardware based, and the development process was very time-consuming and expensive. Since computers have become more and more powerful, most synthesis today is software based. Software-based systems are easy to configure and update, and they are also much less expensive than hardware systems. However, a standalone hardware device may still be the best solution when a portable system is needed.

 The speech synthesis process can be divided into high-level and low-level synthesis. The low-level synthesizer is the actual device which generates the output sound from information provided by the high-level component in some format, for example a phonetic representation. The high-level synthesizer is responsible for generating the input data for the low-level device, including correct text pre-processing, pronunciation and prosodic information. Most synthesizers contain both high- and low-level systems, but due to specific problems with the methods they are sometimes developed separately.

3.5.1 DECtalk:
 Digital Equipment Corporation (DEC) has a long tradition in speech synthesis. The DECtalk system is originally descended from MITalk and Klattalk. The present system is available for American English, German and Spanish. It offers nine different voice personalities: four male, four female and one child.

 The system is capable of saying most proper names, e-mail and URL addresses, and supports a customized pronunciation dictionary. It also has punctuation control for pauses, pitch and stress, and voice control commands may be inserted in a text file for use by DECtalk software applications. But the sound of this product is still robotic. The software version has three special modes: speech-to-wave mode, log-file mode and text-to-memory mode.


3.5.2 Bell Labs text-to-speech:

 AT&T Bell Laboratories (Lucent Technologies) has a very long tradition in speech synthesis, going back to the demonstration of the VODER in 1939. The first full TTS system was demonstrated in Boston in 1972 and released in 1973. It was based on the articulatory model developed by Cecil Coker (Klatt 1987). The development of the present concatenative synthesis system was started by Joseph Olive in the mid 1970s (Bell Labs 1997). The present system is based on concatenation of diphones, context-sensitive allophonic units or even triphones. Because of this type of segmentation unit, there are more concatenation joints, and the naturalness of the resulting speech is affected.

 The current system is available for English, French, Spanish, Italian, German, Russian, Romanian, Chinese and Japanese. Its architecture is entirely modular (Möbius et al. 1996): it is designed as a pipeline where each of 13 modules handles one particular step of the process. A change in one of the 13 blocks will not affect the other blocks, but to implement even a single change the interfaces between all blocks need to be modified, and this integration must always be smooth. This is the biggest disadvantage of this system.

3.5.3 Laureate:
 Laureate is a speech synthesis system developed over the last two decades at BT Laboratories (British Telecom). To achieve good platform independence, Laureate is written in standard ANSI C and has a modular architecture (Gaved 1993, Morton 1987). The Laureate system is optimized for telephony applications, so much attention has been paid to text normalization and pronunciation. The system supports multi-channel capabilities and other features needed in telecommunication applications.

 The current version of Laureate is available only for British and American English with several different accents. Prototype versions for French and Spanish also exist, and several other European languages are under development. A talking head for the system has recently been introduced (Breen et al. 1996). More information, including several pre-generated sound examples and an interactive demo, is available at the Laureate home page (BT Laboratory 1998). This system is geared toward one kind of application, telephony, and cannot easily be extended to other applications, as the emphasis is on front-end processing.

3.5.4 SoftVoice:
 SoftVoice Inc. has over 25 years of experience in speech synthesis. The latest version of SVTTS, the fifth-generation multilingual TTS system for Windows, is available for English and Spanish with 20 preset voices including males, females, children, robots and aliens. Languages and parameters may be changed dynamically during speech. More languages are under development, and the user may also create an unlimited number of custom voices. The input text may contain over 30 different control commands for speech features. The speech rate is adjustable between 20 and 800 words per minute, and the fundamental frequency (pitch) between 10 and 2000 Hz. Pitch modulation effects such as vibrato, perturbation and excursion are also included. This system concentrates more on the emotional side of synthesis than on naturalness or quality improvement of the speech output.

 Vocal quality may be set to normal, breathy or whispering, and singing is also supported. The output speech may be listened to in either word-by-word or letter-by-letter mode. The system can return mouth-shape data for animation and is capable of sending synchronization data to other user applications. The basic architecture of the present system is based on formant synthesis.

3.5.5 CNET PSOLA:


 The latest commercial product is available from Elan Informatique as the ProVerbe TTS system. The concatenation unit used is the diphone, sampled at an 8 kHz rate. The ProVerbe speech unit is a serial (RS232 or RS485) external device (150 x 187 x 37 mm) optimized for telecommunication applications such as e-mail reading via telephone.


 The system is available for American and British English, French, German and Spanish. The pitch and speaking rate are adjustable, and the system contains a complete telephone interface allowing direct connection to the public network. ProVerbe also has an ISA-connected internal device capable of multichannel operation. The internal device is available for Russian and has the same features as the serial unit. This system has limited applications and is available only for a limited set of languages; it has not yet been extended to other languages.

3.5.6 ORATOR:
 ORATOR is a TTS system developed by Bell Communications Research (Bellcore). The synthesis is based on demi-syllable concatenation (Santen 1997, Macchi et al. 1993, Spiegel 1993). The latest ORATOR version provides probably one of the most natural-sounding voices available today. Special attention has been given to text processing and the pronunciation of proper names for American English, so the system is well suited to telephone applications. The current version of ORATOR is available only for American English and supports several platforms, such as Windows NT, Sun and DEC stations. The system is thus limited to certain languages and platforms, and was developed mainly from the point of view of front-end processing. Demi-syllables result in a larger number of concatenation points, and hence the performance of this system is not as good as that of present-day systems.

3.5.7 Eurovocs:
 Eurovocs is a text-to-speech synthesizer developed by Technologies & Revalidate (T&R) in Belgium. It is a small (200 x 110 x 50 mm, 600 g) external device with a built-in speaker, and it can be connected to any system or computer capable of sending ASCII via a standard RS232 serial interface. No additional software on the computer is needed.

 The Eurovocs system uses the text-to-speech technology of Lernout & Hauspie speech products, described in the following section, and is available for Dutch, French, German, Italian and American English. One Eurovocs device can be programmed with two languages. The system also supports personal dictionaries. A recently introduced improved version adds Spanish and some improvements in speech quality. Only two languages at a time can be used with this type of product; it is available for only a few languages, and it is an external device which must be connected to a computer.

3.5.8 Lernout & Hauspie's:

 Lernout & Hauspie (L&H) has several TTS products with different features depending on the markets in which they are used. Different products are available, optimized for application fields such as computers and multimedia (TTS2000/M), telecommunications (TTS2000/T), automotive electronics (TTS3000/A) and consumer electronics (TTS3000/C).

 All versions are available for American English, and the first two for German, Dutch, Spanish, Italian and Korean (Lernout & Hauspie 1997). Several other languages such as Japanese, Arabic and Chinese are under development. The products have a customizable vocabulary tool that permits the user to add special pronunciations for words which normal pronunciation rules do not handle. With a special transplanted-prosody tool it is possible to copy duration and intonation values from recorded speech for commonly used sentences, which may be used, for example, in information and announcement systems.

 Different versions are available for different applications; a single product cannot be used across applications. Only American English is supported by all versions; other languages are still under development.

3.5.9 Apple PlainTalk:

 Apple has developed three different speech synthesis systems for its Macintosh personal computers, with different levels of quality for different requirements. The PlainTalk products are available for Macintosh computers only, and they can be downloaded free from the Apple homepage.


 MacinTalk 2 is a wavetable synthesizer with ten built-in voices. It uses only 150 kilobytes of memory and has the lowest quality of the PlainTalk family, but it runs on almost every Macintosh system.

 MacinTalk 3 is a formant synthesizer with 19 different voices and considerably better speech quality than MacinTalk 2. It supports singing voices and some special effects, and has the largest set of different sounds. The system requires at least a Macintosh with a 68030 processor and about 300 KB of memory.

 MacinTalk Pro is the highest-quality product of the family, based on concatenative synthesis. Its system requirements are considerably higher than those of the other versions, but it has three adjustable quality levels for slower machines. The Pro version requires a 68040 or PowerPC processor with operating system version 7.0 and uses about 1.5 MB of memory. The pronunciations are derived from a dictionary of about 65,000 words and 5,000 common names.

 The database size of this system is very large and the processor requirement is high. Although this system offers different voices, its processor and memory requirements are considerable.

3.5.10 Silpa:
 Silpa stands for the Swathanthra Indian language computing project. It is a web platform for easily hosting free-software language processing applications: a web framework and a set of applications for processing Indian languages in many ways. In other words, it is a platform for porting existing and upcoming language processing applications to the web. Silpa can be used as a Python library or as a web service from other applications.

 Silpa is a web platform for language processing applications, not an independent synthesis system.

The product range of text-to-speech synthesis is very wide, and it is not feasible to present all the products or systems available.
