CHAPTER-3
THEORY OF TTS
3.1 INTRODUCTION:
The ideal speech synthesizer is both natural and intelligible, and speech
synthesis systems usually try to maximize both characteristics. The two
primary technologies for generating synthetic speech waveforms are
concatenative synthesis and formant synthesis. Each technology has its own
strengths and weaknesses, and the intended use of a synthesis system
typically determines which approach is used. The following figure 3.1 shows
speech synthesis technologies and their sub-types.
Figure 3.1: Speech synthesis technologies and their sub-types
Different methods used to produce synthesized speech can be classified into three
main groups:
i) Articulatory synthesis: This method attempts to model the human speech
production system by controlling the speech articulators (e.g. jaw, tongue,
lips). Articulatory synthesis is based on physical models of the human speech
production system. Because the complex human articulatory organs are not yet
fully understood, articulatory synthesis has not led to high-quality speech
synthesis.
ii) Formant synthesis: This method models the pole frequencies of the speech
signal, or the transfer function of the vocal tract, based on the source-filter
model. Formant synthesis is based on rules which describe the resonant
frequencies of the vocal tract. The formant method uses the source-filter model
of speech production, in which speech is modeled by the parameters of the
source and the filter. The result sounds unnatural, since it is difficult to
estimate the vocal tract model and source parameters accurately.
iii) Concatenative synthesis: This method joins together segments of recorded
natural speech; it is described in detail later in this chapter.
Most of the information in the digital world is accessible only to the few
who can read or understand a particular language. Language technologies can
provide solutions in the form of natural interfaces so that digital content
can reach the masses and facilitate the exchange of information across people
speaking different languages.
The scripts of Indian languages are phonetic in nature: there is a more or
less one-to-one correspondence between what is written and what is spoken.
However, in Hindi and Marathi the inherent vowel (short /a/) associated with
a consonant is, depending on the context, not pronounced. This is referred to
as inherent vowel suppression (IVS) or schwa deletion [3].
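Schwa deletion can be illustrated with a toy rule. The sketch below handles only the simplest, word-final case; real IVS in Hindi and Marathi is context dependent, and the helper name and rule here are hypothetical illustrations, not a complete algorithm.

```python
def naive_schwa_deletion(translit):
    """Toy illustration of inherent vowel suppression (IVS): drop the
    word-final inherent 'a'. Real schwa deletion is context dependent
    and needs a much fuller rule set."""
    if len(translit) > 2 and translit.endswith("a") and not translit.endswith("aa"):
        return translit[:-1]
    return translit

# The written form "kamala" (each consonant carries the inherent /a/)
# is pronounced "kamal": the final schwa is deleted.
```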
The main goal in processing the speech signal is to obtain a more convenient
or more useful representation of the information contained in the signal.
Time-domain processing methods deal directly with the waveform of the speech
signal, which can be characterized by time-domain measurements such as the
average zero-crossing rate, energy and the auto-correlation function.
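These three measurements are easy to compute directly from the waveform. The following sketch (Python with NumPy) computes each of them for one analysis frame; the sawtooth test signal is only a stand-in for a voiced speech frame.

```python
import numpy as np

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    return np.mean(np.abs(np.diff(np.sign(frame))) > 0)

def short_time_energy(frame):
    """Sum of squared samples in the frame."""
    return float(np.sum(frame ** 2))

def autocorrelation(frame, lag):
    """Autocorrelation of the frame at the given lag (in samples)."""
    return float(np.sum(frame[:len(frame) - lag] * frame[lag:]))

# A 30 ms frame of a 100 Hz sawtooth sampled at 8 kHz (stand-in for voiced speech)
fs = 8000
t = np.arange(int(0.03 * fs)) / fs
frame = 2 * (100 * t % 1.0) - 1.0

zcr = zero_crossing_rate(frame)
energy = short_time_energy(frame)
# A voiced frame shows a strong autocorrelation peak at the pitch period
# (100 Hz at 8 kHz = 80 samples here)
r0, r80 = autocorrelation(frame, 0), autocorrelation(frame, 80)
```

A periodic (voiced) frame gives a low zero-crossing rate and a large autocorrelation value at the pitch-period lag, which is why these measures are useful for voiced/unvoiced decisions and pitch estimation.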
State-of-the-art speech synthesis systems demand a high overall quality.
However, synthesized speech still lacks naturalness. To help achieve more
natural-sounding speech synthesis, not only is the construction of a rich
database important, but precise alignment is also vital. This research work
increases the naturalness of concatenative TTS systems. To improve the
overall performance of a TTS system, this work focuses on:
Existing speech synthesis systems use different sound elements. The most common
are:
phones
diphones
phone clusters
syllables
2) A diphone begins at the second half of a phone (a stationary region) and
ends at the first half of the next phone (also a stationary region). Thus, a
diphone always contains a sound transition. Diphones are more suitable sound
elements for speech synthesis: compared with phones, segmentation is simpler,
the time duration of diphones is longer, and the segment boundaries are
easier to detect.
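The midpoint-to-midpoint cut can be sketched as follows. The `(label, start, end)` phone format and the example segmentation of the word "pin" are assumptions for illustration, not a real alignment.

```python
def diphone_boundaries(phones):
    """Given phones as (label, start_sec, end_sec) tuples, return diphone
    units spanning from the midpoint (stationary region) of each phone to
    the midpoint of the next, so each unit contains one sound transition."""
    units = []
    for (l1, s1, e1), (l2, s2, e2) in zip(phones, phones[1:]):
        mid1 = (s1 + e1) / 2.0  # stationary centre of the first phone
        mid2 = (s2 + e2) / 2.0  # stationary centre of the second phone
        units.append((f"{l1}-{l2}", mid1, mid2))
    return units

# Hypothetical segmentation of the word "pin": /p/ /i/ /n/
phones = [("p", 0.00, 0.08), ("i", 0.08, 0.20), ("n", 0.20, 0.30)]
diphones = diphone_boundaries(phones)
# Two diphones result: p-i and i-n, each cut in the stationary regions
```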
final cluster. A medial cluster can very often be divided into initial and
final clusters.
a) Voiced sounds
1) Voiced sounds have a source due to periodic glottal excitation, which can
be approximated by an impulse train in the time domain and by harmonics in
the frequency domain.
2) Voiced sounds are generated in the throat.
3) They are produced when air from the lungs is forced through the vocal cords.
4) The vocal cords vibrate periodically and generate pulses called glottal
pulses.
5) Characterized by
i) Lower frequencies than unvoiced sounds
ii) Higher energy than unvoiced sounds
b) Unvoiced sounds
1) Unvoiced sounds are non-periodic in nature. Examples of sub-types of
unvoiced sounds are given below:
i) Fricatives: Fricatives are consonants produced by forcing air through a
narrow channel formed by placing two articulators close together. An
example of a fricative is the "th" in "thin".
ii) Plosives: In phonetics, a plosive consonant, also known as an oral stop,
is a consonant made by blocking part of the mouth so that no air can pass
through; pressure builds up behind the blockage, and the sound is created
when the air is suddenly released. An example of a plosive is the "t" in
"top".
iii) Whispered: A whispered sound is spoken with soft, hushed sounds, using
the breath, lips, etc., but with no vibration of the vocal cords. An
example of a whispered sound is the "h" in "he".
2) Produced by turbulent air flow through the vocal tract.
3) Unvoiced sounds are generated in the mouth.
4) The vocal cords are open (not vibrating).
5) Pitch information is unimportant.
6) Characterized by
i) Higher frequencies than voiced sounds
ii) Lower energy than voiced sounds
7) Can be modeled as a random sequence.
8) Fricatives such as the "s" sound are unvoiced sounds.
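The contrast between the two source types can be sketched numerically: a voiced source approximated by an impulse train and an unvoiced source by a random sequence, compared through the zero-crossing rate. The sample rate and pitch values below are illustrative assumptions.

```python
import numpy as np

fs = 8000                    # sample rate in Hz (illustrative)
n = fs // 10                 # 100 ms of excitation

# Voiced source: periodic impulse train (glottal pulses) at a 100 Hz pitch
voiced = np.zeros(n)
voiced[::fs // 100] = 1.0

# Unvoiced source: random (white-noise) sequence with no periodicity
rng = np.random.default_rng(0)
unvoiced = rng.standard_normal(n)

def zcr(x):
    """Average zero-crossing rate of a signal."""
    return np.mean(np.abs(np.diff(np.sign(x))) > 0)

# The unvoiced source crosses zero far more often, reflecting its higher
# frequency content, as characterized in the list above.
```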
c) Plosive sounds
Plosive sounds, for example the 'puh' sound at the beginning of the word
'pin' or the 'duh' sound at the beginning of 'daf', are produced by creating
yet another type of excitation. For this class of sound, the vocal tract is
closed at some point; the air pressure is allowed to build up and is then
suddenly released. The rapid release of this pressure provides a transient
excitation of the vocal tract. The transient excitation may occur with or
without vocal cord vibration, producing voiced (such as in 'daf') or unvoiced
(such as in 'pin') plosive sounds.
Synthesized speech can be produced by several different methods, all of which
have some benefits and deficiencies. A detailed description of the different
methods of speech synthesis is given in the following paragraphs.
1) Articulatory synthesis:
Articulatory synthesis refers to computational techniques for synthesizing
speech based on models of the human vocal tract and the articulation
processes occurring there. The first articulatory synthesizer regularly used
for laboratory experiments was developed at Haskins Laboratories in the
mid-1970s by Philip Rubin, Tom Baer and Paul Mermelstein. This synthesizer,
known as ASY, was based on vocal tract models developed at Bell Laboratories
in the 1960s and 1970s by Paul Mermelstein, Cecil Coker and colleagues.
The first articulatory model was based on a table of vocal tract area
functions from larynx to lips for each phonetic segment. For rule-based
synthesis the articulatory control parameters may be, for example, lip
aperture, lip protrusion, tongue tip height, tongue tip position, tongue
height, tongue position and velic aperture.
Until recently, articulatory synthesis models have not been incorporated into
commercial speech synthesis systems. A notable exception is the NeXT-based
system originally developed and marketed by Trillium Sound Research, a
spin-off company of the University of Calgary, where much of the original
research was conducted.
When speaking, the vocal tract muscles cause the articulators to move and
change the shape of the vocal tract, which produces different sounds.
2) Formant synthesis:
Formant synthesis does not use human speech samples at runtime. Instead, the
synthesized speech output is created using an acoustic model. Parameters such
as fundamental frequency, voicing and noise levels are varied over time to
create a waveform of artificial speech. This method is sometimes called rule-
based synthesis.
Formant synthesis was used, for example, in the Texas Instruments toy 'Speak
& Spell' and in early-1980s Sega arcade machines. Creating proper intonation
for these projects was painstaking, and the results have yet to be matched by
real-time text-to-speech interfaces. Probably the most widely used synthesis
method during the last few decades has been formant synthesis, which is based
on the source-filter model of speech.
There are two basic structures, parallel and cascade, but for better
performance some combination of these is usually used. At least three
formants are generally required to produce intelligible speech, and up to
five formants to produce high-quality speech.
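A minimal cascade formant synthesizer can be sketched as second-order digital resonators in series, driven by a voiced impulse-train source. The formant frequencies and bandwidths below are illustrative textbook values for an /a/-like vowel, not measured data, and the simple normalization is an assumption of this sketch.

```python
import numpy as np

fs = 10000  # sample rate (Hz)

def resonator(x, freq, bw):
    """Second-order digital resonator (one formant) applied to signal x.
    Standard two-pole design: pole radius from the bandwidth, pole angle
    from the formant frequency."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2 * np.pi * freq / fs
    a1, a2 = 2 * r * np.cos(theta), -r * r
    gain = 1 - a1 - a2          # unity gain at DC (simple normalization)
    y = np.zeros_like(x)
    for i in range(len(x)):
        y[i] = gain * x[i]
        if i >= 1:
            y[i] += a1 * y[i - 1]
        if i >= 2:
            y[i] += a2 * y[i - 2]
    return y

# Voiced excitation: 100 Hz impulse train, 0.2 s long
src = np.zeros(fs // 5)
src[::fs // 100] = 1.0

# Cascade three formants of an /a/-like vowel (illustrative values)
speech = src
for f, bw in [(730, 60), (1090, 90), (2440, 120)]:
    speech = resonator(speech, f, bw)
```

Because the resonators are cascaded, the spectral envelope of the output shows peaks near the three formant frequencies and rolls off above them, which is the essence of the rule-based source-filter approach.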
3) Concatenative synthesis:
Concatenative synthesis is based on the concatenation (stringing together) of
segments of recorded speech. Generally, concatenative synthesis produces the
most natural-sounding synthesized speech. However, differences between
natural variations in speech and the nature of the automated techniques for
segmenting the waveforms sometimes result in audible glitches in the output.
Voice conversion techniques allow concatenative synthesis to produce
different voices, and database reduction mechanisms keep the storage
requirement under control.
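The concatenation step, and the glitch problem, can be sketched as follows: units are joined with a short linear crossfade at each boundary so the waveform stays continuous. The sine-burst "units" merely stand in for recorded diphone waveforms, and the crossfade length is an illustrative assumption.

```python
import numpy as np

def concatenate(units, fade=64):
    """Join recorded speech units with a short linear crossfade at each
    boundary to reduce audible glitches at the concatenation points."""
    out = units[0].astype(float)
    ramp = np.linspace(0.0, 1.0, fade)
    for u in units[1:]:
        u = u.astype(float)
        # Overlap the tail of the output with the head of the next unit
        out[-fade:] = out[-fade:] * (1.0 - ramp) + u[:fade] * ramp
        out = np.concatenate([out, u[fade:]])
    return out

# Two hypothetical "recorded" units (sine bursts standing in for diphones)
fs = 8000
t = np.arange(fs // 10) / fs
unit_a = np.sin(2 * np.pi * 200 * t)
unit_b = np.sin(2 * np.pi * 300 * t)
speech = concatenate([unit_a, unit_b])
```

A hard cut at the boundary would produce a sample-to-sample jump (an audible click); the crossfade bounds that jump, which is the simplest form of boundary smoothing used in concatenative systems.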
An HMM-based TTS system consists of two stages, the training stage and the
synthesis stage. In the training stage, phoneme HMMs are trained using a
speech database. The spectrum and F0 are modeled by multi-stream HMMs, in
which the output distributions for the spectral and F0 parts are modeled
using continuous probability distributions and multi-space probability
distributions (MSD), respectively.
The first sine-wave synthesis program (SWS) for the automatic creation of
stimuli for perceptual experiments was developed by Philip Rubin at Haskins
Laboratories in the 1970s. This program was subsequently used by Robert
Remez, Philip Rubin, David Pisoni and other colleagues to show that listeners
can perceive continuous speech without traditional speech cues. This work
paved the way for the view of speech as a dynamic pattern of trajectories
through articulatory-acoustic space.
There are 15 officially recognized Indian scripts. These scripts are broadly
divided into two categories, namely Brahmi scripts and Perso-Arabic scripts.
The Brahmi scripts include Devanagari, Punjabi (Gurmukhi), Gujarati, Oriya,
Bengali, Assamese, Telugu, Kannada, Malayalam and Tamil.
Since the origin of all Indian scripts is the same, they share a common
phonetic structure. In all of these scripts the basic consonants, the vowels
and their phonetic representation are the same. Typically the alphabet is
divided into the following categories. A special case is the consonant र (Ra),
which has numerous written forms (such as the reph and rakar signs): र is
represented using different signs in words like प्राप्त (prapta), कृती (kruti),
दर्जा (darja) and खऱ्या (kharya). Consonants in the Marathi language are shown
in the following figure 3.2.
Figure 3.2: Consonants
3.4.2 Vowels:
A vowel is a sound in spoken language that has a voiced sound of its own; it
is produced by a comparatively open configuration of the vocal tract. Unlike
a consonant (a non-vowel), a vowel can be sounded on its own, and a single
vowel sound forms the basis of a syllable. Vowels in the Marathi language are
shown in the following figure 3.3:
1. Independent vowel
The writing system tends to treat independent vowels as orthographic CV
syllables in which the consonant is null. These vowels appear either at the
beginning of a word or after another vowel, and each of them is pronounced
separately. The independent vowels are used to write words that start with a
vowel, for example अंतर (antar).
The typical independent vowels are: अ (Aa), आ (Aaa), इ (Ei), ई (Eee), उ (U),
ऊ (Uoo), ए (Ea), ऐ (Aai), ओ (O), औ (Au), अं (Aam), अः (Aha)
2. Dependent vowel
The dependent vowels serve as the common means for writing non-inherent
vowels. They do not stand alone; rather, they are depicted in combination
with a base letterform. The explicit appearance of a dependent vowel in a
syllable overrides the inherent vowel of the consonant. Marathi has a
collection of non-spacing dependent vowel signs that may occur above or below
a consonant, as well as spacing dependent vowel signs that may occur to the
left or right of a consonant. In Marathi there is only one spacing dependent
vowel that occurs to the left of the consonant, i.e. ि (the short /i/ sign).
Usage of dependent vowels: त ता ति ती तु तू ते तै तो तौ तं तः (ta, taa, ti, tee,
tu, too, te, tai, to, tau, tam, taha)
3. Halant
A halant sign ् , known as virama or the vowel-omission sign, serves to cancel
the inherent vowel of the consonant to which it is applied. Such a consonant
is known as a dead consonant. The halant is bound to a dead consonant as a
combining mark.
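Unicode encodes this mechanism directly: the code point U+094D DEVANAGARI SIGN VIRAMA is a combining mark that follows the consonant whose inherent vowel it cancels. A small Python illustration:

```python
import unicodedata

# Devanagari code points (from the Unicode standard)
KA      = "\u0915"   # क  consonant KA, carries the inherent vowel /a/
VIRAMA  = "\u094D"   # ्  halant / virama, the vowel-omission sign
VOWEL_I = "\u093F"   # ि  dependent vowel sign I (renders to the left)

# A consonant followed by the halant becomes a "dead" consonant (no vowel):
dead_ka = KA + VIRAMA    # क्
# A dependent vowel sign overrides the inherent vowel instead of cancelling it:
ki = KA + VOWEL_I        # कि
```

Note that the virama is stored after the consonant even though some dependent signs, like ि, are rendered to its left; the rendering engine, not the stored order, handles the visual reordering.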
The first commercial speech synthesis systems were mostly hardware based, and
the development process was very time-consuming and expensive. Since
computers have become more and more powerful, most synthesis today is
software based. Software-based systems are easy to configure and update, and
they are also much less expensive than hardware systems. However, a
standalone hardware device may still be the best solution when a portable
system is needed.
The speech synthesis process can be divided into high-level and low-level
synthesis. The low-level synthesizer is the actual device which generates the
output sound from information provided by the high-level part in some format,
for example a phonetic representation. The high-level synthesizer is
responsible for generating the input data for the low-level device, including
correct text pre-processing, pronunciation and prosodic information. Most
synthesizers contain both high-level and low-level systems, but due to
specific problems with the methods they are sometimes developed separately.
3.5.1 DECTalk:
Digital Equipment Corporation (DEC) has a long tradition in speech synthesis.
The DECtalk system is originally descended from MITalk and Klattalk. The
present system is available for American English, German and Spanish. It
offers nine different voice personalities: four male, four female and one
child.
The system is capable of saying most proper names, e-mail and URL addresses,
and also supports a customized pronunciation dictionary. It has punctuation
control for pauses, pitch and stress, and the voice control commands may be
inserted in a text file for use by DECtalk software applications. However,
the sound of this product is still robotic. The software version has three
special modes: speech-to-wave mode, log-file mode and text-to-memory mode.
3.5.2 Bell Labs TTS:
The current system is available for English, French, Spanish, Italian,
German, Russian, Romanian, Chinese and Japanese. The architecture of the
current system is entirely modular (Möbius et al. 1996). It is designed as a
pipeline in which each of 13 modules handles one particular step of the
process. A change in one of the 13 blocks will not affect the other blocks,
but to implement even a single change the interfaces between the blocks need
to be modified, and this integration must always be smooth. This is the
biggest disadvantage of this system.
3.5.3 Laureate:
Laureate is a speech synthesis system developed during the last two decades
at BT Laboratories (British Telecom). To achieve good platform independence,
Laureate is written in standard ANSI C and has a modular architecture (Gaved
1993, Morton 1987). The Laureate system is optimized for telephony
applications, so much attention has been paid to text normalization and
pronunciation. The system supports multi-channel capabilities and other
features needed in telecommunication applications.
The current version of Laureate is available only for British and American
English with several different accents. Prototype versions for French and
Spanish also exist, and several other European languages are under
development. A talking head for the system has recently been introduced.
3.5.4 SoftVoice:
SoftVoice Inc. has over 25 years of experience in speech synthesis. The
latest version of SVTTS, the fifth-generation multilingual TTS system for
Windows, is available for English and Spanish with 20 preset voices including
males, females, children, robots and aliens. Languages and parameters may be
changed dynamically during speech. More languages are under development, and
the user may also create an unlimited number of custom voices. The input text
may contain over 30 different control commands for speech features. The
speech rate is adjustable between 20 and 800 words per minute, and the
fundamental frequency or pitch between 10 and 2000 Hz. Pitch modulation
effects such as vibrato, perturbation and excursion are also included. This
system concentrates more on the emotional part of synthesis than on the
naturalness or quality of the speech output.
Vocal quality may be set to normal, breathy or whispering, and singing is
also supported. The output speech may be listened to in either word-by-word
or letter-by-letter mode. The system can return mouth-shape data for
animation and is capable of sending synchronization data to other user
applications. The basic architecture of the present system is based on
formant synthesis.
3.5.5 ProVerbe:
The system is available for American and British English, French, German,
and Spanish. The pitch and speaking rate are adjustable, and the system
contains a complete telephone interface allowing direct connection to the
public network. ProVerbe has an ISA-connected internal device which is
capable of multichannel operation. The internal device is available for the
Russian language and has the same features as the serial unit. This system
has limited applications and is available only for a limited set of
languages; it has not yet been extended to other languages.
3.5.6 ORATOR:
ORATOR is a TTS system developed by Bell Communications Research (Bellcore).
The synthesis is based on demi-syllable concatenation (Santen 1997, Macchi
et al. 1993, Spiegel 1993). The latest ORATOR version provides probably one
of the most natural-sounding voices available today. Special attention is
given to text processing and the pronunciation of proper names for American
English, so the system is well suited to telephone applications. The current
version of ORATOR is available only for American English and supports several
platforms, such as Windows NT, Sun and DEC stations. The system is thus
limited to certain languages and platforms, and it was developed mainly from
the point of view of front-end processing. Demi-syllables result in a larger
number of concatenation points, and hence the performance of this system is
not as good as the latest systems.
3.5.7 Eurovocs:
Eurovocs is a text-to-speech synthesizer developed by Technologies &
Revalidate (T&R) in Belgium. It is a small (200 x 110 x 50 mm, 600 g)
external device with a built-in speaker, and it can be connected to any
system or computer capable of sending ASCII over a standard RS232 serial
interface. No additional software on the computer is needed.
can be programmed with two languages. The system also supports personal
dictionaries. A recently introduced improved version contains Spanish and
some improvements in speech quality. Only two languages at a time can be used
with this type of product; it is available for only a few languages, and it
is an external device which needs to be connected to a computer.
3.5.8 Lernout & Hauspie:
All versions are available for American English, and the first two also for
German, Dutch, Spanish, Italian and Korean (Lernout & Hauspie 1997). Several
other languages such as Japanese, Arabic and Chinese are under development.
The products have a customizable vocabulary tool that permits the user to add
special pronunciations for words which are not handled correctly by the
normal pronunciation rules. With a special transplanted-prosody tool it is
possible to copy duration and intonation values from recorded speech for
commonly used sentences, which may be used, for example, in information and
announcement systems.
3.5.9 Apple PlainTalk:
MacinTalk2 is a wavetable synthesizer with ten built-in voices. It uses only
150 kilobytes of memory and has the lowest quality of the PlainTalk family,
but it runs on almost every Macintosh system.
The database size of the higher-quality versions of this system is very large
and the processor requirement is high. Although the system offers different
voices, the required processor capacity and memory are considerable.
3.5.10 Silpa:
Silpa stands for Swathanthra Indian Language Computing Project. It is a web
platform to host free-software language processing applications easily. It is
a web framework and a set of applications for processing Indian languages in
many ways; in other words, it is a platform for porting existing and upcoming
language processing applications to the web. Silpa can be used as a Python
library or as a web service from other applications.
The product range of text-to-speech synthesizers is very wide, and it is
quite unreasonable to present every product or system available.