ABSTRACT

This paper gives a general overview of techniques in statistical parametric speech synthesis. One of the instances of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable speech synthesis. This paper also contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years. Advantages and disadvantages of statistical parametric synthesis are highlighted, as well as identifying where we expect the key developments to appear in the immediate future.

Index Terms— Speech synthesis, hidden Markov models

1. BACKGROUND

With the increase in power and resources of computer technology, building natural sounding synthetic voices has progressed from a knowledge-based activity to a data-based one. Rather than hand-crafting each phonetic unit and its applicable contexts, high-quality synthetic voices may be built from sufficiently diverse single-speaker databases of natural speech. We can see a progression from fixed inventories, found in diphone systems [1], to the more general, but more resource-consuming, techniques of unit selection synthesis, where appropriate sub-word units are automatically selected from large databases of natural speech [2].

ATR ν-talk [3] was the first to show the effectiveness of automatic selection of appropriate units; CHATR [2] then generalized these techniques to multiple languages and an automatic training scheme. Unit selection techniques have risen to be the dominant synthesis technique. The quality of the output derives directly from the quality of the recordings, and it appears that the larger the database the better the coverage. Commercial systems have exploited these techniques to bring us a new level of synthetic speech. However, although certainly successful, there is always the issue of spurious errors. When a desired sentence happens to require phonetic and prosody contexts that are under-represented in a database, the quality of the synthesizer can be severely degraded. Even though this may be a rare event, a single bad join in an utterance can ruin the listener's flow.

It is not possible to guarantee that bad joins and/or inappropriate units never occur, simply because of the vast number of possible combinations that could occur. However, for particular applications it is often possible to almost always avoid them. Limited domain synthesizers [4], where the database is designed for the particular application, go a long way to making almost all the synthetic output near perfect.

However, in spite of the desire for perfect synthesis all the time, there are limitations in the unit selection technique. No (or little) modification of the selected pieces of natural speech is carried out, thus limiting the output speech to the style of that in the original recordings.

With a desire for more control over the speech variation, larger databases containing examples of different styles are required. IBM's stylistic synthesis [5] is a good example, but it is limited by the amount of variation that can be recorded.

In direct contrast to this selection of actual instances of speech from a database, statistical parametric speech synthesis has also grown in popularity over the last few years. Statistical parametric synthesis might be most simply described as generating the average of some set of similarly sounding speech segments. This contrasts directly with the desire in unit selection to keep the natural unmodified speech units, but using parametric models offers other benefits.

In both the Blizzard Challenge 2005 and 2006 ([6, 7]), where a common speech database is provided to participants to build a synthetic voice, the results from listening tests have shown that one of the instances of statistical parametric synthesis techniques, called HMM-based generation synthesis (or even HMM-based synthesis), offers more preferred (through MOS tests) and more understandable (through WER scores) synthesis. Although even the proponents of statistical parametric synthesis feel that the best examples of unit selection are better than the best examples of statistical parametric synthesis, overall it appears that statistical parametric synthesis has already reached a quality that can stand in its own right. The quality issue really comes down to the fact that, given a parametric representation, it is necessary to reconstruct the speech from those parameters. The reconstruction process is still not ideal. Although modeling the spectral and prosody features is relatively well defined, models of the residual/excitation have yet to be fully developed, though composite models like STRAIGHT [8] are proving to be useful.

The following section gives a more formal definition of unit selection techniques that will allow an easier contrast with statistical parametric synthesis. Then statistical parametric speech synthesis is more formally defined, specifically based on the implementation in the HMM-based speech synthesis system (HTS) [9, 10]. The final sections discuss some of the advantages of a statistical parametric framework, highlighting existing and future directions.

2. UNIT SELECTION SYNTHESIS

There seem to be two basic techniques in unit selection, though they are theoretically not very different. Hunt and Black presented a selection model [2], which actually existed previously in ATR ν-talk. The basic notion is that of a target cost, how well a candidate unit from the database matches the desired unit, and a concatenation cost, which defines how well two selected units combine.
The target cost between a desired unit t_i and a candidate unit u_i is defined as

    C^t(t_i, u_i) = Σ_{j=1}^{p} w_j^t C_j^t(t_i, u_i)        (1)

where j indexes over all features (typically phonetic and prosodic contexts are used). Concatenation cost is defined as

    C^c(u_{i-1}, u_i) = Σ_{k=1}^{q} w_k^c C_k^c(u_{i-1}, u_i)        (2)

though in this case k may include spectral and acoustic features. Weights (w_j^t and w_k^c) have to be found for each feature, and actual implementations used a combination of trained and hand-tuned weights.
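The target and concatenation costs defined above drive a lattice search over candidate unit sequences. The following minimal sketch shows that dynamic-programming structure; it is illustrative only, not drawn from any of the systems cited, and the feature names, weights, and cost functions are hypothetical stand-ins for the weighted sub-cost sums:

```python
# Illustrative sketch: unit selection as a dynamic-programming (Viterbi)
# search driven by a target cost and a concatenation cost.

def target_cost(target, unit):
    # Weighted sum of feature mismatches between the desired specification
    # and a candidate database unit (hypothetical features and weights).
    weights = {"phone": 10.0, "stress": 1.0}
    return sum(w * (target[f] != unit[f]) for f, w in weights.items())

def concat_cost(prev, unit):
    # Join cost between two selected units: free if the units were adjacent
    # in the database, otherwise an acoustic mismatch at the join point.
    if prev["db_pos"] + 1 == unit["db_pos"]:
        return 0.0
    return abs(prev["f0_end"] - unit["f0_start"])

def select_units(targets, candidates):
    """Find the candidate sequence minimizing total target + join cost.

    candidates[i] is the list of database units considered for targets[i].
    """
    # best[i][j] = (cumulative cost, backpointer into candidates[i-1])
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][j][0] + concat_cost(p, u), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)
    # Trace back the lowest-cost path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

Real systems use many more sub-costs and prune the candidate lists aggressively, but the structure of the search is essentially this lattice DP.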
The second direction ([11] and similarly [12]) uses a clustering method that allows the target cost to effectively be precalculated. Units of the same type are clustered into a decision tree that asks questions about features available at synthesis time (e.g. phonetic and prosody context).

All of these techniques depend on an acoustic distance measure which should be correlated with human perception.

These apparently unit-selection-specific issues are mentioned here because they have specific counterparts in statistical parametric synthesis.

3. STATISTICAL PARAMETRIC SYNTHESIS

3.1. Overview of a typical system

Figure 1 is a block diagram of a typical HMM-based speech synthesis system [9]. It consists of training and synthesis parts.

[Figure 1 block diagram: a training part (speech database; excitation and spectral parameter extraction; training of context-dependent HMMs and duration models) and a synthesis part (TEXT; text analysis; label sequence; parameter generation from HMMs; excitation and spectral parameters; excitation generation; synthesis filter; SYNTHESIZED SPEECH).]
Fig. 1. Overview of a typical HMM-based speech synthesis system.

The training part is similar to those used in speech recognition systems. The main difference is that both spectrum (e.g., mel-cepstral coefficients [13] and their dynamic features) and excitation (e.g., log F0 and its dynamic features) parameters are extracted from a speech database and modeled by context-dependent HMMs (phonetic, linguistic, and prosodic contexts are taken into account). To properly model the log F0 sequence, which includes unvoiced regions, multi-space probability distributions [14] are used for the state output stream for log F0. Each HMM has state duration densities to model the temporal structure of speech [15]. As a result, the system models spectrum, excitation, and durations in a unified framework.

The synthesis part does the inverse operation of speech recognition. First, an arbitrarily given text corresponding to an utterance to be synthesized is converted to a context-dependent label sequence, and then the utterance HMM is constructed by concatenating the context-dependent HMMs according to the label sequence. Secondly, state durations of the HMM are determined based on the state duration probability density functions. Thirdly, the speech parameter generation algorithm (typically, case 1 in [16]) generates the sequence of mel-cepstral coefficients and log F0 values that maximize their output probabilities. Finally, a speech waveform is synthesized directly from the generated mel-cepstral coefficients and F0 values using the MLSA filter [17] with binary pulse or noise excitation.

3.2. Advantages and disadvantages

The biggest disadvantage of the HMM-based generation synthesis approach against the unit selection approach is the quality of synthesized speech. There seem to be three factors which degrade the quality: vocoder, modeling accuracy, and over-smoothing.

The speech synthesized by the HMM-based generation synthesis approach sounds buzzy since it is based on a vocoding technique. To alleviate this problem, high-quality vocoders such as the multi-band excitation scheme [18–21] or STRAIGHT [8] have been integrated. Several groups have recently applied LSP-type parameters instead of mel-cepstral coefficients to the HMM-based generation synthesis approach [22, 23].

The basic system uses ML-estimated HMMs as its acoustic models. Because this system generates speech parameters from its acoustic models, model accuracy highly affects the quality of synthesized speech. To improve modeling accuracy, a number of advanced acoustic models and training frameworks, such as hidden semi-Markov models (HSMMs) [24], trajectory HMMs [25], buried Markov models [26], trended HMMs [27], stochastic Markov graphs [28], the minimum generation error (MGE) criterion [29], and a variational Bayesian approach [30], have been investigated.

In the basic system, the speech parameter generation algorithm (typically case 1 described by Tokuda et al. [16]) is used to generate spectral and excitation parameters from HMMs. By taking account of constraints between the static and dynamic features, it can generate smooth speech parameter trajectories. However, the generated spectral and excitation parameters are often over-smoothed, and synthesized speech using over-smoothed spectral parameters sounds muffled. To reduce this effect and enhance the speech quality, postfiltering [18, 22], a conditional speech parameter generation algorithm [31], and a speech parameter generation algorithm considering global variance [32] have been used.

Advantages of the HMM-based generation synthesis approach are:

1) its voice characteristics can be easily modified,
2) it can be applied to various languages with little modification,
3) a variety of speaking styles or emotional speech can be synthesized using a small amount of speech data,
4) techniques developed in ASR can be easily applied,
5) its footprint is relatively small.
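To make the parameter-generation step described above concrete, here is a minimal sketch, not the paper's implementation, assuming a one-dimensional static stream with a simple first-order delta; the function names and all numeric values are invented for illustration. Maximizing the output probability under the delta constraint reduces to solving the normal equations (W' Sigma^-1 W) c = W' Sigma^-1 mu for the static trajectory c:

```python
# Illustrative 1-D sketch of ML parameter generation with a dynamic-feature
# constraint: per-frame Gaussian means/variances are given for a static
# feature c[t] and its delta c[t] - c[t-1] (as an HMM state sequence would
# provide), and the smooth static trajectory is recovered.

def mlpg_1d(mu_static, var_static, mu_delta, var_delta):
    """Solve (W' Sigma^-1 W) c = W' Sigma^-1 mu.

    mu_delta[0]/var_delta[0] are unused (deltas start at t = 1).
    """
    T = len(mu_static)
    A = [[0.0] * T for _ in range(T)]
    b = [0.0] * T
    for t in range(T):  # static rows of W form an identity block
        A[t][t] += 1.0 / var_static[t]
        b[t] += mu_static[t] / var_static[t]
    for t in range(1, T):  # delta rows encode c[t] - c[t-1]
        p = 1.0 / var_delta[t]
        A[t][t] += p
        A[t - 1][t - 1] += p
        A[t][t - 1] -= p
        A[t - 1][t] -= p
        b[t] += mu_delta[t] * p
        b[t - 1] -= mu_delta[t] * p
    return gauss_solve(A, b)

def gauss_solve(A, b):
    # Plain Gaussian elimination with partial pivoting; enough for a sketch
    # (real implementations exploit the band structure of W' Sigma^-1 W).
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x
```

With delta means near zero and small delta variances, the solution is pulled toward a smooth trajectory through the static means, which illustrates both the smoothing this constraint provides and the over-smoothing trade-off discussed above.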
The voice characteristics in 1) can be changed by transforming HMM parameters appropriately, because the system generates speech waveforms from the HMMs themselves. For example, either speaker adaptation [33, 34], speaker interpolation [35], or an eigenvoice technique [36] was applied to this system, and it was shown that the system could modify its voice characteristics. Multilingual support in 2) can be easily realized because in this system only the contextual factors are dependent on each language. Japanese [9], Mandarin [37, 38], Korean [39], English [40], German [41], Portuguese [42, 43], Swedish [44], Finnish [45, 46], Slovenian [47], Croatian [48], Arabic [19], Farsi [49], and polyglot [50] systems have already been developed by various groups. Speaking styles and emotional voices in 3) can be constructed by re-estimating existing average voice models with only a few utterances using adaptation techniques [51–53]. As for 4), we can employ a number of useful technologies developed for HMM-based speech recognition. For example, structured precision matrix models, which can approximate full covariance models well using a small number of parameters, have successfully been applied to the system [23]. The small footprint in 5) can be realized by storing statistics of HMMs rather than multi-templates of speech units. For example, the footprints of Nitech's Blizzard Challenge 2005 voices were less than 2 MBytes with no compression [54].

4. RELATION AND HYBRID APPROACHES

4.1. Relation between the two approaches

Some clustering-based unit selection approaches use HMM-based state clustering [11]. In this case, the structure is very similar to that of the HMM-based generation synthesis approach. The essential difference between the clustering-based unit selection approach and the HMM-based generation synthesis approach is that each cluster in the generation approach is represented by the statistics of the cluster instead of multi-templates of speech units.

In the HMM-based generation synthesis approach, distributions for spectrum, F0, and duration are clustered independently. Accordingly, it has different decision trees for each of spectrum, F0, and duration. On the other hand, unit selection systems often use regression trees (or CART) for prosody prediction. The decision trees for F0 and duration in the HMM-based generation synthesis approach are essentially equivalent to the regression trees in the unit selection systems. However, in the unit selection systems, the leaves of one of the trees must contain speech waveforms; the other trees are used to calculate target costs, to prune waveform candidates, or to give features for constructing the trees for speech waveforms.

It is noted that in the HMM-based generation synthesis approach, the likelihoods of static and dynamic feature parameters correspond to the target costs and concatenation costs, respectively. This is easy to understand if we approximate each state output distribution by a discrete distribution or by instances of the frame samples in the cluster: when the dynamic feature is calculated as the difference between neighboring static features, ML-based generation results in a frame-wise DP search like unit selection. Thus HMM-based parameter generation can be viewed as an analogue version of unit selection.

4.2. Hybrid approaches

As a natural consequence of the above viewpoints, there are also hybrid approaches.

Some of these approaches use spectrum parameters, F0 values, and durations (or a part of them) generated from HMMs to calculate acoustic target costs for unit selection [55–58]. Similarly, HMM likelihoods are used as "costs" for unit selection [59, 60]. Among these approaches, [57] and [60] use frame-sized units, and [61] uses generated longer trajectories to provide "costs" for unit selection. Another type of hybrid approach uses statistical models as a probabilistic smoother for unit selection [62, 63]. Unifying unit selection and HMM-based generation synthesis has also been investigated [64].

In the future, we may converge on an optimal form of corpus-based speech synthesis fusing the generation and selection approaches.

5. CONCLUSION

We can see that statistical parametric speech synthesis offers a wide range of techniques to improve spoken output. Its more complex models, when compared to standard unit selection, allow for general solutions, without necessarily requiring the recording of speech in all phonetic and prosodic contexts. The pure unit selection view requires very large databases to cover examples of all desired prosodic, phonetic, and stylistic variation. In contrast, statistical parametric synthesis allows models to be combined and adapted, thus not requiring instances of all possible combinations of contexts.

6. ACKNOWLEDGMENTS

This work was partly supported by the MEXT e-Society project. This work was also partly supported by the US National Science Foundation under grant number 0415021, "SPICE: Speech Processing Interactive Creation and Evaluation Toolkit for new Languages." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

7. REFERENCES

[1] J. Olive, A. Greenwood, and J. Coleman, Acoustics of American English Speech: A Dynamic Approach, Springer Verlag, 1993.
[2] A. Hunt and A. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in ICASSP, 1996, pp. 373–376.
[3] Y. Sagisaka, N. Kaiki, N. Iwahashi, and K. Mimura, "ATR ν-TALK speech synthesis system," in ICSLP, 1992, pp. 483–486.
[4] A. Black and K. Lenzo, "Limited domain synthesis," in ICSLP, 2000, pp. 411–414.
[5] E. Eide, A. Aaron, R. Bakis, W. Hamza, M. Picheny, and J. Pitrelli, "A corpus-based approach to <AHEM/> expressive speech synthesis," in ISCA SSW5, 2004.
[6] C. Bennett, "Large scale evaluation of corpus-based synthesizers: Results and lessons from the Blizzard Challenge 2005," in Interspeech, 2005, pp. 105–108.
[7] C. Bennett and A. Black, "Blizzard Challenge 2006," in Blizzard Challenge Workshop, 2006.
[8] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds," Speech Communication, vol. 27, pp. 187–207, 1999.
[9] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in Eurospeech, 1999, pp. 2347–2350.
[10] K. Tokuda, H. Zen, J. Yamagishi, T. Masuko, S. Sako, A. W. Black, and T. Nose, "The HMM-based speech synthesis system (HTS)," http://hts.ics.nitech.ac.jp/.
[11] R. Donovan and P. Woodland, "Improvements in an HMM-based speech synthesiser," in Eurospeech, 1995, pp. 573–576.
[12] A. Black and P. Taylor, "Automatically clustering similar units for unit selection in speech synthesis," in Eurospeech, 1997, pp. 601–604.
[13] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," in ICASSP, 1992, pp. 137–140.
[14] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Multi-space probability distribution HMM," IEICE Trans. Inf. & Syst., vol. E85-D, no. 3, pp. 455–464, 2002.
[15] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Duration modeling for HMM-based speech synthesis," in ICSLP, 1998, pp. 29–32.
[16] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in ICASSP, 2000, pp. 1315–1318.
[17] S. Imai, "Cepstral analysis synthesis on the mel frequency scale," in ICASSP, 1983, pp. 93–96.
[18] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Mixed excitation for HMM-based speech synthesis," in Eurospeech, 2001, pp. 2263–2266.
[19] O. Abdel-Hamid, S. Abdou, and M. Rashwan, "Improving Arabic HMM based speech synthesis quality," in Interspeech, 2006, pp. 1332–1335.
[20] C. Hemptinne, Integration of the harmonic plus noise model into the hidden Markov model-based speech synthesis system, Master thesis, IDIAP Research Institute, 2006.
[21] S.-J. Kim and M.-S. Hahn, "Two-band excitation for HMM-based speech synthesis," IEICE Trans. Inf. & Syst., vol. E90-D, no. 1, pp. 378–381, 2007.
[22] Z.-H. Ling, Y.-J. Wu, Y.-P. Wang, L. Qin, and R.-H. Wang, "USTC system for Blizzard Challenge 2006: an improved HMM-based speech synthesis method," in Blizzard Challenge Workshop, 2006.
[23] H. Zen, T. Toda, and K. Tokuda, "The Nitech-NAIST HMM-based speech synthesis system for the Blizzard Challenge 2006," in Blizzard Challenge Workshop, 2006.
[24] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Hidden semi-Markov model based speech synthesis," in Interspeech, 2004, pp. 1185–1188.
[25] H. Zen, K. Tokuda, and T. Kitamura, "An introduction of trajectory model into HMM-based speech synthesis," in ISCA SSW5, 2004.
[26] I. Bulyko, M. Ostendorf, and J. Bilmes, "Robust splicing costs and efficient search with BMM models for concatenative speech synthesis," in ICASSP, 2002, pp. 461–464.
[27] J. Dines and S. Sridharan, "Trainable speech synthesis with trended hidden Markov models," in ICASSP, 2001, pp. 833–837.
[28] M. Eichner, M. Wolff, S. Ohnewald, and R. Hoffman, "Speech synthesis using stochastic Markov graphs," in ICASSP, 2001, pp. 829–832.
[29] Y.-J. Wu and R.-H. Wang, "Minimum generation error training for HMM-based speech synthesis," in ICASSP, 2006, pp. 89–92.
[30] Y. Nankaku, H. Zen, K. Tokuda, T. Kitamura, and T. Masuko, "A Bayesian approach to HMM-based speech synthesis," in Tech. Rep. of IEICE, 2003, vol. 103, pp. 19–24.
[31] T. Masuko, K. Tokuda, and T. Kobayashi, "A study on conditional parameter generation from HMM based on maximum likelihood criterion," in Autumn Meeting of ASJ, 2003, pp. 209–210.
[32] T. Toda and K. Tokuda, "Speech parameter generation algorithm considering global variance for HMM-based speech synthesis," in Eurospeech, 2005, pp. 2801–2804.
[33] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, "Voice characteristics conversion for HMM-based speech synthesis system," in ICASSP, 1997, pp. 1611–1614.
[34] M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, "Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR," in ICASSP, 2001, pp. 805–808.
[35] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Speaker interpolation in HMM-based speech synthesis system," in Eurospeech, 1997, pp. 2523–2526.
[36] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Eigenvoices for HMM-based speech synthesis," in ICSLP, 2002, pp. 1269–1272.
[37] H. Zen, J. Lu, J. Ni, K. Tokuda, and H. Kawai, "HMM-based prosody modeling and synthesis for Japanese and Chinese speech synthesis," Tech. Rep. TR-SLT-0032, ATR-SLT, 2003.
[38] Y. Qian, F. Soong, Y. Chen, and M. Chu, "An HMM-based Mandarin Chinese text-to-speech system," in ISCSLP, 2006.
[39] S.-J. Kim, J.-J. Kim, and M.-S. Hahn, "Implementation and evaluation of an HMM-based Korean speech synthesis system," IEICE Trans. Inf. & Syst., vol. E89-D, pp. 1116–1119, 2006.
[40] K. Tokuda, H. Zen, and A. Black, "An HMM-based speech synthesis system applied to English," in IEEE Speech Synthesis Workshop, 2002.
[41] C. Weiss, R. Maia, K. Tokuda, and W. Hess, "Low resource HMM-based speech synthesis applied to German," in ESSP, 2005.
[42] R. Maia, H. Zen, K. Tokuda, T. Kitamura, and F. Resende Jr., "Towards the development of a Brazilian Portuguese text-to-speech system based on HMM," in Eurospeech, 2003, pp. 2465–2468.
[43] M. Barros, R. Maia, K. Tokuda, D. Freitas, and F. Resende Jr., "HMM-based European Portuguese speech synthesis," in Interspeech, 2005, pp. 2581–2584.
[44] A. Lundgren, An HMM-based text-to-speech system applied to Swedish, Master thesis, Royal Institute of Technology (KTH), 2005.
[45] T. Ojala, Auditory quality evaluation of present Finnish text-to-speech systems, Master thesis, Helsinki University of Technology, 2006.
[46] M. Vainio, A. Suni, and P. Sirjola, "Developing a Finnish concept-to-speech system," in 2nd Baltic Conference on HLT, 2005, pp. 201–206.
[47] B. Vesnicer and F. Mihelic, "Evaluation of the Slovenian HMM-based speech synthesis system," in TSD, 2004, pp. 513–520.
[48] S. Martincic-Ipsic and I. Ipsic, "Croatian HMM-based speech synthesis," Journal of Computing and Information Technology, vol. 14, no. 4, pp. 307–313, 2006.
[49] M. Homayounpour and S. Mehdi, "Farsi speech synthesis using hidden Markov model and decision trees," The CSI Journal on Computer Science and Engineering, vol. 2, no. 1&3 (a), 2004.
[50] J. Latorre, K. Iwano, and S. Furui, "Polyglot synthesis using a mixture of monolingual corpora," in ICASSP, 2005, vol. 1, pp. 1–4.
[51] J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi, "Modeling of various speaking styles and emotions for HMM-based speech synthesis," in Interspeech, 2003, pp. 2461–2464.
[52] J. Yamagishi, Average-Voice-Based Speech Synthesis, Ph.D. thesis, Tokyo Institute of Technology, 2006.
[53] M. Tachibana, J. Yamagishi, T. Masuko, and T. Kobayashi, "A style adaptation technique for speech synthesis using HSMM and suprasegmental features," IEICE Trans. Inf. & Syst., vol. E89-D, no. 3, pp. 1092–1099, 2006.
[54] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, "Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005," IEICE Trans. Inf. & Syst., vol. E90-D, no. 1, pp. 325–333, 2007.
[55] H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda, "XIMERA: A new TTS from ATR based on corpus-based technologies," in ISCA SSW5, 2004.
[56] S. Rouibia and O. Rosec, "Unit selection for speech synthesis based on a new acoustic target cost," in Interspeech, 2005, pp. 2565–2568.
[57] T. Hirai and S. Tenpaku, "Using 5 ms segments in concatenative speech synthesis," in ISCA SSW5, 2004.
[58] J.-H. Yang, Z.-W. Zhao, Y. Jiang, G.-P. Hu, and X.-R. Wu, "Multi-tier non-uniform unit selection for corpus-based speech synthesis," in Blizzard Challenge Workshop, 2006.
[59] N. Mizutani, K. Tokuda, and T. Kitamura, "Concatenative speech synthesis based on HMM," in Autumn Meeting of ASJ, 2002, pp. 241–242.
[60] Z. Ling and R. Wang, "HMM-based unit selection using frame sized speech segments," in Interspeech, 2006, pp. 2034–2037.
[61] J. Kominek and A. Black, "The Blizzard Challenge 2006 CMU entry introducing hybrid trajectory-selection synthesis," in Blizzard Challenge Workshop, 2006.
[62] M. Plumpe, A. Acero, H. Hon, and X. Huang, "HMM-based smoothing for concatenative speech synthesis," in ICSLP, 1998, pp. 2751–2754.
[63] J. Wouters and M. Macon, "Unit fusion for concatenative speech synthesis," in ICSLP, 2000, pp. 302–305.
[64] P. Taylor, "Unifying unit selection and hidden Markov model speech synthesis," in Interspeech, 2006, pp. 1758–1761.