ABSTRACT

This paper gives a general overview of techniques in statistical parametric speech synthesis. One of the instances of these techniques, called HMM-based generation synthesis (or simply HMM-based synthesis), has recently been shown to be very effective in generating acceptable speech synthesis. This paper also contrasts these techniques with the more conventional unit selection technology that has dominated speech synthesis over the last ten years. Advantages and disadvantages of statistical parametric synthesis are highlighted, as well as identifying where we expect the key developments to appear in the immediate future.

Index Terms— Speech synthesis, hidden Markov models

1. BACKGROUND

With the increase in power and resources of computer technology, building natural sounding synthetic voices has progressed from a knowledge-based activity to a data-based one. Rather than hand-crafting each phonetic unit and its applicable contexts, high-quality synthetic voices may be built from sufficiently diverse single-speaker databases of natural speech. We can see a progression from fixed inventories, found in diphone systems [1], to the more general, but more resource-consuming, techniques of unit selection synthesis, where appropriate sub-word units are automatically selected from large databases of natural speech [2].

ATR ν-talk [3] was the first to show the effectiveness of automatic selection of appropriate units; CHATR [2] then generalized these techniques to multiple languages and an automatic training scheme. Unit selection techniques have risen to be the dominant synthesis technique. The quality of the output derives directly from the quality of the recordings, and it appears that the larger the database the better the coverage. Commercial systems have exploited these techniques to bring us a new level of synthetic speech. However, although certainly successful, there is always the issue of spurious errors. When a desired sentence happens to require phonetic and prosody contexts that are under-represented in a database, the quality of the synthesizer can be severely degraded. Even though this may be a rare event, a single bad join in an utterance can ruin the listener's flow.

It is not possible to guarantee that bad joins and/or inappropriate units never occur, simply because of the vast number of possible combinations that could occur. However, for particular applications it is often possible to almost always avoid them. Limited domain synthesizers [4], where the database is designed for the particular application, go a long way to making almost all the synthetic output near perfect.

However, in spite of the desire for perfect synthesis all the time, there are limitations in the unit selection technique. No (or little) modification of the selected pieces of natural speech is carried out, thus limiting the output speech to the style of that in the original recordings.

With a desire for more control over the speech variation, larger databases containing examples of different styles are required. IBM's stylistic synthesis [5] is a good example, but it is limited by the amount of variation that can be recorded.

In direct contrast to this selection of actual instances of speech from a database, statistical parametric speech synthesis has also grown in popularity over the last few years. Statistical parametric synthesis might be most simply described as generating the average of some set of similarly sounding speech segments. This contrasts directly with the desire in unit selection to keep the natural unmodified speech units, but using parametric models offers other benefits.

In both the Blizzard Challenge 2005 and 2006 ([6, 7]), where a common speech database is provided to participants to build a synthetic voice, the results from listening tests have shown that one of the instances of statistical parametric synthesis techniques, called HMM-based generation synthesis (or even HMM-based synthesis), offers more preferred (through MOS tests) and more understandable (through WER scores) synthesis. Although even the proponents of statistical parametric synthesis feel that the best examples of unit selection are better than the best examples of statistical parametric synthesis, overall it appears that statistical parametric synthesis has already reached a quality that can stand in its own right. The quality issue really comes down to the fact that, given a parametric representation, it is necessary to reconstruct the speech from those parameters. The reconstruction process is still not ideal. Although modeling the spectral and prosody features is relatively well defined, models of the residual/excitation have yet to be fully developed, though composite models like STRAIGHT [8] are proving to be useful.

The following section gives a more formal definition of unit selection techniques that will allow an easier contrast with statistical parametric synthesis. Then statistical parametric speech synthesis is more formally defined, specifically based on the implementation in the HMM-based speech synthesis system (HTS) [9, 10]. The final sections discuss some of the advantages of a statistical parametric framework, highlighting existing and future directions.

2. UNIT SELECTION SYNTHESIS

There seem to be two basic techniques in unit selection, though they are theoretically not very different. Hunt and Black presented a selection model [2], which actually existed previously in ATR ν-talk. The basic notion is that of a target cost, how well a candidate unit from the database matches the desired unit, and a concatenation cost, which defines how well two selected units combine.
The target cost between a desired unit t_i and a candidate unit u_i is defined as

    C^t(t_i, u_i) = Σ_{j=1}^{p} w_j^t C_j^t(t_i, u_i)        (1)

where j indexes over all features (typically phonetic and prosodic contexts are used). Concatenation cost is defined as

    C^c(u_{i-1}, u_i) = Σ_{k=1}^{q} w_k^c C_k^c(u_{i-1}, u_i)        (2)

though in this case k may include spectral and acoustic features. Weights (w_j^t and w_k^c) have to be found for each feature, and actual implementations used a combination of trained and hand-tuned weights.
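The target and concatenation costs defined above drive a lattice search over candidate unit sequences. The following minimal sketch shows that dynamic-programming structure; it is illustrative only, not drawn from any of the systems cited, and the feature names, weights, and cost functions are hypothetical stand-ins for the weighted sub-cost sums:

```python
# Illustrative sketch: unit selection as a dynamic-programming (Viterbi)
# search driven by a target cost and a concatenation cost.

def target_cost(target, unit):
    # Weighted sum of feature mismatches between the desired specification
    # and a candidate database unit (hypothetical features and weights).
    weights = {"phone": 10.0, "stress": 1.0}
    return sum(w * (target[f] != unit[f]) for f, w in weights.items())

def concat_cost(prev, unit):
    # Join cost between two selected units: free if the units were adjacent
    # in the database, otherwise an acoustic mismatch at the join point.
    if prev["db_pos"] + 1 == unit["db_pos"]:
        return 0.0
    return abs(prev["f0_end"] - unit["f0_start"])

def select_units(targets, candidates):
    """Find the candidate sequence minimizing total target + join cost.

    candidates[i] is the list of database units considered for targets[i].
    """
    # best[i][j] = (cumulative cost, backpointer into candidates[i-1])
    best = [[(target_cost(targets[0], u), None) for u in candidates[0]]]
    for i in range(1, len(targets)):
        row = []
        for u in candidates[i]:
            cost, back = min(
                (best[i - 1][j][0] + concat_cost(p, u), j)
                for j, p in enumerate(candidates[i - 1])
            )
            row.append((cost + target_cost(targets[i], u), back))
        best.append(row)
    # Trace back the lowest-cost path.
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(len(targets) - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return list(reversed(path))
```

Real systems use many more sub-costs and prune the candidate lists aggressively, but the structure of the search is essentially this lattice DP.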
The second direction ([11] and similarly [12]) uses a clustering method that allows the target cost to effectively be precalculated. Units of the same type are clustered into a decision tree that asks questions about features available at synthesis time (e.g. phonetic and prosody context).

All of these techniques depend on an acoustic distance measure which should be correlated with human perception.

These apparently unit-selection-specific issues are mentioned here because they have specific counterparts in statistical parametric synthesis.

3. STATISTICAL PARAMETRIC SYNTHESIS

3.1. Overview of a typical system

Figure 1 is a block diagram of a typical HMM-based speech synthesis system [9]. It consists of training and synthesis parts.

[Figure 1 block diagram: a training part (speech database; excitation and spectral parameter extraction; training of context-dependent HMMs and duration models) and a synthesis part (TEXT; text analysis; label sequence; parameter generation from HMMs; excitation and spectral parameters; excitation generation; synthesis filter; SYNTHESIZED SPEECH).]
Fig. 1. Overview of a typical HMM-based speech synthesis system.

The training part is similar to those used in speech recognition systems. The main difference is that both spectrum (e.g., mel-cepstral coefficients [13] and their dynamic features) and excitation (e.g., log F0 and its dynamic features) parameters are extracted from a speech database and modeled by context-dependent HMMs (phonetic, linguistic, and prosodic contexts are taken into account). To properly model the log F0 sequence, which includes unvoiced regions, multi-space probability distributions [14] are used for the state output stream for log F0. Each HMM has state duration densities to model the temporal structure of speech [15]. As a result, the system models spectrum, excitation, and durations in a unified framework.

The synthesis part does the inverse operation of speech recognition. First, an arbitrarily given text corresponding to an utterance to be synthesized is converted to a context-dependent label sequence, and then the utterance HMM is constructed by concatenating the context-dependent HMMs according to the label sequence. Secondly, state durations of the HMM are determined based on the state duration probability density functions. Thirdly, the speech parameter generation algorithm (typically, case 1 in [16]) generates the sequence of mel-cepstral coefficients and log F0 values that maximize their output probabilities. Finally, a speech waveform is synthesized directly from the generated mel-cepstral coefficients and F0 values using the MLSA filter [17] with binary pulse or noise excitation.

3.2. Advantages and disadvantages

The biggest disadvantage of the HMM-based generation synthesis approach against the unit selection approach is the quality of synthesized speech. There seem to be three factors which degrade the quality: vocoder, modeling accuracy, and over-smoothing.

The speech synthesized by the HMM-based generation synthesis approach sounds buzzy since it is based on a vocoding technique. To alleviate this problem, high-quality vocoders such as the multi-band excitation scheme [18–21] or STRAIGHT [8] have been integrated. Several groups have recently applied LSP-type parameters instead of mel-cepstral coefficients to the HMM-based generation synthesis approach [22, 23].

The basic system uses ML-estimated HMMs as its acoustic models. Because this system generates speech parameters from its acoustic models, model accuracy highly affects the quality of synthesized speech. To improve modeling accuracy, a number of advanced acoustic models and training frameworks, such as hidden semi-Markov models (HSMMs) [24], trajectory HMMs [25], buried Markov models [26], trended HMMs [27], stochastic Markov graphs [28], the minimum generation error (MGE) criterion [29], and a variational Bayesian approach [30], have been investigated.

In the basic system, the speech parameter generation algorithm (typically case 1 described by Tokuda et al. [16]) is used to generate spectral and excitation parameters from HMMs. By taking account of constraints between the static and dynamic features, it can generate smooth speech parameter trajectories. However, the generated spectral and excitation parameters are often over-smoothed, and synthesized speech using over-smoothed spectral parameters sounds muffled. To reduce this effect and enhance the speech quality, postfiltering [18, 22], a conditional speech parameter generation algorithm [31], and a speech parameter generation algorithm considering global variance [32] have been used.

Advantages of the HMM-based generation synthesis approach are:

1) its voice characteristics can be easily modified,
2) it can be applied to various languages with little modification,
3) a variety of speaking styles or emotional speech can be synthesized using a small amount of speech data,
4) techniques developed in ASR can be easily applied,
5) its footprint is relatively small.
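To make the parameter-generation step described above concrete, here is a minimal sketch, not the paper's implementation, assuming a one-dimensional static stream with a simple first-order delta; the function names and all numeric values are invented for illustration. Maximizing the output probability under the delta constraint reduces to solving the normal equations (W' Sigma^-1 W) c = W' Sigma^-1 mu for the static trajectory c:

```python
# Illustrative 1-D sketch of ML parameter generation with a dynamic-feature
# constraint: per-frame Gaussian means/variances are given for a static
# feature c[t] and its delta c[t] - c[t-1] (as an HMM state sequence would
# provide), and the smooth static trajectory is recovered.

def mlpg_1d(mu_static, var_static, mu_delta, var_delta):
    """Solve (W' Sigma^-1 W) c = W' Sigma^-1 mu.

    mu_delta[0]/var_delta[0] are unused (deltas start at t = 1).
    """
    T = len(mu_static)
    A = [[0.0] * T for _ in range(T)]
    b = [0.0] * T
    for t in range(T):  # static rows of W form an identity block
        A[t][t] += 1.0 / var_static[t]
        b[t] += mu_static[t] / var_static[t]
    for t in range(1, T):  # delta rows encode c[t] - c[t-1]
        p = 1.0 / var_delta[t]
        A[t][t] += p
        A[t - 1][t - 1] += p
        A[t][t - 1] -= p
        A[t - 1][t] -= p
        b[t] += mu_delta[t] * p
        b[t - 1] -= mu_delta[t] * p
    return gauss_solve(A, b)

def gauss_solve(A, b):
    # Plain Gaussian elimination with partial pivoting; enough for a sketch
    # (real implementations exploit the band structure of W' Sigma^-1 W).
    n = len(b)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for i in range(n):
        piv = max(range(i, n), key=lambda r: abs(M[r][i]))
        M[i], M[piv] = M[piv], M[i]
        for r in range(i + 1, n):
            f = M[r][i] / M[i][i]
            for c in range(i, n + 1):
                M[r][c] -= f * M[i][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x
```

With delta means near zero and small delta variances, the solution is pulled toward a smooth trajectory through the static means, which illustrates both the smoothing this constraint provides and the over-smoothing trade-off discussed above.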
The voice characteristics in 1) can be changed by transforming HMM parameters appropriately, because the system generates speech waveforms from the HMMs themselves. For example, either speaker adaptation [33, 34], speaker interpolation [35], or an eigenvoice technique [36] was applied to this system, and it was shown that the system could modify its voice characteristics. Multilingual support in 2) can be easily realized because in this system only the contextual factors are dependent on each language. Japanese [9], Mandarin [37, 38], Korean [39], English [40], German [41], Portuguese [42, 43], Swedish [44], Finnish [45, 46], Slovenian [47], Croatian [48], Arabic [19], Farsi [49], and polyglot [50] systems have already been developed by various groups. Speaking styles and emotional voices in 3) can be constructed by re-estimating existing average voice models with only a few utterances using adaptation techniques [51–53]. As for 4), we can employ a number of useful technologies developed for HMM-based speech recognition. For example, structured precision matrix models, which can approximate full covariance models well using a small number of parameters, have successfully been applied to the system [23]. The small footprint in 5) can be realized by storing statistics of HMMs rather than multi-templates of speech units. For example, the footprints of Nitech's Blizzard Challenge 2005 voices were less than 2 MBytes with no compression [54].

4. RELATION AND HYBRID APPROACHES

4.1. Relation between the two approaches

Some clustering-based unit selection approaches use HMM-based state clustering [11]. In this case, the structure is very similar to that of the HMM-based generation synthesis approach. The essential difference between the clustering-based unit selection approach and the HMM-based generation synthesis approach is that each cluster in the generation approach is represented by the statistics of the cluster instead of multi-templates of speech units.

In the HMM-based generation synthesis approach, distributions for spectrum, F0, and duration are clustered independently. Accordingly, it has different decision trees for each of spectrum, F0, and duration. On the other hand, unit selection systems often use regression trees (or CART) for prosody prediction. The decision trees for F0 and duration in the HMM-based generation synthesis approach are essentially equivalent to the regression trees in the unit selection systems. However, in the unit selection systems, the leaves of one of the trees must contain speech waveforms; the other trees are used to calculate target costs, to prune waveform candidates, or to give features for constructing the trees for speech waveforms.

It is noted that in the HMM-based generation synthesis approach, the likelihoods of static and dynamic feature parameters correspond to the target costs and concatenation costs, respectively. This is easy to understand if we approximate each state output distribution by a discrete distribution or by instances of the frame samples in the cluster: when the dynamic feature is calculated as the difference between neighboring static features, ML-based generation results in a frame-wise DP search like unit selection. Thus HMM-based parameter generation can be viewed as an analogue version of unit selection.

4.2. Hybrid approaches

As a natural consequence of the above viewpoints, there are also hybrid approaches.

Some of these approaches use spectrum parameters, F0 values, and durations (or a part of them) generated from HMMs to calculate acoustic target costs for unit selection [55–58]. Similarly, HMM likelihoods are used as "costs" for unit selection [59, 60]. Among these approaches, [57] and [60] use frame-sized units, and [61] uses generated longer trajectories to provide "costs" for unit selection. Another type of hybrid approach uses statistical models as a probabilistic smoother for unit selection [62, 63]. Unifying unit selection and HMM-based generation synthesis has also been investigated [64].

In the future, we may converge on an optimal form of corpus-based speech synthesis fusing the generation and selection approaches.

5. CONCLUSION

We can see that statistical parametric speech synthesis offers a wide range of techniques to improve spoken output. Its more complex models, when compared to standard unit selection, allow for general solutions, without necessarily requiring the recording of speech in all phonetic and prosodic contexts. The pure unit selection view requires very large databases to cover examples of all desired prosodic, phonetic, and stylistic variation. In contrast, statistical parametric synthesis allows models to be combined and adapted, thus not requiring instances of all possible combinations of contexts.

6. ACKNOWLEDGMENTS

This work was partly supported by the MEXT e-Society project. This work was also partly supported by the US National Science Foundation under grant number 0415021, "SPICE: Speech Processing Interactive Creation and Evaluation Toolkit for new Languages." Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

7. REFERENCES

[1] J. Olive, A. Greenwood, and J. Coleman, Acoustics of American English Speech: A Dynamic Approach, Springer Verlag, 1993.
[2] A. Hunt and A. Black, "Unit selection in a concatenative speech synthesis system using a large speech database," in ICASSP, 1996, pp. 373–376.
[3] Y. Sagisaka, N. Kaiki, N. Iwahashi, and K. Mimura, "ATR ν-TALK speech synthesis system," in ICSLP, 1992, pp. 483–486.
[4] A. Black and K. Lenzo, "Limited domain synthesis," in ICSLP, 2000, pp. 411–414.
[5] E. Eide, A. Aaron, R. Bakis, W. Hamza, M. Picheny, and J. Pitrelli, "A corpus-based approach to <AHEM/> expressive speech synthesis," in ISCA SSW5, 2004.
[6] C. Bennett, "Large scale evaluation of corpus-based synthesizers: Results and lessons from the Blizzard Challenge 2005," in Interspeech, 2005, pp. 105–108.
[7] C. Bennett and A. Black, "Blizzard Challenge 2006," in Blizzard Challenge Workshop, 2006.
[8] H. Kawahara, I. Masuda-Katsuse, and A. de Cheveigné, "Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: possible role of a repetitive structure in sounds," Speech Communication, vol. 27, pp. 187–207, 1999.
[9] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis," in Eurospeech, 1999, pp. 2347–2350.
[10] K. Tokuda, H. Zen, J. Yamagishi, T. Masuko, S. Sako, A. W. Black, and T. Nose, "The HMM-based speech synthesis system (HTS)," http://hts.ics.nitech.ac.jp/.
[11] R. Donovan and P. Woodland, "Improvements in an HMM-based speech synthesiser," in Eurospeech, 1995, pp. 573–576.
[12] A. Black and P. Taylor, "Automatically clustering similar units for unit selection in speech synthesis," in Eurospeech, 1997, pp. 601–604.
[13] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai, "An adaptive algorithm for mel-cepstral analysis of speech," in ICASSP, 1992, pp. 137–140.
[14] K. Tokuda, T. Masuko, N. Miyazaki, and T. Kobayashi, "Multi-space probability distribution HMM," IEICE Trans. Inf. & Syst., vol. E85-D, no. 3, pp. 455–464, 2002.
[15] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Duration modeling for HMM-based speech synthesis," in ICSLP, 1998, pp. 29–32.
[16] K. Tokuda, T. Yoshimura, T. Masuko, T. Kobayashi, and T. Kitamura, "Speech parameter generation algorithms for HMM-based speech synthesis," in ICASSP, 2000, pp. 1315–1318.
[17] S. Imai, "Cepstral analysis synthesis on the mel frequency scale," in ICASSP, 1983, pp. 93–96.
[18] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Mixed excitation for HMM-based speech synthesis," in Eurospeech, 2001, pp. 2263–2266.
[19] O. Abdel-Hamid, S. Abdou, and M. Rashwan, "Improving Arabic HMM based speech synthesis quality," in Interspeech, 2006, pp. 1332–1335.
[20] C. Hemptinne, Integration of the harmonic plus noise model into the hidden Markov model-based speech synthesis system, Master thesis, IDIAP Research Institute, 2006.
[21] S.-J. Kim and M.-S. Hahn, "Two-band excitation for HMM-based speech synthesis," IEICE Trans. Inf. & Syst., vol. E90-D, no. 1, pp. 378–381, 2007.
[22] Z.-H. Ling, Y.-J. Wu, Y.-P. Wang, L. Qin, and R.-H. Wang, "USTC system for Blizzard Challenge 2006: an improved HMM-based speech synthesis method," in Blizzard Challenge Workshop, 2006.
[23] H. Zen, T. Toda, and K. Tokuda, "The Nitech-NAIST HMM-based speech synthesis system for the Blizzard Challenge 2006," in Blizzard Challenge Workshop, 2006.
[24] H. Zen, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Hidden semi-Markov model based speech synthesis," in Interspeech, 2004, pp. 1185–1188.
[25] H. Zen, K. Tokuda, and T. Kitamura, "An introduction of trajectory model into HMM-based speech synthesis," in ISCA SSW5, 2004.
[26] I. Bulyko, M. Ostendorf, and J. Bilmes, "Robust splicing costs and efficient search with BMM models for concatenative speech synthesis," in ICASSP, 2002, pp. 461–464.
[27] J. Dines and S. Sridharan, "Trainable speech synthesis with trended hidden Markov models," in ICASSP, 2001, pp. 833–837.
[28] M. Eichner, M. Wolff, S. Ohnewald, and R. Hoffman, "Speech synthesis using stochastic Markov graphs," in ICASSP, 2001, pp. 829–832.
[29] Y.-J. Wu and R.-H. Wang, "Minimum generation error training for HMM-based speech synthesis," in ICASSP, 2006, pp. 89–92.
[30] Y. Nankaku, H. Zen, K. Tokuda, T. Kitamura, and T. Masuko, "A Bayesian approach to HMM-based speech synthesis," in Tech. Rep. of IEICE, 2003, vol. 103, pp. 19–24.
[31] T. Masuko, K. Tokuda, and T. Kobayashi, "A study on conditional parameter generation from HMM based on maximum likelihood criterion," in Autumn Meeting of ASJ, 2003, pp. 209–210.
[32] T. Toda and K. Tokuda, "Speech parameter generation algorithm considering global variance for HMM-based speech synthesis," in Eurospeech, 2005, pp. 2801–2804.
[33] T. Masuko, K. Tokuda, T. Kobayashi, and S. Imai, "Voice characteristics conversion for HMM-based speech synthesis system," in ICASSP, 1997, pp. 1611–1614.
[34] M. Tamura, T. Masuko, K. Tokuda, and T. Kobayashi, "Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR," in ICASSP, 2001, pp. 805–808.
[35] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Speaker interpolation in HMM-based speech synthesis system," in Eurospeech, 1997, pp. 2523–2526.
[36] K. Shichiri, A. Sawabe, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura, "Eigenvoices for HMM-based speech synthesis," in ICSLP, 2002, pp. 1269–1272.
[37] H. Zen, J. Lu, J. Ni, K. Tokuda, and H. Kawai, "HMM-based prosody modeling and synthesis for Japanese and Chinese speech synthesis," Tech. Rep. TR-SLT-0032, ATR-SLT, 2003.
[38] Y. Qian, F. Soong, Y. Chen, and M. Chu, "An HMM-based Mandarin Chinese text-to-speech system," in ISCSLP, 2006.
[39] S.-J. Kim, J.-J. Kim, and M.-S. Hahn, "Implementation and evaluation of an HMM-based Korean speech synthesis system," IEICE Trans. Inf. & Syst., vol. E89-D, pp. 1116–1119, 2006.
[40] K. Tokuda, H. Zen, and A. Black, "An HMM-based speech synthesis system applied to English," in IEEE Speech Synthesis Workshop, 2002.
[41] C. Weiss, R. Maia, K. Tokuda, and W. Hess, "Low resource HMM-based speech synthesis applied to German," in ESSP, 2005.
[42] R. Maia, H. Zen, K. Tokuda, T. Kitamura, and F. Resende Jr., "Towards the development of a Brazilian Portuguese text-to-speech system based on HMM," in Eurospeech, 2003, pp. 2465–2468.
[43] M. Barros, R. Maia, K. Tokuda, D. Freitas, and F. Resende Jr., "HMM-based European Portuguese speech synthesis," in Interspeech, 2005, pp. 2581–2584.
[44] A. Lundgren, An HMM-based text-to-speech system applied to Swedish, Master thesis, Royal Institute of Technology (KTH), 2005.
[45] T. Ojala, Auditory quality evaluation of present Finnish text-to-speech systems, Master thesis, Helsinki University of Technology, 2006.
[46] M. Vainio, A. Suni, and P. Sirjola, "Developing a Finnish concept-to-speech system," in 2nd Baltic Conference on HLT, 2005, pp. 201–206.
[47] B. Vesnicer and F. Mihelic, "Evaluation of the Slovenian HMM-based speech synthesis system," in TSD, 2004, pp. 513–520.
[48] S. Martincic-Ipsic and I. Ipsic, "Croatian HMM-based speech synthesis," Journal of Computing and Information Technology, vol. 14, no. 4, pp. 307–313, 2006.
[49] M. Homayounpour and S. Mehdi, "Farsi speech synthesis using hidden Markov model and decision trees," The CSI Journal on Computer Science and Engineering, vol. 2, no. 1&3 (a), 2004.
[50] J. Latorre, K. Iwano, and S. Furui, "Polyglot synthesis using a mixture of monolingual corpora," in ICASSP, 2005, vol. 1, pp. 1–4.
[51] J. Yamagishi, K. Onishi, T. Masuko, and T. Kobayashi, "Modeling of various speaking styles and emotions for HMM-based speech synthesis," in Interspeech, 2003, pp. 2461–2464.
[52] J. Yamagishi, Average-Voice-Based Speech Synthesis, Ph.D. thesis, Tokyo Institute of Technology, 2006.
[53] M. Tachibana, J. Yamagishi, T. Masuko, and T. Kobayashi, "A style adaptation technique for speech synthesis using HSMM and suprasegmental features," IEICE Trans. Inf. & Syst., vol. E89-D, no. 3, pp. 1092–1099, 2006.
[54] H. Zen, T. Toda, M. Nakamura, and K. Tokuda, "Details of the Nitech HMM-based speech synthesis system for the Blizzard Challenge 2005," IEICE Trans. Inf. & Syst., vol. E90-D, no. 1, pp. 325–333, 2007.
[55] H. Kawai, T. Toda, J. Ni, M. Tsuzaki, and K. Tokuda, "XIMERA: A new TTS from ATR based on corpus-based technologies," in ISCA SSW5, 2004.
[56] S. Rouibia and O. Rosec, "Unit selection for speech synthesis based on a new acoustic target cost," in Interspeech, 2005, pp. 2565–2568.
[57] T. Hirai and S. Tenpaku, "Using 5 ms segments in concatenative speech synthesis," in ISCA SSW5, 2004.
[58] J.-H. Yang, Z.-W. Zhao, Y. Jiang, G.-P. Hu, and X.-R. Wu, "Multi-tier non-uniform unit selection for corpus-based speech synthesis," in Blizzard Challenge Workshop, 2006.
[59] N. Mizutani, K. Tokuda, and T. Kitamura, "Concatenative speech synthesis based on HMM," in Autumn Meeting of ASJ, 2002, pp. 241–242.
[60] Z. Ling and R. Wang, "HMM-based unit selection using frame sized speech segments," in Interspeech, 2006, pp. 2034–2037.
[61] J. Kominek and A. Black, "The Blizzard Challenge 2006 CMU entry introducing hybrid trajectory-selection synthesis," in Blizzard Challenge Workshop, 2006.
[62] M. Plumpe, A. Acero, H. Hon, and X. Huang, "HMM-based smoothing for concatenative speech synthesis," in ICSLP, 1998, pp. 2751–2754.
[63] J. Wouters and M. Macon, "Unit fusion for concatenative speech synthesis," in ICSLP, 2000, pp. 302–305.
[64] P. Taylor, "Unifying unit selection and hidden Markov model speech synthesis," in Interspeech, 2006, pp. 1758–1761.