DONG 2010 Acta Automatica Sinica
ZHOU Tao1
DONG Cheng-Yu2
WANG Hai-La2
Abstract Prosodic structure generation is the key component in improving the intelligibility and naturalness of synthetic speech in a text-to-speech (TTS) system. This paper investigates the problem of automatic segmentation of prosodic words and prosodic phrases, the two fundamental layers in the hierarchical prosodic structure of Mandarin, and presents a two-stage prosodic structure generation strategy. Conditional random fields (CRF) models are built for both prosodic word and prosodic phrase prediction at the front end, with different feature selections. In addition, a transformation-based error-driven learning (TBL) modification module is introduced at the back end to amend the initial prediction. Experiment results show that the approach combining CRF and TBL achieves an F-score of 94.66%.
Key words Text-to-speech (TTS), prosodic structure generation, conditional random fields (CRF), transformation-based error-driven learning (TBL)
Fig. 1  Prosodic structure

2.1  Feature selection

Table 1  Types of features

Feature
LastType
Word
WLen
POS
Word&&POS
WLen&&POS
Besides, two more types, DistPre and DistNxt, are specially designed for the prosodic phrase model. To some extent, the insertion of prosodic phrase boundaries in natural spoken language serves to balance the lengths of the constituents in the output. Hence, it is not surprising that most PP breaks occur in the middle part of a long sentence, and that a prosodic phrase is usually 5 to 7 syllables long, rarely shorter than 3 or longer than 9 syllables. For this reason, we took length measures into consideration by including DistPre and DistNxt in our templates for prosodic phrase prediction; they denote the distances from the current boundary candidate to the nearest previous and next PP locations, respectively.
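The paper does not give an implementation of DistPre and DistNxt, but the two measures can be sketched as follows. This is a minimal illustration under our own assumptions: distances are counted in syllables, the sentence edges are treated as PP boundaries, and `pp_boundaries` holds the already-known PP break positions (e.g. from punctuation or earlier predictions); all names are hypothetical.

```python
def dist_features(word_syllables, pp_boundaries):
    """For each inter-word boundary candidate i (the position after word i),
    return (DistPre, DistNxt): distances in syllables from that candidate to
    the nearest previous and next prosodic phrase (PP) boundary.
    Sentence start and end count as PP boundaries."""
    n = len(word_syllables)
    # cumulative syllable offset at each boundary position (0 .. n)
    offsets = [0]
    for s in word_syllables:
        offsets.append(offsets[-1] + s)
    # fixed PP break positions, plus the sentence edges
    pp = sorted(set(pp_boundaries) | {0, n})
    feats = []
    for i in range(1, n):                        # candidate boundaries
        prev_pp = max(b for b in pp if b <= i)   # nearest PP at or before i
        next_pp = min(b for b in pp if b >= i)   # nearest PP at or after i
        feats.append((offsets[i] - offsets[prev_pp],
                      offsets[next_pp] - offsets[i]))
    return feats
```

For a four-word sentence with syllable counts [2, 1, 3, 2] and a known PP break after the second word, the candidate after word 1 is 2 syllables past the sentence start and 1 syllable before the PP break.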
For some feature types, we extend the range from the two previous words to the two following words. Taking the POS feature as an example, five features are designed, as shown in Table 2.
Moreover, certain combinations of two feature types can capture information specific to prosodic boundaries. Some features are generated in this way, as shown in Table 3.
Table 2

Feature   Meaning
POS-2     POS of the second previous word
POS-1     POS of the previous word
POS       POS of the current word
POS+1     POS of the next word
POS+2     POS of the second next word
Table 3

Feature            Meaning
Word-2&&POS-2      Word and POS of the second previous word
Word-1&&POS-1      Word and POS of the previous word
Word0&&POS0        Word and POS of the current word
Word+1&&POS+1      Word and POS of the next word
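The windowed and combined features of Tables 2 and 3 amount to a small sliding-window extractor. The sketch below is our own illustration, not the authors' code; the padding token and dictionary key names are assumptions.

```python
def pos_window_features(words, tags, i, window=2):
    """Windowed POS features (Table 2) and word&&POS combinations (Table 3)
    for the token at position i. Out-of-range offsets yield a padding token."""
    feats = {}
    # Table 2: POS at offsets -2 .. +2
    for off in range(-window, window + 1):
        j = i + off
        tag = tags[j] if 0 <= j < len(tags) else "<PAD>"
        feats[f"POS{off:+d}" if off else "POS"] = tag
    # Table 3: word&&POS combinations at offsets -2 .. +1
    for off in (-2, -1, 0, 1):
        j = i + off
        if 0 <= j < len(words):
            feats[f"Word{off:+d}&&POS{off:+d}"] = f"{words[j]}&&{tags[j]}"
    return feats
```

Such dictionaries map directly onto CRF feature templates: each key/value pair becomes one binary feature of the observation at position i.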
Transformation-based error-driven learning modification module
This paper applies TBL to the prosodic structure prediction of a Mandarin TTS system.
According to the error statistics of the CRF model's performance, this paper selects 20 Chinese characters for further disambiguation by TBL learning. These 20 Chinese characters are the ones on which mistakes occur most frequently in the test corpus, such as , , and .
For each character, we prepare two corpora, one for training and another for testing. The two corpora are extracted from the annotated People's Daily Corpus 2000, which was annotated manually by the Institute of Computational Linguistics.
The whole TBL modification module contains two processes: the training process and the testing process.
The training process accepts the annotated corpus. The corpus is first processed by the corpus preprocessor component, which extracts the features used for prosodic structure prediction. The second component is the TBL trainer, which learns a set of rules by the transformation-based error-driven learning algorithm. This component also takes the prepared rule template as input. The raw rule template is stored in a file and parsed by the template parser component.
The testing process applies the rules generated by the TBL trainer to prosodic structure prediction. The features used for prosodic structure prediction are word, part-of-speech, character, prosodic word boundary, and prosodic phrase boundary. In a real application, this information can be obtained from the word segmenter, the part-of-speech tagger, and the pinyin-lookup module. In our package, the testing corpus is extracted from the annotated corpus and first processed by the corpus preprocessor component, which extracts the prediction features from the annotated corpus. Then, we apply all the rules learned by the TBL training system to prosodic structure prediction.

Fig. 2  Framework of TBL

3.3
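The greedy error-driven loop at the heart of the TBL trainer can be sketched as follows. This is a minimal illustration of the general TBL algorithm, not the authors' implementation: the rule representation is simplified to a (condition, from-tag, to-tag) triple, and all names are our own.

```python
def tbl_train(tokens, gold, predicted, candidate_rules, min_gain=1):
    """Minimal transformation-based error-driven learning sketch.
    Each candidate rule is (condition, from_tag, to_tag), where
    condition(tokens, i) -> bool. At each pass, pick the rule with the
    largest net error reduction on the current tags, apply it, and repeat
    until no rule reduces the error by at least min_gain."""
    tags = list(predicted)
    learned = []
    while True:
        best, best_gain = None, min_gain - 1
        for cond, a, b in candidate_rules:
            gain = 0
            for i, t in enumerate(tags):
                if t == a and cond(tokens, i):
                    # +1 if the change fixes an error, -1 if it breaks a correct tag
                    gain += 1 if gold[i] == b else (-1 if gold[i] == a else 0)
            if gain > best_gain:
                best, best_gain = (cond, a, b), gain
        if best is None or best_gain < min_gain:
            break
        cond, a, b = best
        tags = [b if t == a and cond(tokens, i) else t
                for i, t in enumerate(tags)]
        learned.append(best)
    return learned, tags
```

Because every accepted rule strictly reduces the total error count, the loop always terminates; at test time the learned rules are simply replayed in order over the CRF output.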
Table 5

Rule set (F-score)   Type      Rule set (F-score)   Type
R0 (0.921 775)       T0        R9 (0.910 721)       T4+T0
R1 (0.869 216)       T1                             T5+T0
R2 (0.865 277)       T2                             T1, T0
R3 (0.868 538)       T3                             T2, T0
R4 (0.862 322)       T4                             T3, T0
R5 (0.859 158)       T5                             T4, T0
R6 (0.908 081)       T1+T0                          T5, T0
R7 (0.910 238)       T2+T0                          T1+T0, T0
R8 (0.907 211)       T3+T0                          T2+T0, T0
Atom templates

Type   Atom              Contents
T0     POS itself        POS_0
T1     Char              Char_n (n = -3, -2, -1, 1, 2, 3); Char_n && Char_{n+1}
T2     Word              Word_n; Word_n && Word_{n+1}
T3     Prosodic word     PW_n; PW_n && PW_{n+1}
T4     Prosodic phrase   PP_n; PP_n && PP_{n+1}; PP_n && PW_{n+1} && PW_{n+2}
T5     Context POS       POS_n; POS_n && POS_{n+1}
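To make the template-to-rule expansion concrete, here is a hedged sketch of how an atom template such as POS_{-1} && Word_0 could be instantiated into concrete candidate rules: every feature-value combination observed in the corpus is paired with every from-tag/to-tag change. The data layout and names are assumptions, not the paper's implementation.

```python
from itertools import product

def instantiate_rules(template, corpus, tags=("PW", "PP", "LW")):
    """Expand an atom template (a list of (feature_name, offset) pairs) into
    concrete TBL candidate rules. corpus is a list of sentences; each sentence
    is a list of per-token feature dicts. A rule is (condition_key, a, b):
    change tag a to tag b wherever the condition values match."""
    seen = set()
    for sent in corpus:
        for i in range(len(sent)):
            key, ok = [], True
            for feat, off in template:
                j = i + off
                if not 0 <= j < len(sent):
                    ok = False
                    break
                key.append((feat, off, sent[j][feat]))
            if ok:
                seen.add(tuple(key))
    rules = []
    for key in seen:
        for a, b in product(tags, repeat=2):
            if a != b:
                rules.append((key, a, b))
    return rules
```

With three boundary tags, each observed context yields 3 × 2 = 6 candidate tag transformations, which the TBL trainer then scores against the gold annotation.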
3.4

Rules take the following form:

POS(Y, 0) & LW(Y, 1) : A → B

where Y indicates the feature value, and 1 indicates the relative position to the character (we constrain the offset of each feature within the range of (-3, 3)). A and B indicate the initial prosodic boundary and the transformed prosodic boundary, respectively. ":" separates the condition from the prosodic boundary tags, "→" separates the current tag (left) from the correct tag (right), and "&" separates the sub-conditions. Each sub-condition comprises two parts: a function name and its parameter list. All the items are composed of five features: POS, LC, LW, PW, and PP. Through TBL learning, we can generate the corresponding rule from the template whose items match the context of the prosodic structure boundary.

4  Experiments

Precision (P), recall (R), and the F-score

F = 2PR / (P + R)

are used as evaluation metrics.

4.3  Experiment results
Table 6  Experiment performances

Method           Precision (%)   Recall (%)   F-score (%)
ME (PW)          87.95           86.16        87.05
CRF (PW)         90.21           92.94        90.67
CRF&TBL (PW)     93.77           95.56        94.66
ME (PP)          71.77           77.24        74.40
CRF (PP)         78.61           81.55        80.05
CRF&TBL (PP)     83.91           85.33        84.61
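The reported F-scores follow directly from the precision and recall columns; as a quick sanity check, the harmonic-mean formula above reproduces two of the table's entries.

```python
def f_score(p, r):
    """F-score as defined in the experiments section:
    the harmonic mean of precision and recall, F = 2PR / (P + R)."""
    return 2 * p * r / (p + r)

# Reproducing reported values from Table 6 (in %):
print(round(f_score(87.95, 86.16), 2))   # ME (PW)      -> 87.05
print(round(f_score(93.77, 95.56), 2))   # CRF&TBL (PW) -> 94.66
```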
The ME-based and CRF-based methods use the same training and testing corpora.
As shown in Table 6, compared with the ME-based method, the two-stage CRF & TBL prosodic structure prediction strategy improves the F-score by 7.61% (PW). This shows that the two-stage strategy is much more effective than the ME-based method for prosodic boundary prediction. Moreover, compared with the plain CRF-based method, the method combining CRF and TBL also performs better.
Still, some inflexible errors occur in our test of prosodic boundary prediction. For example, for the sentence , the result of the two-stage strategy is /vq/PW /n/PP /vi/LW /vi/LW, while the standard answer is /vq/PW /n/PW /vi/LW /vi/LW.
The strategy predicts a PP boundary after the word , whereas the correct answer is a PW boundary. Two reasons can explain this case. First, the word has a length of three characters and is a verb; the strategy probably treats it as a verb-object (V-O) construction and therefore predicts a PP boundary. Second, the word is quite new relative to the People's Daily 2000 corpus. The word rarely or never appears in the corpus, so the strategy cannot obtain enough information either in CRF model training or in TBL rule learning. Consequently, the strategy makes the wrong prosodic boundary prediction.
Given the two reasons above, future work will include crafting specific rules to fix such boundary errors and updating the corpus in a timely manner.
Due to differences in the corpora and evaluation metrics, these results may not be comparable in all respects. Still, the statistics above suggest that the approach presented in this paper is a successful attempt at prosodic boundary prediction.
Conclusion

This paper makes an extensive investigation into Mandarin prosodic structure prediction for TTS systems. Based on the analysis of a large-scale corpus, a two-stage prosodic structure prediction strategy is proposed: a conditional random fields model produces the initial prediction, and a transformation-based error-driven learning modification module at the back end of the system amends the errors from the CRF prediction.
Experiment results indicate that this approach is able to achieve good performance with high precision, recall, and F-score. Moreover, the approach of the two-stage predic-
DONG Yuan  Received his Ph.D. degree in telecommunication from Shanghai Jiao Tong University in 1999. From 1999 to 2001, he was with Nokia Research Center as a research and development scientist, working on voice recognition for Nokia mobile phones. From 2001 to 2003, he was a postdoctoral research staff member at the Engineering Department, Cambridge University, UK, working on the European speech recognition project CORETEX. Since 2003, he has been an associate professor at Beijing University of Posts and Telecommunications. He is also a senior research consultant at France Telecom R&D Beijing. His research interest covers speaker recognition, speech synthesis, speech recognition, and multimedia content indexing.
E-mail: [email protected]
ZHOU Tao  Received his master's degree in signals and information processing from Beijing University of Posts and Telecommunications in 2010. His research interest covers text-to-speech synthesis and natural language processing. Corresponding author of this paper.
E-mail: [email protected]
WANG Hai-La  Received his Ph.D. degree from the University of Paris, France, in 1986. He then worked at France Telecom R&D for six years. From 1998 to 2004, he was with several France Telecom departments. Since 2004, he has been with France Telecom R&D Beijing as chief technology officer. His research interest covers value-added multimedia services.
E-mail: [email protected]