0% found this document useful (0 votes)

28 views11 pages

Zero Shot

Uploaded by

allfanacc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

28 views11 pages

Zero Shot

Uploaded by

allfanacc

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 11

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with

Unsupervised Text Pretraining

Takaaki Saeki1 , Soumi Maiti2 , Xinjian Li2 , Shinji Watanabe2 ,
Shinnosuke Takamichi1 and Hiroshi Saruwatari1
1
The University of Tokyo, Japan
2
Carnegie Mellon University, USA
takaaki [email protected], {smaiti, swatanab}@andrew.cmu.edu
arXiv:2301.12596v3 [eess.AS] 27 May 2023

Abstract Synthesis

While neural text-to-speech (TTS) has achieved

human-like natural synthetic speech, multilingual Multilingual TTS
TTS systems are limited to resource-rich languages
due to the need for paired text and studio-quality Paired data for Paired data for
Text data for
language C
audio data. This paper proposes a method for zero- language A language B

shot multilingual TTS using text-only data for the Training

target language. The use of text-only data allows
the development of TTS systems for low-resource Figure 1: Our concept. We aim to build TTS model on languages for
languages for which only textual resources are which only text data is available, to support low-resource languages.
available, making TTS accessible to thousands of
languages. Inspired by the strong cross-lingual
transferability of multilingual language models, our face the challenge of data collection, when audio recordings
framework first performs masked language model for these languages are hard to obtain. In this study, we fo-
pretraining with multilingual text-only data. Then cus on the use of a text-only data for multilingual TTS as
we train this model with a paired data in a super- shown in Fig. 1. Previous research [Wu and Dredze, 2019;
vised manner, while freezing a language-aware em- Pires et al., 2019] has shown the strong cross-lingual trans-
bedding layer. This allows inference even for lan- ferability of multilingual language models such as multilin-
guages not included in the paired data but present in gual BERT [Devlin et al., 2019] in natural language pro-
the text-only data. Evaluation results demonstrate cessing (NLP) tasks. By leveraging multilingual pretrain-
highly intelligible zero-shot TTS with a character ing, the model can generalize to other languages, even if
error rate of less than 12% for an unseen language. it has never seen the target data in those languages. Our
work applies the framework of multilingual masked language
model (MLM) pretraining to TTS, with the goal of achieving
1 Introduction zero-shot cross-lingual transfer of pronunciation and prosody.
Recent advances in end-to-end neural text-to-speech synthe- Zero-shot TTS using text data enables the development of
sis (TTS) [Li et al., 2019b; Kim et al., 2021] have yielded TTS systems for languages where only textual resources are
significant improvements in naturalness and speech qual- available, which potentially opens up TTS to thousands of
ity. However, the data-intensive nature and the requirement languages [Ebrahimi and Kann, 2021; Li et al., 2022].
of paired text and studio-quality audio data have limited In this paper, we propose a multilingual TTS framework
multilingual TTS systems to resource-rich languages, which that leverages unsupervised text pretraining. Fig. 2 illus-
are small portions of the more than 6,000 languages in the trates the proposed framework. We use a typical end-to-end
world [Jr, 2005]. To address the limitation, current research in TTS architecture consisting of token embedding, encoder,
multilingual TTS aims not only to exploit resource-rich lan- and decoder. Our model also has a language-aware embed-
guages [Zen et al., 2012; Li and Zen, 2016] but also to build ding layer, which includes the token embedding layer, a lan-
models for low-resource languages [Prakash et al., 2019]. guage embedding layer, and a bottleneck layer. As shown
Previous work has addressed low-resource TTS by using in Fig. 2(a), we first pretrain the language-aware embedding
untranscribed speech data with vector-quantized variational layer and the encoder of the TTS model with multilingual text
autoencoder (VQ-VAE) [Zhang and Lin, 2020] or automatic data. We then fine-tune the encoder and decoder of the TTS
speech recognition (ASR) models [Ni et al., 2022]. Another model with paired data, while the language-aware embedding
study [Saeki et al., 2022b] has built a massively multilin- layer is frozen, as illustrated in Fig. 2(b). This allows zero-
gual TTS model jointly using paired TTS, paired ASR, un- shot TTS for a language not included in the paired data but
paired speech, and unpaired text data. However, these ap- present in the text data, as shown on the right in Fig. 2(c).
proaches still rely on speech data for the target languages and Our contributions are as follows. 1) We propose a zero-shot
We define Dtext as the text dataset. Let Ltext denote the set
of language IDs included in Dtext . First, the masked token
Decoder
sequence X m and a language ID ltext ∈ Ltext are fed to the
model. Let the token embedding sequence and language em-
Encoder Encoder bedding be Z m = (z m d
n ∈ R |n = 1, · · · , N ) and el ∈ R ,
d

es de respectively. The embedding layers output Z m and el as:

Language-aware ru hu Language-aware de ru
embedding layer en sk embedding layer hu ...
ro ... Freeze Z m = Embed(X m ; θT ), el = Embed(ltext ; θL ), (1)
<Mask>
Bytes or IPA Text Bytes or IPA Text Text Text where θT and θL denote the model parameters of the to-
ken embedding and language embedding layers, respectively.
Then the token and language embeddings obtained in Eq. (1)
(a) Unsupervised multilingual text pretraining (b) Supervised learning with paired data
are added and fed to a bottleneck layer to project them into a
hidden input vector. Let Hin = (hin,n ∈ Rd |n = 1, · · · , N )
and Hout = (hout,n ∈ Rd |n = 1, · · · , N ) denote hidden
Decoder Decoder vectors in the encoder input and output, respectively. Then
the conditional probability p(X|X−Π ) is computed as:
Encoder Encoder

Language-aware de ru
Hin = Bottleneck(Z m + el ; θB ), (2)
Language-aware es
embedding layer hu ... embedding layer Hout = Encoder(Hin ; θE ), (3)
Bytes or IPA Bytes or IPA p(X|X−Π ) = Softmax(PredictionNet(Hout ; θP )), (4)
Multilingual TTS for seen languages Zero-shot TTS for unseen language
where θB , θE , θP denote the model parameters of the bot-
(c) Inference
tleneck layer, the encoder and a prediction network, respec-
tively. In Eq. (4), Softmax(·) denotes a softmax function. We
Figure 2: Proposed framework. (a) We perform MLM pretraining on
multilingual text data and then (b) train TTS model on paired data
define the network with the model parameters {θB , θT , θL }
with frozen language-aware embedding layer. (c) Zero-shot TTS is as language-aware embedding layer, which jointly embeds
performed with language IDs that are not included in paired data. the token sequence X and the language ID ltext as in Eq. (1)
and (2). Let Π = (πk ∈ N|k = 1, · · · , K) be the indexes
of the masked tokens of length K. With the probability com-
multilingual TTS framework that achieves highly intelligible puted in Eq. (4), the training objective can be defined as:
TTS for an unseen language, resulting in a character error rate
K
of less than 12%. 2) Our method also improves TTS for seen 1 X
languages, resulting in byte-based models without grapheme- Lmlm = log p(xπk |X m ),
K
to-phoneme (G2P) modules that outperform the phoneme- k=1 (5)
based baselines. 3) Our ablation studies provide additional {θ̂E , θ̂B , θ̂T , θ̂L } = arg min Lmlm .
insights, including the effectiveness of the frozen language- θE ,θB ,θT ,θL
aware embedding layer. The experiments were conducted on
We use UTF-8 bytes or International Phonetic Alphabet
public datasets and the implementation is available1 . We en-
(IPA) symbols for the input token sequence X. For each to-
courage readers to listen to our audio samples2 .
ken type, the vocabulary V is constructed from Dtext , which
includes a start/end of sentence token ([SOS/EOS]). We ex-
2 Method tracted International IPA sequences using an open-source
Our model has a typical neural TTS model architecture con- toolkit3 . To obtain the masked token X m , we use the same
sisting of token embedding, encoder, and decoder. First, we masking ratio and category as in the original BERT pre-
use MLM pretraining with multilingual text data to learn training [Devlin et al., 2019] for each token type. Randomly,
cross-lingual representations. Then we perform supervised 12 % of the tokens are replaced with the [MASK] token, and
learning with paired data to learn the mapping from linguis- 1.5 % of them are replaced with random tokens. Also, 1.5 %
tic features to speech features. The model performs inference of the tokens are left unchanged and Lmlm is computed as in
even for languages that are not present in the paired data. Eq. (5) for those 15 % of tokens that have indices Π.

2.1 Unsupervised Multilingual Text Pretraining 2.2 Supervised Learning with Paired Data
Fig. 2(a) illustrates the unsupervised pretraining method. It Fig. 2(b) illustrates the supervised learning of the TTS model
uses multilingual text data consisting of languages that are with paired data. We define the paired data and the set of lan-
not included in the paired data. Let X = (xn ∈ V |n = guage IDs as Dpaired and Lpaired , respectively. Note that we
1, · · · , N ) denote the input text token sequence of length N , assume Lpaired ⊂ Ltext . Let Y = (y t ∈ RD |t = 1, · · · , T )
where V denotes a vocabulary constructed for pretraining. denote the speech feature sequence with the length of T .
We first initialize the model parameters {θE , θB , θT , θL } with
1
https://github.com/Takaaki-Saeki/zm-text-tts
2 3
https://takaaki-saeki.github.io/zm-tts-text demo https://github.com/espeak-ng/espeak-ng
those obtained in the pretraining described in § 2.1. Let θD Our work attempts to only use the linguistic knowledge to im-
denote the model parameter of the decoder. The speech fea- prove the zero-shot TTS. Thus, the inference process is writ-
tures are predicted with teacher forcing as: ten as L′unseen ∩ Lpaired = ∅ and L′unseen ⊂ Ltext . In the
evaluation, we denote the inference with Lunseen and L′unseen
Hout = Encoder(Bottleneck(Z + el )), (6) as Fully zero-shot TTS and Text-seen zero-shot TTS, respec-
Ŷ = Decoder(Hout , Y ; θD ), (7) tively. Fully zero-shot TTS performs zero-shot TTS without
pretraining as in the IPA-based previous method [Staib et al.,
where Z is the unmasked token embedding sequence. Note 2020], which is the baseline method in our evaluations.
that the unmasked token sequence is used in Eq. (6), while
the masked token sequence is used in Eq. (2) Let Ltts (Ŷ , Y ) 2.4 Model Architecture
denote the training objective of the TTS model. Then we Our model is an autoregressive TTS model based on Trans-
consider two types of schemes. former TTS [Li et al., 2019b], which has also been used in
Updating language-aware embedding layer We only the previous work on byte-based multilingual TTS [He et al.,
freeze the parameter of the language embedding layer θL 2021]. During the supervised learning described in § 2.2 and
while updating the rest of the parameters. Therefore the train- inference described in § 2, we use x-vector [Snyder et al.,
able model parameters can be written as 2018] for the speaker embedding and add it to the encoder
output through a projection layer. During supervised learn-
{θ̂D , θ̂E , θ̂B , θ̂T } = arg min Ltts (Ŷ , Y ). (8) ing, we use the average x-vectors computed from the training
θD ,θE ,θB ,θT data. For evaluation purposes, we perform zero-shot synthe-
Previous work has confirmed that multilingual BERT has sis with the average x-vector from the test data of the target
high cross-lingual transferability for various NLP tasks [Wu language and feed it to the model. Note that we also conduct
and Dredze, 2019]. This scheme corresponds to a simple fine- the evaluation with x-vectors from seen languages.
tuning of BERT [Wu and Dredze, 2019], which updates all For the bottleneck layer with θB , we use a residual network
the parameters during training for the downstream tasks4 . consisting of Layer Normalization [Ba et al., 2016], down
projection, ReLU [Nair and Hinton, 2010], and up projection
Freezing language-aware embedding layer We freeze the with the residual connection, which is used in previous work
bottleneck layer and the token embedding layer along with on language adaptation [Bapna and Firat, 2019].
the language embedding, updating the encoder and decoder.
The training process can be written as
3 Experimental Evaluations
{θ̂D , θ̂E } = arg min Ltts (Ŷ , Y ). (9) 3.1 Experimental Setting
θD ,θE
Dataset
In contrast to the scheme represented in Eq. (8), the scheme in
Eq. (9) preserves the parameters of the language-aware em- We carried out all the evaluations with publicly available
bedding layer to facilitate cross-lingual transfer. In the evalu- datasets. Table 1 shows the sizes of the data for each lan-
ation, we use the scheme formulated in Eq. (9), except for the guage. For the unsupervised text pretraining described in
ablation study in § 3.4. § 2.1, we used transcripts from VoxPopuli [Wang et al.,
2021], M-AILABS [Munich Artificial Intelligence Labora-
2.3 Inference tories GmbH, 2017], and CSS10 [Park and Mulc, 2019], re-
sulting in a total of about 2.8 GB of spoken text across 19
Let Lsyn denote the set of language IDs used for inference.
languages. We used CSS10 for the supervised learning de-
The text token sequence X and the language ID lsyn ∈ Lsyn
scribed in § 2.2, and we selected seven European languages
are fed to the model as in Eq. (1), and the encoder output is
as the seen languages, with Spanish as the unseen language.
predicted as in Eq. (6). Unlike Eq. (7), the speech features are
The paired data consisted of one speaker per language. It
predicted as:
should be noted that Spanish is not actually a low-resource
Ŷ = Decoder(Hout ; θD ). (10) language, but we chose to use it for evaluation purposes in
The output waveform is obtained by feeding the predicted order to 1) compare our zero-shot TTS methods with the ora-
features Ŷ to a pretrained neural vocoder. cle methods using the paired data for the target language and
Fig. 2(c) illustrates the inference process. The left and right 2) ensure a sufficient number of evaluators for the subjective
sides of the figure show the typical multilingual TTS and our evaluation. We used 5 and 100 utterances as dev and test sets,
zero-shot TTS. Previous work [Li et al., 2019a] has typically respectively, with the remaining data used for training.
assumed seen languages, and the inference is performed with
Training Details
the language IDs Lseen ⊂ Lpaired . However, it is challenging
to perform TTS for unseen languages Lunseen ∩ Lpaired = ∅. The sampling rate was set to 16 kHz. An 80-dimension
While other work [Saeki et al., 2022b] has built a massively of mel filter bank, 1024 samples of FFT length, and 256
multilingual TTS model that even achieves zero-shot TTS samples of frame shit were used for speech analysis. For
from ASR data, it uses paired data for the target languages. the pretraining described in § 2.1, we trained the model for
1.2M iterations using the Noam optimizer [Vaswani et al.,
4
We freeze the language embedding layer to address the mis- 2017] with the learning rate and warm-up step set to 1.0
match between language embedding of seen and unseen languages. and 10000, respectively. For the TTS model described in
Languages Code Text-only data
Paired data (LIDs). Multilingual w/ LIDs: We trained a multilingual
Text Audio
Transformer TTS model with the paired data of the unseen
Seen languages for evaluation Lseen language. It also used the language IDs.
German de 359MB 0.73MB 16.13h
French fr 372MB 0.94MB 19.15h Unseen language We compared Fully zero-shot TTS and
Dutch nl 336MB 0.75MB 14.10h Text-seen zero-shot TTS defined in § 2.3. In Oracle, we used
Finnish fi 308MB 0.47MB 21.36h
Hungarian hu 104MB 0.51MB 10.53h the Monolingual and Multilingual w/ LIDs, which used the
Russian ru 4.9MB 1.5MB 10.00h paired data of the unseen language. In Fully zero-shot TTS,
Greek el 0.39MB 0.39MB 4.13h we used Multilingual w/o LIDs to synthesize speech from text
Unseen language for evaluation Lunseen tokens in the unseen language. This method corresponds to
Spanish es 345MB 0.0MB (1.2MB) 0.00h (23.81h) the conventional multilingual TTS model using bytes [He et
Languages not included in CSS10 al., 2021] or IPA symbols [Staib et al., 2020].
English en 338MB
Estonian et 87MB Evaluation Metrics
Croatian hr 2.0MB To objectively measure the synthetic speech quality, we used
Italian it 334MB
Lithuanian lt 89MB mel cepstral distortion (MCD) [Fukada et al., 1992] with
Polish pl 102MB the mel cepstrum dimension set to 25. We also evaluated
Romanian ro 67MB the intelligibility using CERs computed with a multilingual
Slovak sk 94MB
Slovenian sl 81MB ASR model [Radford et al., 2022]. We used a pretrained
large model that is publicly available6 . To evaluate the nat-
Table 1: Amount of text-only and paired data for each language. uralness, we carried out listening tests to calculate five-scale
Parentheses indicate amount of original data in CSS10. mean opinion scores (MOS) of synthesized speech for each
method. Forty native speakers were recruited through Ama-
zon Mechanical Turk [Paolacci et al., 2010] for each of the
§ 2.4, we used a 6-block Transformer encoder [Vaswani et tests. Furthermore, we leveraged a publicly available auto-
al., 2017] and a 6-block Transformer decoder, with a post- matic MOS (AMOS) prediction model [Saeki et al., 2022a]
net consisting of five convolutional layers with a kernel size to evaluate the naturalness. Note that the model was trained
of five. The attention dimension and the number of atten- on English and Chinese datasets, but previous work [Seki et
tion heads were set to 512 and 8, respectively. For the bot- al., 2022] has reported that it also showed a correlation coef-
tleneck layer described in § 2.4, we set the hidden dimen- ficient higher than 0.8 for another language (Japanese).
sion after the down projection to 256. The PredictionNet in
Eq. (4) consisted of a linear layer, a GELU activation func- 3.2 Evaluation Results on Seen Languages
tion [Hendrycks and Gimpel, 2016], Layer Normalization, We evaluated our framework on the seen languages included
and a linear layer with the hidden dimension of 512. We in the paired data, as defined in § 2.3. Table 2 lists the results
also used guided attention loss [Tachibana et al., 2018] to in MCD and CER. Lower values are better for both metrics.
improve the training efficiency. For the supervised learn- As we can see, the byte-based or IPA-based models with the
ing described in § 2.2, we trained the models for 2.47M it- proposed multilingual pretraining performed the best across
erations (200 epochs). The Noam optimizer was used with all languages and metrics. Among the baselines, byte-based
the warm-up step of 50000. For the neural vocoder, we monolingual and multilingual models tended to have higher
trained HiFi-GAN [Kong et al., 2020] for 2M iterations MCD and CER than IPA-based models, and failed to synthe-
with LibriTTS [Zen et al., 2019], VCTK [Veaux et al., size intelligible speech in some languages. For example, the
2017], and CSS10. For the x-vector described in § 2.4, we baseline byte-based models showed the high CER values for
used a model trained on VoxCeleb1 and VoxCeleb2 [Na- French, which has a deep orthography, meaning that a single
grani et al., 2017] published in SpeechBrain [Ravanelli et character has different pronunciations depending on the con-
al., 2021]. We used ESPnet2-TTS [Watanabe et al., 2018; text. We observed that our method improved the byte-based
Hayashi et al., 2021] for the implementation. models and they outperformed the IPA-based baseline models
Baselines for all the metrics and languages. It is worth noting that the
proposed byte-based models even outperformed the proposed
We developed baseline models without the pretraining.
IPA-based models except for el and ru. These results suggest
Seen language Monolingual: We trained a model for each that our framework is effective in building a TTS model for
language independently. Our preliminary study found that languages without G2P modules.
Transformer TTS was unstable5 and could not synthesize in-
telligible speech in the monolingual condition due to the lack 3.3 Evaluation Results on Unseen Language
of training data. Therefore, we used Tacotron2 [Shen et al., We evaluated our method on zero-shot TTS for the unseen
2018] only for the monolingual models, as in the original pa- language defined in § 2.3. As described in § 2.4, we first
per of the dataset [Park and Mulc, 2019]. Multilingual w/o used the x-vector from the es speaker to compute the MCD.
LIDs: We trained a multilingual Transformer TTS model us- Table 3 lists the results. The baseline models showed the
ing the paired data shown in Table 1 without language IDs CERs of over 40% and MCDs of over 10.0. However, our
5 6
The original paper [Li et al., 2019b] also reports the instability. https://github.com/openai/whisper
de fr ru fi hu nl el
Method
MCD CER MCD CER MCD CER MCD CER MCD CER MCD CER MCD CER
Natural - 2.75 - 4.52 - 2.12 - 4.73 - 4.86 - 6.22 - 7.14
Baseline (Monolingual)
Bytes monolingual 7.70 8.61 11.76 91.82 11.43 >100 8.33 56.03 10.22 93.05 7.49 15.33 10.20 85.98
IPA monolingual 7.38 4.07 8.96 17.86 11.89 25.30 7.23 27.62 7.59 24.62 7.80 19.20 8.16 21.79
Baseline (Multilingual)
Bytes multilingual w/o LIDs 7.68 37.46 8.71 41.35 9.38 45.92 6.26 29.19 6.48 33.82 8.46 46.33 7.64 36.24
Bytes multilingual w/ LIDs 6.51 13.19 10.84 55.79 12.89 >100 6.78 27.22 9.09 42.97 8.47 39.37 7.25 23.56
IPA multilingual w/o LIDs 6.31 10.64 7.44 20.86 8.10 35.32 5.53 19.56 5.59 14.03 7.76 34.49 6.90 19.33
IPA multilingual w/ LIDs 6.16 9.76 6.88 14.97 7.63 23.54 5.17 10.63 5.28 9.11 6.95 19.48 6.90 16.97
Proposed (Unsupervised text pretraining)
Bytes multilingual 5.65 3.79 6.48 7.15 7.38 10.62 4.99 5.28 5.01 6.05 6.52 13.74 6.57 11.75
IPA multilingual 5.88 5.52 6.61 7.72 7.25 15.85 5.18 8.62 5.30 7.37 7.00 14.42 6.53 11.06

Table 2: Evaluation results for seen languages. Bold indicates best scores in baseline and proposed methods.

es
Method es x-vector fr x-vector
MCD CER CER
Natural - 2.71 2.71
Oracle
Bytes monolingual 8.65 10.70 -
IPA monolingual 8.47 5.28 -
IPA multilingual 6.20 5.32 6.99
Baseline (Fully zero-shot TTS)
Bytes multilingual 11.22 64.07 66.45
IPA multilingual 10.75 44.75 44.37
(a) Token embedding 𝑍 (b) Encoder inputs 𝐻!"
Proposed (Text-seen zero-shot TTS)
Bytes multilingual 9.05 18.27 13.74 Figure 3: Visualization of token and language embedding. Pairs of
IPA multilingual 9.44 11.69 13.33 similar languages (es–fr and de–nl) are overlapping in token embed-
ding space, while output of bottleneck layer separates them.
Table 3: Evaluation results for unseen language.
3.4 Ablation Study
proposed text preraining improved the metrics, resulting in To further evaluate our method, we conducted several abla-
CERs of less than half for both byte and IPA-based methods. tion studies. Table 4 lists the results. Bytes multilingual rep-
Also, in contrast to the results for the seen languages, the resents the byte-based proposed method in the evaluation of
IPA-based model outperformed the byte-based one in terms § 3.2 and 3.3. Note that it used the frozen language-aware
of CER. Compared with the oracle case with the paired data embedding layer as formulated in Eq. (9). Some additional
of the unseen language, our proposed zero-shot TTS showed studies of our method are also presented in the Appendix.
higher MCD and CER but achieved only 1% difference in In W/o bottleneck layer, we excluded the bottleneck layer
CER compared to the oracle byte-based monolingual model. and simply added the token and language embedding to ob-
These results demonstrate the effectiveness of our method in tain the encoder input in Eq. (2). We found that removing
achieving intelligible zero-shot TTS for the unseen language. the bottleneck layer led to a performance drop in all the lan-
To investigate the case where the target speaker informa- guages and metrics, with an average increase of 0.53 in MCD
tion is completely unavailable, we also used the x-vector from and 4.16% in CER. The largest increase was observed in the
a seen language. We chose the fr speaker because es and fr unseen language, with an increase of 1.21 in MCD. This sug-
are both categorized as Western Romance in Glottolog [Ham- gests that the bottleneck layer, which projects the token and
marström et al., 2021]. Table 3 lists the results. Note that language embedding into the hidden input text representation
this case does not have the MCD results, since a different with nonlinear dimensionality reduction, is effective in im-
speaker than the ground-truth speech was used. We can see proving the generalization for zero-shot TTS.
that the unsupervised text pretraining also improved the zero- We also evaluated the effect of including language IDs in
shot performance when using the x-vector from the fr speaker. the proposed method by comparing it with a version that ex-
In the proposed byte-based model, the cross-lingual x-vector cluded language IDs, referred to as W/o language ID. It cor-
showed the lower CER. This might result from that the es responds to a simple multilingual BERT pretraining [Wu and
x-vector was not present in the training data whereas the fr Dredze, 2019] that uses only text tokens across different lan-
x-vector was present in the training data. guages. We observed that the use of language IDs led to an
Seen Unseen
Avg.
Method de fr ru fi es
MCD CER MCD CER MCD CER MCD CER MCD CER MCD CER
Bytes multilingual 5.65 3.79 6.48 7.15 7.38 10.62 4.99 5.28 9.05 18.27 6.46 9.58
W/o bottleneck layer 6.06 5.01 7.15 9.09 7.71 28.52 5.33 6.47 10.26 24.01 6.99 13.74
W/o language ID 6.07 5.09 7.09 9.99 7.77 22.58 5.23 6.99 10.45 32.70 6.96 14.06
W/o initializing encoder 5.59 3.75 6.52 9.31 7.12 16.47 4.86 5.03 9.02 21.91 6.42 11.85
Updating language-aware embedding layer 6.05 6.22 6.75 6.93 7.46 11.42 5.16 8.00 9.48 17.21 6.75 10.62

Table 4: Ablation studies on training and model configurations. Bold indicates best metrics on average (Avg.).

4.5 4.5 Natural

4 Baseline (IPA monolingual)
4 Baseline (Bytes multilingual w/o LIDs)
3.5
Baseline (IPA multilingual w/o LIDs)
3 3.5 Baseline (IPA multilingual w/ LIDs)
2.5 Proposed (Bytes multilingual)
3 Proposed (IPA multilingual)
2

1.5 2.5
1
de (MOS) fr (MOS) de (AMOS) fr (AMOS) ru (AMOS) 2
fi (AMOS) hu (AMOS) nl (AMOS) el (AMOS) Avg (AMOS)

Figure 4: MOS and AMOS results for seen languages.

1.5 Error bars in MOS results represent 95% confidence intervals.

1 4.5
de hu de fr Natural
Method
MCD CER MCD CER 4 Oracle (IPA monolingual)
Oracle (IPA multilingual)
Natural - 2.75 - 2.12 3.5 Baseline (IPA multilingual)
Oracle Proposed (Bytes multilingual)
3
IPA monolingual 7.38 4.07 7.59 24.62 Proposed (IPA multilingual)
IPA multilingual 6.16 9.76 5.28 9.11 2.5

Baseline (Fully zero-shot TTS) p-value: 0.011

IPA multilingual 10.31 38.75 9.93 52.62 0.556 0.444

1.5
Proposed (Text-seen zero-shot TTS)
1
Bytes multilingual 10.00 28.01 9.40 50.11 es (MOS) es (AMOS) es (AB test)

Table 5: Analysis on different unseen languages. Figure 5: MOS, AMOS, and AB test results for unseen language.
Error bars in MOS results represent 95% confidence intervals.

average improvement of 0.5 MCD and 4.48% CER, indicat- 3.5 Dependency on Unseen Languages
ing the effectiveness of our approach in using language IDs. We conducted evaluations on the zero-shot TTS for different
In W/o initializing encoder, we did not initialize the en- unseen languages. The eight European languages included in
coder θE before the supervised leaning described in § 2.2. the paired data are composed of Indo-European and Uralic
Instead, we only initialized the parameters θT , θL , and θB language families defined in Glottolog [Hammarström et al.,
with the parameters pretrained in § 2.1. Through this eval- 2021]. In this evaluation, we selected de and hu from each
uation, we investigated whether the performance gain with of the families. During supervised learning in § 2.2, we ex-
our method resulted from the initialization of the language- cluded the paired data for each of de and hu and instead in-
aware embedding layer or the encoder. We observed that W/o cluded the paired data for es. Table 5 lists the results. We
initializing encoder resulted in an improvement of 0.04 in chose the IPA-based baseline method, which had shown bet-
MCD and only a 2.27% increase in CER on average, sug- ter results in § 3.3. We observed that the pretraining im-
gesting that our method benefits more from the pretraining of proved the CER by around 10% and MCD by around 0.3 for
the language-aware embedding layer than from the encoder. de. However, the improvement in CER for hu was limited
In Updating language-aware embedding layer, we updated to 2%, while the MCD was improved by around 0.5. These
the language-aware embedding layer during supervised learn- results suggest that the performance of our zero-shot TTS is
ing, as formulated in Eq. (8). We observed that freezing the language dependent, as observed in previous work on cross-
language-aware embedding layer led to better performance lingual transfer for NLP tasks [Wu and Dredze, 2019].
for most languages and metrics, resulting in an average dif- Fig. 3 visualize the token embedding Z and encoder in-
ference of 0.29 in MCD and 1.04% in CER. puts Hin averaged on each utterance. We used a t-distributed
stochastic neighbor embeddings (t-SNE) [der Maaten and language G2P knowledge, and previous work has built a byte-
Hinton, 2008]. We observed overlaps in the token embed- based TTS model for around 40 languages [He et al., 2021].
ding for (es, fr) and (de, nl), which are classified as Western There has been work using the phonological features derived
Romance and West Germanic in Glottolog, respectively. The from IPA to achieve the zero-shot TTS [Staib et al., 2020].
encoder inputs are separated in the embedding space for each Our framework achieves the zero-shot cross-lingual transfer
language. The results in Table 5 and the visualization sug- with bytes by leveraging multilingual text pretraining. There
gest that the cross-lingual transfer works better when simi- have been studies on using untranscribed speech data for low-
lar languages sharing the token embedding space are present resource scenarios by leveraging VQ-VAE [Zhang and Lin,
during supervised learning. However, for languages with dis- 2020] or an ASR model [Ren et al., 2019; Ni et al., 2022].
tinct token and language embeddings, the cross-lingual trans- Other work [Saeki et al., 2022b] has trained a massively
ferability might be limited. We leave the further analysis on multilingual TTS using paired TTS, paired ASR, unpaired
language dependencies as a topic for future research. speech, and unpaired text data. While it also performs text-
only training as in our work, it still uses the paired speech-text
3.6 Subjective Evaluations on Naturalness data of the target languages. Our framework is simple and
We conducted evaluations on naturalness as described in scalable, while pioneering a novel paradigm with the zero-
§ 3.1. Fig. 4 shows the results for seen languages. Note shot TTS approach that relies only on text data.
that we conducted the listening tests for de and fr. For each Cross-lingual representation learning for NLP There
language, either of the proposed methods showed the high- have been studies on learning cross-lingual representations
est MOS, while we did not observe any significant differ- that can be applied to various NLP tasks in different lan-
ence between the proposed methods and the best baseline guages [Gouws et al., 2015; Ruder et al., 2019]. Recent
method, which was the IPA-based multilingual model with work has highlighted the strong cross-lingual transferability
LIDs. To further validate our results, we also evaluated the of multilingual BERT [Devlin et al., 2019], which has been
naturalness with an AMOS prediction model, as shown in observed to perform surprisingly well when transferred to
Fig. 4. We observed that the either of the proposed meth- other languages [Wu and Dredze, 2019; Conneau and Lam-
ods showed the highest scores in all the languages. On aver- ple, 2019]. Building on this, our work leverages multilingual
age, the byte-based and IPA-based proposed models showed MLM pretraining for TTS, which improves byte-based TTS
2.89 and 2.84, respectively, while the best baseline method models without G2P knowledge and achieves zero-shot TTS.
obtained 2.837 . Additionally, we observed that the byte-based Language model pretraining for TTS Previous research
proposed model often scored higher than the IPA-based pro- has explored self-supervised text pretraining techniques for
posed models, which is consistent with the results in Table 2. TTS. BERT models have been used to extract contextual
Fig. 5 shows the results for unseen languages. The ora- embeddings and enhance the prosody of TTS [Hayashi et
cle methods had the highest MOS of 3.76 and 3.96, and the al., 2019; Xu et al., 2021]. Other studies have used
baseline zero-shot method had the lowest MOS of 3.29. The phonemes jointly with graphemes [Jia et al., 2021] or sub-
proposed methods outperformed the baseline method, and the phonemes [Zhang et al., 2022] as the inputs of the MLM pre-
byte- and IPA-based models had the MOS of 3.44 and 3.32, training. Our work proposes multilingual MLM pretraining
respectively. The AMOS results were consistent with the lis- for TTS using text tokens shared across languages, rather than
tening test results, with the proposed zero-shot TTS methods focusing on monolingual pretraining.
outperforming the baseline method. In this evaluation, the
proposed byte-based model scored 3.21 on the AMOS, while
the oracle IPA-based model scored 3.20. To further validate
5 Conclusions
the results, we conducted a preference AB test on naturalness We presented a multilingual TTS framework that leverages
with 25 rators. As shown in Fig. 5, our byte-based model unsupervised text pretraining. Our framework achieved
significantly outperformed the baseline IPA-based model. highly intelligible zero-shot TTS for an unseen language, re-
sulting in a CER of less than 12%. It also improved the
4 Related Work TTS for seen languages, with byte-based models without G2P
modules outperforming the IPA-based baselines. Our abla-
Multilingual TTS While previous work on multilingual tion studies provided additional insights, including the effec-
TTS has primarily focused on resource-rich languages [Zen tiveness of the frozen language embedding layer.
et al., 2012; Li and Zen, 2016], there is growing interest
Limitations and future work Our proposed framework
in developing TTS models on low-resource languages. Sev-
has limitations. The performance gap remains between the
eral studies have explored the input tokens shared across lan-
oracle models and our zero-shot TTS models in terms of in-
guages such as bytes [Li et al., 2019a; He et al., 2021], IPA
telligibility, speech quality, and naturalness, as seen in the
symbols [Gutkin, 2017], and articulatory features [Lux and
evaluation in § 3.3 and § 3.6. Further studies are needed to
Vu, 2022], to transfer knowledge from resource-rich to low-
improve our zero-shot TTS. Our framework also has a limi-
resource languages. Grapheme tokens can eliminate the per-
tation with language dependency, as the results in § 3.5 sug-
7
The AMOS tended to be lower than the MOS. While the MOS gest that this dependency is caused by the presence of similar
prediction model has a high correlation, it may produce errors in languages during supervised learning. Our future work will
predicting absolute values, as reported in previous work [Saeki et al., focus on studying this language dependency further and de-
2022a]. The relative relationships are more reliable in the AMOS. veloping a method that performs better for various languages.
Acknowledgments Extending the edge of tts research. arXiv preprint
Part of this work was supported by JSPS KAKENHI Grant arXiv:2110.07840, 2021.
Number 21H05054, 22H03639, and 22J12040. This work [He et al., 2021] M. He, J. Yang, L. He, and F. K
used the Bridges system [Nystrom et al., 2015], which is sup- Soong. Multilingual byte2speech models for scal-
ported by NSF award number ACI-1445606, at the Pittsburgh able low-resource speech synthesis. arXiv preprint
Supercomputing Center. We would like to thank the research arXiv:2103.03541, 2021.
teams at Google through the internship of the first author for [Hendrycks and Gimpel, 2016] D. Hendrycks and K. Gim-
providing various insights on this topic.
pel. Gaussian error linear units (gelus). arXiv preprint
arXiv:1606.08415, 2016.
References
[Jia et al., 2021] Y. Jia, H. Zen, J. Shen, Y. Zhang, and
[Ba et al., 2016] J. L. Ba, J. R. Kiros, and G. E Hinton. Layer
Y. Wu. PnG BERT: Augmented BERT on phonemes
normalization. arXiv preprint arXiv:1607.06450, 2016. and graphemes for neural TTS. arXiv preprint
[Bañón et al., 2020] M. Bañón, P. Chen, B. Haddow, arXiv:2103.15060, 2021.
K. Heafield, H. Hoang, M. Esplà-Gomis, M. L Forcada, [Jr, 2005] R. G Gordon Jr. Ethnologue, languages of the
A. Kamran, F. Kirefu, P. Koehn, et al. Paracrawl: Web-
world. https://www.ethnologue.com/, 2005. Accessed:
scale acquisition of parallel corpora. In Proc. ACL, pages
2023-05-27.
4555–4567, 2020.
[Bapna and Firat, 2019] A. Bapna and O. Firat. Simple, scal- [Kim et al., 2021] J. Kim, J. Kong, and J. Son. Conditional
able adaptation for neural machine translation. In Proc. variational autoencoder with adversarial learning for end-
EMNLP-IJCNLP, pages 1538–1548, 2019. to-end text-to-speech. In Proc. ICML, pages 5530–5540,
2021.
[Conneau and Lample, 2019] A. Conneau and G. Lample.
Cross-lingual language model pretraining. In Proc. [Kong et al., 2020] J. Kong, J. Kim, and J. Bae. HiFi-GAN:
NeurIPS, pages 7059–7069, 2019. Generative adversarial networks for efficient and high fi-
delity speech synthesis. Proc. NeurIPS, 33:17022–17033,
[der Maaten and Hinton, 2008] L. Van der Maaten and 2020.
G. Hinton. Visualizing data using t-sne. JMLR,
9(11):2579–2605, 2008. [Li and Zen, 2016] B. Li and H. Zen. Multi-language multi-
speaker acoustic modeling for LSTM-RNN based statis-
[Devlin et al., 2019] J. Devlin, M.-W. Chang, K. Lee, and
tical parametric speech synthesis. In Proc. Interspeech,
K. Toutanova. BERT: Pre-training of deep bidirectional pages 2468–2472, 2016.
transformers for language understanding. In Proc. NAACL,
pages 4171–4186, 2019. [Li et al., 2019a] B. Li, Y. Zhang, T. Sainath, Y. Wu, and
[Ebrahimi and Kann, 2021] A. Ebrahimi and K. Kann. How W. Chan. Bytes are all you need: End-to-end multilin-
gual speech recognition and synthesis with bytes. In Proc.
to adapt your pretrained multilingual model to 1600 lan-
ICASSP, pages 5621–5625, 2019.
guages. In Proc. ACL-IJCNLP, pages 4555–4567, 2021.
[Fukada et al., 1992] T. Fukada, K. Tokuda, T. Kobayashi, [Li et al., 2019b] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu.
and S. Imai. An adaptive algorithm for mel-cepstral anal- Neural speech synthesis with Transformer network. In
ysis of speech. In Proc. ICASSP, pages 137–140, 1992. Proc. AAAI, pages 6706–6713, 2019.
[Gouws et al., 2015] S. Gouws, Y. Bengio, and G. Corrado. [Li et al., 2022] X. Li, F. Metze, D. R Mortensen, A. W
Bilbowa: Fast bilingual distributed representations with- Black, and S. Watanabe. ASR2K: Speech recognition
out word alignments. In Proc. ICML, pages 748–756, for around 2000 languages without audio. arXiv preprint
2015. arXiv:2209.02842, 2022.
[Gutkin, 2017] A. Gutkin. Uniform multilingual multi- [Lux and Vu, 2022] F. Lux and T. Vu. Language-agnostic
speaker acoustic model for statistical parametric speech meta-learning for low-resource text-to-speech with artic-
synthesis of low-resourced languages. In Proc. Inter- ulatory features. In Proc. ACL, pages 6858–6868, 2022.
speech, pages 2183–2187, 2017. [Munich Artificial Intelligence Laboratories GmbH, 2017]
[Hammarström et al., 2021] H. Hammarström, R. Forkel, Munich Artificial Intelligence Laboratories GmbH.
M. Haspelmath, and S. Bank. Glottolog 4.5. Max Planck The M-AILABS speech dataset. https://www.caito.de/
Institute for the Science of Human History, 2021. 2019/01/the-m-ailabs-speech-dataset/, 2017. Accessed:
[Hayashi et al., 2019] T. Hayashi, S. Watanabe, T. Toda, 2023-05-27.
K. Takeda, S. Toshniwal, and K. Livescu. Pre-trained [Nagrani et al., 2017] A. Nagrani, J. S. Chung, and A. Zis-
text embeddings for enhanced text-to-speech synthesis. In serman. VoxCeleb: A large-scale speaker identification
Proc. Interspeech, pages 4430–4434, 2019. dataset. In Proc. Interspeech, pages 2616–2620, 2017.
[Hayashi et al., 2021] T. Hayashi, R. Yamamoto, [Nair and Hinton, 2010] V. Nair and G. E Hinton. Rectified
T. Yoshimura, P. Wu, J. Shi, T. Saeki, Y. Ju, Y. Ya- linear units improve restricted boltzmann machines. In
suda, S. Takamichi, and S. Watanabe. Espnet2-tts: Proc. ICML, 2010.
[Ni et al., 2022] J. Ni, L. Wang, H. Gao, K. Qian, Y. Zhang, [Snyder et al., 2018] D. Snyder, D. Garcia-Romero, G. Sell,
S. Chang, and M. Hasegawa-Johnson. Unsupervised D. Povey, and S. Khudanpur. X-vectors: Robust dnn em-
text-to-speech synthesis by unsupervised automatic speech beddings for speaker recognition. In Proc. ICASSP, pages
recognition. In Proc. Interspeech, pages 461–465, 2022. 5329–5333, 2018.
[Nystrom et al., 2015] N. A Nystrom, M. J Levine, R. Z [Staib et al., 2020] M. Staib, T. H. Teh, A. Torresquintero,
Roskies, and J Ray Scott. Bridges: a uniquely flexible D. S R. Mohan, L. Foglianti, R. Lenain, and J. Gao.
hpc resource for new communities and data analytics. In Phonological features for 0-shot multilingual speech syn-
Proc. XSEDE, pages 1–8, 2015. thesis. In Proc. Interspeech, pages 2942–2946, 2020.
[Paolacci et al., 2010] G. Paolacci, J. Chandler, and P. G [Tachibana et al., 2018] H. Tachibana, K. Uenoyama, and
Ipeirotis. Running experiments on amazon mechanical S. Aihara. Efficiently trainable text-to-speech system
turk. Judgment and Decision making, 5(5):411–419, 2010. based on deep convolutional networks with guided atten-
[Park and Mulc, 2019] K. Park and T. Mulc. CSS10: A col- tion. In Proc. ICASSP, pages 4784–4788, 2018.
lection of single speaker speech datasets for 10 languages. [Vaswani et al., 2017] A. Vaswani, N. Shazeer, N. Parmar,
Proc. Interspeech, pages 1566–1570, 2019. J. Uszkoreit, L. Jones, A. N Gomez, Ł. Kaiser, and I. Polo-
[Pires et al., 2019] T. Pires, E. Schlinger, and D. Garrette. sukhin. Attention is all you need. In Proc. NeurIPS, vol-
How multilingual is multilingual BERT? In Proc. ACL, ume 30, 2017.
pages 4996–5001, 2019. [Veaux et al., 2017] C. Veaux, J. Yamagishi, K. MacDonald,
[Prakash et al., 2019] A. Prakash, A L. Thomas, S Umesh, et al. CSTR VCTK corpus: English multi-speaker corpus
and H. A Murthy. Building multilingual end-to-end speech for CSTR voice cloning toolkit. University of Edinburgh.
synthesisers for Indian languages. In Proc. SSW, pages The Centre for Speech Technology Research (CSTR), 2017.
194–199, 2019. [Wang et al., 2021] C. Wang, M. Riviere, A. Lee, A. Wu,
[Radford et al., 2022] A. Radford, J. W Kim, T. Xu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and
G. Brockman, C. McLeavey, and I. Sutskever. Robust E. Dupoux. VoxPopuli: A large-scale multilingual speech
speech recognition via large-scale weak supervision. arXiv corpus for representation learning, semi-supervised learn-
preprint arXiv:2212.04356, 2022. ing and interpretation. In Proc. ACL, pages 993–1003,
2021.
[Ravanelli et al., 2021] M. Ravanelli, T. Parcollet,
P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, [Watanabe et al., 2018] S. Watanabe, T. Hori, S. Karita,
C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, T. Hayashi, J. Nishitoba, Y. Unno, N.-E. Y. Soplin, J. Hey-
et al. SpeechBrain: A general-purpose speech toolkit. mann, M. Wiesner, N. Chen, et al. ESPnet: End-to-end
arXiv preprint arXiv:2106.04624, 2021. speech processing toolkit. Proc. Interspeech, pages 2207–
2211, 2018.
[Ren et al., 2019] Y. Ren, X. Tan, T. Qin, S. Zhao, Z. Zhao,
and T.-Y. Liu. Almost unsupervised text to speech and au- [Wu and Dredze, 2019] S. Wu and M. Dredze. Beto, Bentz,
tomatic speech recognition. In Proc. ICML, pages 5410– Becas: The surprising cross-lingual effectiveness of
5419, 2019. BERT. In Proc. EMNLP-IJCNLP, pages 833–844, 2019.
[Ruder et al., 2019] S. Ruder, I. Vulić, and A. Søgaard. A [Xu et al., 2021] G. Xu, W. Song, Z. Zhang, C. Zhang,
survey of cross-lingual word embedding models. JAIR, X. He, and B. Zhou. Improving prosody modelling with
65:569–631, 2019. cross-utterance BERT embeddings for end-to-end speech
synthesis. In Proc. ICASSP, pages 6079–6083, 2021.
[Saeki et al., 2022a] T. Saeki, D. Xin, W. Nakata, T. Ko-
riyama, S. Takamichi, and H. Saruwatari. UTMOS: [Zen et al., 2012] H. Zen, N. Braunschweiler, S. Buchholz,
UTokyo-SaruLab system for VoiceMOS Challenge 2022. M. JF Gales, K. Knill, S. Krstulovic, and J. Latorre. Sta-
In Proc. Interspeech, pages 4521–4525, 2022. tistical parametric speech synthesis based on speaker and
language factorization. TASLP, 20(6):1713–1724, 2012.
[Saeki et al., 2022b] T. Saeki, H. Zen, Z. Chen, N. Morioka,
G. Wang, Y. Zhang, A. Bapna, A. Rosenberg, and B. Ram- [Zen et al., 2019] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J
abhadran. Virtuoso: Massive multilingual speech-text Weiss, Y. Jia, Z. Chen, and Y. Wu. LibriTTS: A corpus
joint semi-supervised learning for text-to-speech. arXiv derived from LibriSpeech for text-to-speech. In Proc. In-
preprint arXiv:2210.15447, 2022. terspeech, pages 1526–1530, 2019.
[Seki et al., 2022] K. Seki, S. Takamichi, T. Saeki, and [Zhang and Lin, 2020] H. Zhang and Y. Lin. Unsupervised
H. Saruwatari. Text-to-speech synthesis from dark data learning for sequence-to-sequence text-to-speech for low-
with evaluation-in-the-loop data selection. arXiv preprint resource languages. Proc. Interspeech, pages 3161–3165,
arXiv:2210.14850, 2022. 2020.
[Shen et al., 2018] J. Shen, R. Pang, R. J Weiss, M. Schus- [Zhang et al., 2022] G. Zhang, K. Song, X. Tan, D. Tan,
ter, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, Y. Yan, Y. Liu, G. Wang, W. Zhou, T. Qin, T. Lee,
RJ Skerrv-Ryan, et al. Natural TTS synthesis by condi- et al. Mixed-Phoneme BERT: Improving BERT with
tioning WaveNet on mel spectrogram predictions. In Proc. mixed phoneme and sup-phoneme representations for text
ICASSP, pages 4779–4783, 2018. to speech. In Proc. Interspeech, pages 456–460, 2022.
Seen Unseen
Avg.
Method de fr ru fi es
MCD CER MCD CER MCD CER MCD CER MCD CER MCD CER
Spoken Text 5.65 3.79 6.48 7.15 7.38 10.62 4.99 5.28 9.05 18.27 6.46 9.58
Written Text 5.81 4.55 6.94 9.10 7.61 21.24 5.22 12.73 9.50 18.44 6.76 12.52
Spoken+Written Text 5.54 3.72 6.34 7.51 7.07 15.33 4.96 5.44 8.82 17.48 6.35 10.04

Table 6: Comparison of text data domain for unsupervised text pretraining.

Seen Unseen
Avg.
Method de fr ru fi es
MCD CER MCD CER MCD CER MCD CER MCD CER MCD CER
Residual layer 5.65 3.79 6.48 7.15 7.38 10.62 4.99 5.28 9.05 18.27 6.46 9.58
Transformer encoder 5.77 4.63 6.36 6.61 7.17 11.25 4.90 6.50 8.89 14.15 6.44 9.97

Table 7: Comparison of bottleneck layer architecture.

A Text data for pretraining encoder as the bottleneck layer (referred to as Transformer-
encoder), comparing it with the original residual layer de-
We investigated the effect of different types of text data
tailed in § 2.4 (referred to as Residual layer). Table 7 lists the
used in pretraining Dtext on the performance of our method.
results.
As described in § 3.1, we used spoken texts from VoxPop-
uli, M-AILABS, and CSS10 in the evaluations presented in For the seen languages, the superior performance between
§ 3. However, we can acquire a large amount of text data the residual layer and the transformer encoder varied, depend-
from datasets designed for NLP or from web-crawled text ing on the specific language and evaluation metrics. However,
resources. These data typically consist of written texts and for the unseen language, the Transformer encoder showed
cover a wide range of domains. To investigate the effective- higher performance, achieving an improvement of 4.12 in
ness of using written texts for our multilingual TTS, we used CER. Looking at the average scores across all languages,
ParaCrawl [Bañón et al., 2020], a web-crawled text dataset the Transformer encoder had a slightly lower MCD, while
built for machine translation, for Dtext during the unsuper- the CER was reduced by 0.39 when using the residual layer.
vised pretraining described in § 2.1. For the investigation, These results suggest that the use of a deeper layer can im-
we randomly sampled the same amount of texts as in Table 1 prove the generalizability of the proposed model. Neverthe-
for each language. We then trained our model using the fol- less, the overall performance of both models remains compa-
lowing three different cases. 1) Spoken Text: Only using the rable in terms of average metrics.
spoken text for pretraining as in the previous evaluations, 2)
Written Text: Only using the text data from ParaCrawl, and C Effect of excluding some languages from
3) Spoken+Written Text: We combined the text data in Spo- paired data.
ken Text and Written Text. We used the byte-based proposed
model presented in § 3.2 and § 3.3. Table 6 lists the results. In this section, we have deliberately excluded several lan-
We observed that Spoken Text outperformed Written Text guages from the paired data used for the supervised learning
in all the metrics and languages, resulting in an average dif- described in § 2.2 in order to study their impact. As shown
ference of 0.3 in MCD and 2.94% in CER. These results in Table 1, the paired data originally included the languages
demonstrate the effectiveness of using spoken text for pre- de, fr, nl, fi, hu, ru, and el. As a comparison case, we first ex-
training. Spoken+Written Text showed on average 0.11 lower cluded fr, which belongs to Italic languages as es according to
MCD and 0.46% higher CER compared to Spoken Text. How- Glottolog [Hammarström et al., 2021]. We also removed de
ever, for the unseen language, Spoken+Written Text outper- and nl, which belong to Germanic languages, from the paired
formed Spoken Text in MCD and CER. These results suggest data. Consequently, in the comparison case, supervised learn-
that adding written text data can improve the generalization ing was performed only with fi, ru, hu, and el. Table 8 lists
of our TTS models for the zero-shot scenarios. the results. Original corresponds to the case shown in Ta-
ble 1, while Excluded denotes the case where only fi, ru, hu,
and el were used.
B Architecture of bottleneck layer The three languages listed on the left-hand side of Table 8
As described in § 2.4, we used the residual layer for our bot- (ru, hu, fi) represent the seen languages in both cases. In-
tleneck layer. Also, we demonstrated the effectiveness of terestingly, the Original scenario generally outperformed the
the residula bottleneck layer for both seen and unseen lan- Excluded scenario for these languages. These results indi-
guages in the evaluation presented in § 3.4. In this section, we cate that in the context of multilingual TTS training, perfor-
explored an alternative architecture for the bottleneck layer. mance can potentially be improved by including a wider va-
We conducted experiments using a single-layer Transformer riety of languages rather than restricting to similar languages.
ru hu fi de fr es
Method
MCD CER MCD CER MCD CER MCD CER MCD CER MCD CER
Original (de, fr, nl, fi, hu, ru, el) 7.38 10.62 5.01 6.05 4.99 5.28 5.65 3.79 6.48 7.15 9.05 18.27
Excluded (fi, hu, ru, el) 7.00 11.11 5.32 6.92 4.98 5.46 10.39 34.11 10.90 49.65 10.00 24.80

Table 8: Effects of excluding languages from paired data.

both the baseline and the proposed methods. The top part of
Fig. 6 represents results from the fifth layer, while the bottom
part corresponds to results from the sixth layer. The left side
of the figure shows the results of the baseline method, while
the right side shows the results of the proposed method.
We observe that the fifth layer of the baseline model shows
a discontinuity in the attention map, which leads to instabil-
ity of the linguistic content. Conversely, in the fifth layer of
the proposed model, the attention map is significantly more
continuous than in the baseline method. These results sug-
gest that our unsupervised text pretraining can improve cross-
attention in the absence of paired speech-text data. The re-
sults are also reflected in the intelligibility difference between
the baseline and the proposed methods presented in § 3.3.

(a) Baseline (Bytes) (b) Proposed (Bytes)

Figure 6: Cross-attention maps obtained from byte-based baseline

and proposed methods, which correspond to first attention head in
fifth and sixth layers of decoder.

This suggests the effectiveness of massively multilingual TTS

training, as supported by previous work [Saeki et al., 2022b].
The two languages (de, fr) were included in the paired data
for Original, but were excluded from the paired data for Ex-
cluded. We observed that the zero-shot setting resulted in a
decrease in performance. Comparing the de results in Table 5
with those in Table 8, we confirmed that excluding fr, nl, and
es resulted in a 6.1% performance degradation in CER. Also,
es was absent from the paired data in both the Original and
Excluded scenarios. An examination of the es results revealed
a 6.53% increase in CER for the Excluded scenario. These
results demonstrate the importance of including linguistically
similar languages in the paired data.

D Observation of cross-attention map.

In this section, we show the cross-attention maps obtained
during inference from both the byte-based proposed and the
baseline methods defined in § 3.1. In both cases, an utterance
sampled from an unseen language (Spanish) was used. Fig. 6
shows the cross-attention maps corresponding to the first at-
tention head in the fifth and sixth layers of the decoder. It
should be noted that we have cross-attention maps for other
heads and layers. The results shown in Fig. 6 are derived from
attention maps that have a diagonal shape in higher layers for

How To Choose An Effective and Sufficient Sample For An AML Program Audit
0% (1)
How To Choose An Effective and Sufficient Sample For An AML Program Audit
15 pages
Parrot TTS
No ratings yet
Parrot TTS
13 pages
Phonetic Enhanced Language Modeling For Text-to-Speech Synthesis
No ratings yet
Phonetic Enhanced Language Modeling For Text-to-Speech Synthesis
5 pages
PNG BERT-augmented BERT On Phonemes and Graphemes For Neural TTS
No ratings yet
PNG BERT-augmented BERT On Phonemes and Graphemes For Neural TTS
5 pages
BASE TTS: Lessons From Building A Billion-Parameter Text-to-Speech Model On 100K Hours of Data (2402.08093)
No ratings yet
BASE TTS: Lessons From Building A Billion-Parameter Text-to-Speech Model On 100K Hours of Data (2402.08093)
27 pages
Unsupervised Tts by Unsupervised Automatic Speech Recognition-HIFIGAN
No ratings yet
Unsupervised Tts by Unsupervised Automatic Speech Recognition-HIFIGAN
5 pages
Learning To Speak Fluently in A Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
No ratings yet
Learning To Speak Fluently in A Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
5 pages
Jointly Training Speech Recognition and Synthesis With Cycle Consistency
No ratings yet
Jointly Training Speech Recognition and Synthesis With Cycle Consistency
3 pages
How Do Source-Side Monolingual Word Embeddings Impact Neural Machine Translation?
No ratings yet
How Do Source-Side Monolingual Word Embeddings Impact Neural Machine Translation?
10 pages
Semi-Supervised Training For Improving Data Efficiency
No ratings yet
Semi-Supervised Training For Improving Data Efficiency
5 pages
BNTTS: Few-Shot Speaker Adaptation in Low-Resource Setting
No ratings yet
BNTTS: Few-Shot Speaker Adaptation in Low-Resource Setting
13 pages
Almost Unsupervised Text To Speech and Automatic Speech Recognition
No ratings yet
Almost Unsupervised Text To Speech and Automatic Speech Recognition
11 pages
Pre-Trained Text Embeddings For Enhanced Text-to-Speech Synthesis
No ratings yet
Pre-Trained Text Embeddings For Enhanced Text-to-Speech Synthesis
5 pages
Gokul Karthik Kumar Praveen S V Pratyush Kumar Mitesh M. Khapra Karthik Nandakumar
No ratings yet
Gokul Karthik Kumar Praveen S V Pratyush Kumar Mitesh M. Khapra Karthik Nandakumar
8 pages
Acoustic Word Embeddings MDPI
No ratings yet
Acoustic Word Embeddings MDPI
9 pages
STTATTS: Unified Speech-To-Text and Text-To-Speech Model: Hawau Olamide Toyin, Hao Li, Hanan Aldarmaki
No ratings yet
STTATTS: Unified Speech-To-Text and Text-To-Speech Model: Hawau Olamide Toyin, Hao Li, Hanan Aldarmaki
11 pages
Improving Neural Machine Translation Models With Monolingual Data
No ratings yet
Improving Neural Machine Translation Models With Monolingual Data
11 pages
Language-Agnostic BERT Sentence Embedding
No ratings yet
Language-Agnostic BERT Sentence Embedding
14 pages
Thesis
No ratings yet
Thesis
37 pages
XTTS - A Massively Multilingual Zero-Shot Text-to-Speech Model
No ratings yet
XTTS - A Massively Multilingual Zero-Shot Text-to-Speech Model
5 pages
Google USM - Scaling Automatic Speech Recognition Beyond 100 Languages
No ratings yet
Google USM - Scaling Automatic Speech Recognition Beyond 100 Languages
20 pages
Textless Unit-to-Unit Training For Many-to-Many Multilingual Speech-to-Speech Translation
No ratings yet
Textless Unit-to-Unit Training For Many-to-Many Multilingual Speech-to-Speech Translation
13 pages
S S - T P - S - I D: Caling Peech EXT RE Training With YN Thetic Nterleaved ATA
No ratings yet
S S - T P - S - I D: Caling Peech EXT RE Training With YN Thetic Nterleaved ATA
23 pages
CONNEAU and Lample - 2019 - Cross-Lingual Language Model Pretraining
No ratings yet
CONNEAU and Lample - 2019 - Cross-Lingual Language Model Pretraining
11 pages
Scaling Speech Technology To 1,000+ Languages
No ratings yet
Scaling Speech Technology To 1,000+ Languages
41 pages
2023 - Speak, Read and Prompt High-Fidelity Text-To-Speech With Minimal Supervision - Kharitonov Et Al
No ratings yet
2023 - Speak, Read and Prompt High-Fidelity Text-To-Speech With Minimal Supervision - Kharitonov Et Al
19 pages
S E A T - S: Ample Fficient Daptive EXT TO Peech
No ratings yet
S E A T - S: Ample Fficient Daptive EXT TO Peech
15 pages
Towards Spontaneous Style Modeling With Semi-Supervised Pre-Training For Conversational Text-to-Speech Synthesis
No ratings yet
Towards Spontaneous Style Modeling With Semi-Supervised Pre-Training For Conversational Text-to-Speech Synthesis
5 pages
Multilinguality
No ratings yet
Multilinguality
10 pages
Cross-Lingual Language Model Pretraining PDF
No ratings yet
Cross-Lingual Language Model Pretraining PDF
10 pages
Parallel Tacotron
No ratings yet
Parallel Tacotron
5 pages
Comsl: A Composite Speech-Language Model For End-To-End Speech-To-Text Translation
No ratings yet
Comsl: A Composite Speech-Language Model For End-To-End Speech-To-Text Translation
12 pages
VoiceBox Arxiv v3 0614
No ratings yet
VoiceBox Arxiv v3 0614
32 pages
2017 - Unsupervised Neural Machine Translation PDF
No ratings yet
2017 - Unsupervised Neural Machine Translation PDF
11 pages
AudioPaLM - A Large Language Model That Can Speak and Listen
No ratings yet
AudioPaLM - A Large Language Model That Can Speak and Listen
27 pages
Wave Tacotron Spectrogram Free End To End Text To Speech Synthesis
No ratings yet
Wave Tacotron Spectrogram Free End To End Text To Speech Synthesis
5 pages
SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
No ratings yet
SONAR: Sentence-Level Multimodal and Language-Agnostic Representations
14 pages
Transformer Part3 16 Mar 23 PDF
No ratings yet
Transformer Part3 16 Mar 23 PDF
59 pages
InstructTTS: Modelling Expressive TTS in Discrete Latent Space With Natural Language Style Prompt
No ratings yet
InstructTTS: Modelling Expressive TTS in Discrete Latent Space With Natural Language Style Prompt
13 pages
Conditional Variational Autoencoder With Adversarial Learning For End-to-End Text-to-Speech
No ratings yet
Conditional Variational Autoencoder With Adversarial Learning For End-to-End Text-to-Speech
15 pages
From Recurrent Neural Network Techniques To Pre-Trained Models: Emphasis On The Use in Arabic Machine Translation
No ratings yet
From Recurrent Neural Network Techniques To Pre-Trained Models: Emphasis On The Use in Arabic Machine Translation
10 pages
2004 09813v1 PDF
No ratings yet
2004 09813v1 PDF
10 pages
LR Speech Tts ASR Combo 2020
No ratings yet
LR Speech Tts ASR Combo 2020
11 pages
Neural Codec Language Models Are Zero-Shot Text To Speech Synthesizers
No ratings yet
Neural Codec Language Models Are Zero-Shot Text To Speech Synthesizers
16 pages
Suoni
No ratings yet
Suoni
38 pages
Towards The Next 1000 Languages in Multilingual Machine Translation: Exploring The Synergy Between Supervised and Self-Supervised Learning
No ratings yet
Towards The Next 1000 Languages in Multilingual Machine Translation: Exploring The Synergy Between Supervised and Self-Supervised Learning
14 pages
NaturalSpeech 3: Zero-Shot Speech Synthesis With Factorized Codec and Diffusion Models
No ratings yet
NaturalSpeech 3: Zero-Shot Speech Synthesis With Factorized Codec and Diffusion Models
22 pages
Translating Similar Languages: Role of Mutual Intelligibility in Multilingual Transformers
No ratings yet
Translating Similar Languages: Role of Mutual Intelligibility in Multilingual Transformers
7 pages
Huggingface Co Blog Warm Starting Encoder Decoder Data Preprocessing
No ratings yet
Huggingface Co Blog Warm Starting Encoder Decoder Data Preprocessing
20 pages
An Embarrassingly Simple Approach For Transfer Learning From Pretrained Language Models
No ratings yet
An Embarrassingly Simple Approach For Transfer Learning From Pretrained Language Models
7 pages
DiTTo TTS
No ratings yet
DiTTo TTS
34 pages
V L: V F S P L: Oice OOP Oice Itting and Ynthesis Via A Honological OOP
No ratings yet
V L: V F S P L: Oice OOP Oice Itting and Ynthesis Via A Honological OOP
14 pages
Task Arithmetic For Language Expansion in Speech Translation
No ratings yet
Task Arithmetic For Language Expansion in Speech Translation
5 pages
Prompting Large Language Models With Speech Recognition Abilities
No ratings yet
Prompting Large Language Models With Speech Recognition Abilities
9 pages
Improving Text Embeddings With Large Language Models
No ratings yet
Improving Text Embeddings With Large Language Models
20 pages
Llasa
No ratings yet
Llasa
25 pages
Neural Machine Translation: Shusen Wang
No ratings yet
Neural Machine Translation: Shusen Wang
57 pages
Speechut: Bridging Speech and Text With Hidden-Unit For Encoder-Decoder Based Speech-Text Pre-Training
No ratings yet
Speechut: Bridging Speech and Text With Hidden-Unit For Encoder-Decoder Based Speech-Text Pre-Training
14 pages
Literature Survey
No ratings yet
Literature Survey
6 pages
Language Identification: Fundamentals and Applications
From Everand
Language Identification: Fundamentals and Applications
Fouad Sabry
No ratings yet
ChatGPT for Linguists: Revolutionize Language Research and Analysis with AI-Driven Insights (2024 Guide)
From Everand
ChatGPT for Linguists: Revolutionize Language Research and Analysis with AI-Driven Insights (2024 Guide)
JED RAMOS
No ratings yet
CHAPTER 6 Frequency Analysis
No ratings yet
CHAPTER 6 Frequency Analysis
38 pages
Assignment LL Foronda
No ratings yet
Assignment LL Foronda
4 pages
AZ-900T01 Microsoft Azure Fundamentals-04
No ratings yet
AZ-900T01 Microsoft Azure Fundamentals-04
28 pages
Electronics, Technology and Trends in The Online
No ratings yet
Electronics, Technology and Trends in The Online
1 page
Micromass GCT PM Protocol
No ratings yet
Micromass GCT PM Protocol
12 pages
Latex Foam Manufacturing Process
No ratings yet
Latex Foam Manufacturing Process
2 pages
Macro Preprocessor
No ratings yet
Macro Preprocessor
75 pages
SublimationPrinting101 2011edition PDF
No ratings yet
SublimationPrinting101 2011edition PDF
62 pages
Multiplexer and Demultiplexer
No ratings yet
Multiplexer and Demultiplexer
11 pages
'Case Studies'': Ar. EMMARA Abubakar Muhammad Arslan B-21878
No ratings yet
'Case Studies'': Ar. EMMARA Abubakar Muhammad Arslan B-21878
5 pages
Rev 32 P0003 28 August 2017
No ratings yet
Rev 32 P0003 28 August 2017
24 pages
Model: Service Manual
No ratings yet
Model: Service Manual
62 pages
Top 10 Mobile Marketing Questions and Answers
No ratings yet
Top 10 Mobile Marketing Questions and Answers
9 pages
UN40D5500 Trobleshooting PDF
100% (1)
UN40D5500 Trobleshooting PDF
49 pages
Chapter 5a - Concrete & Formwork
No ratings yet
Chapter 5a - Concrete & Formwork
28 pages
UvBeastV3 365nmMINI Manual-R2.1
No ratings yet
UvBeastV3 365nmMINI Manual-R2.1
20 pages
Solved Chapter 5 Worksheet Class 7
No ratings yet
Solved Chapter 5 Worksheet Class 7
2 pages
New Model Service Ratio - 15022025
No ratings yet
New Model Service Ratio - 15022025
36 pages
College Information System
68% (28)
College Information System
97 pages
Fembot
No ratings yet
Fembot
7 pages
Egmail
No ratings yet
Egmail
18 pages
Palak Project
No ratings yet
Palak Project
15 pages
Vtu Course Work
100% (2)
Vtu Course Work
7 pages
Senarai Frekuensi, Stesen Radio Di Malaysia
No ratings yet
Senarai Frekuensi, Stesen Radio Di Malaysia
2 pages
CFN Ug
No ratings yet
CFN Ug
4,069 pages
ComPact NS - 33228
No ratings yet
ComPact NS - 33228
3 pages
Exploring The Oracle Latches
No ratings yet
Exploring The Oracle Latches
52 pages
Activity No 03
No ratings yet
Activity No 03
4 pages
Differences Between Precision and Comfort Cooling
No ratings yet
Differences Between Precision and Comfort Cooling
5 pages

Zero Shot

Uploaded by

Zero Shot

Uploaded by

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with

Unsupervised Text Pretraining

While neural text-to-speech (TTS) has achieved

shot multilingual TTS using text-only data for the Training

es de respectively. The embedding layers output Z m and el as:

4.5 4.5 Natural

Figure 4: MOS and AMOS results for seen languages.

Baseline (Fully zero-shot TTS) p-value: 0.011

IPA multilingual 10.31 38.75 9.93 52.62 0.556 0.444

Table 6: Comparison of text data domain for unsupervised text pretraining.

Table 7: Comparison of bottleneck layer architecture.

Table 8: Effects of excluding languages from paired data.

(a) Baseline (Bytes) (b) Proposed (Bytes)

Figure 6: Cross-attention maps obtained from byte-based baseline

This suggests the effectiveness of massively multilingual TTS

D Observation of cross-attention map.

You might also like