Figure 2: Proposed framework. (a) We perform MLM pretraining on multilingual text data and then (b) train the TTS model on paired data with the frozen language-aware embedding layer. (c) Zero-shot TTS is performed with language IDs that are not included in the paired data. (In the figure, bytes or IPA symbols are fed to the language-aware embedding layer; panel (c) contrasts multilingual TTS for seen languages with zero-shot TTS for an unseen language.)

multilingual TTS framework that achieves highly intelligible TTS for an unseen language, resulting in a character error rate of less than 12%. 2) Our method also improves TTS for seen languages, resulting in byte-based models without grapheme-to-phoneme (G2P) modules that outperform the phoneme-based baselines. 3) Our ablation studies provide additional insights, including the effectiveness of the frozen language-aware embedding layer. The experiments were conducted on public datasets and the implementation is available1. We encourage readers to listen to our audio samples2.

1 https://github.com/Takaaki-Saeki/zm-text-tts
2 https://takaaki-saeki.github.io/zm-tts-text demo
3 https://github.com/espeak-ng/espeak-ng

2 Method
Our model has a typical neural TTS model architecture consisting of token embedding, encoder, and decoder. First, we use MLM pretraining with multilingual text data to learn cross-lingual representations. Then we perform supervised learning with paired data to learn the mapping from linguistic features to speech features. The model performs inference even for languages that are not present in the paired data.

2.1 Unsupervised Multilingual Text Pretraining
Fig. 2(a) illustrates the unsupervised pretraining method. It uses multilingual text data consisting of languages that are not included in the paired data. Let X = (x_n ∈ V | n = 1, ..., N) denote the input text token sequence of length N, where V denotes a vocabulary constructed for pretraining. The masked token embedding sequence Z^m and the language embedding e_l, obtained as in Eq. (1), are fed to the model as

H_in = Bottleneck(Z^m + e_l; θ_B),   (2)
H_out = Encoder(H_in; θ_E),   (3)
p(X | X_−Π) = Softmax(PredictionNet(H_out; θ_P)),   (4)

where θ_B, θ_E, and θ_P denote the model parameters of the bottleneck layer, the encoder, and a prediction network, respectively. In Eq. (4), Softmax(·) denotes a softmax function. We define the network with the model parameters {θ_B, θ_T, θ_L} as the language-aware embedding layer, which jointly embeds the token sequence X and the language ID l_text as in Eqs. (1) and (2). Let Π = (π_k ∈ N | k = 1, ..., K) be the indexes of the masked tokens, of length K. With the probability computed in Eq. (4), the training objective can be defined as:

L_mlm = −(1/K) Σ_{k=1}^{K} log p(x_{π_k} | X^m),
{θ̂_E, θ̂_B, θ̂_T, θ̂_L} = arg min_{θ_E, θ_B, θ_T, θ_L} L_mlm.   (5)

We use UTF-8 bytes or International Phonetic Alphabet (IPA) symbols for the input token sequence X. For each token type, the vocabulary V is constructed from D_text, which includes a start/end-of-sentence token ([SOS/EOS]). We extracted the IPA sequences using an open-source toolkit3. To obtain the masked tokens X^m, we use the same masking ratio and categories as in the original BERT pretraining [Devlin et al., 2019] for each token type. Randomly, 12% of the tokens are replaced with the [MASK] token, 1.5% are replaced with random tokens, and another 1.5% are left unchanged; L_mlm is computed as in Eq. (5) over these 15% of tokens, whose indices form Π.
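For concreteness, the masking procedure and the objective in Eq. (5) can be sketched as follows (a minimal PyTorch sketch; the token IDs, vocabulary size, and helper names are assumptions rather than the released implementation):

import torch
import torch.nn.functional as F

MASK_ID, PAD_ID, VOCAB_SIZE = 3, 0, 260  # hypothetical special-token IDs and byte-level vocabulary size

def mask_tokens(x: torch.Tensor, select_ratio: float = 0.15):
    # Select 15% of the (non-padding) positions; of these, 80% -> [MASK], 10% -> random token,
    # 10% unchanged, i.e., 12% / 1.5% / 1.5% of all tokens as described above.
    selected = (torch.rand_like(x, dtype=torch.float) < select_ratio) & (x != PAD_ID)
    r = torch.rand_like(x, dtype=torch.float)
    x_masked = x.clone()
    x_masked[selected & (r < 0.8)] = MASK_ID
    random_pos = selected & (r >= 0.8) & (r < 0.9)
    x_masked[random_pos] = torch.randint(4, VOCAB_SIZE, (int(random_pos.sum()),))
    return x_masked, selected  # `selected` marks the index set Π of Eq. (5)

def mlm_loss(logits: torch.Tensor, targets: torch.Tensor, selected: torch.Tensor) -> torch.Tensor:
    # Cross-entropy over the masked positions only: -(1/K) Σ_k log p(x_πk | X^m), cf. Eq. (5).
    return F.cross_entropy(logits[selected], targets[selected])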
2.2 Supervised Learning with Paired Data
Fig. 2(b) illustrates the supervised learning of the TTS model with paired data. We define the paired data and the corresponding set of language IDs as D_paired and L_paired, respectively. Note that we assume L_paired ⊂ L_text. Let Y = (y_t ∈ R^D | t = 1, ..., T) denote the speech feature sequence of length T.
We first initialize the model parameters {θ_E, θ_B, θ_T, θ_L} with those obtained in the pretraining described in § 2.1. Let θ_D denote the model parameters of the decoder. The speech features are predicted with teacher forcing as:

H_out = Encoder(Bottleneck(Z + e_l)),   (6)
Ŷ = Decoder(H_out, Y; θ_D),   (7)

where Z is the unmasked token embedding sequence. Note that the unmasked token sequence is used in Eq. (6), while the masked token sequence is used in Eq. (2). Let L_tts(Ŷ, Y) denote the training objective of the TTS model. Then we consider two types of schemes.

Updating language-aware embedding layer: We only freeze the parameters of the language embedding layer θ_L while updating the rest of the parameters. Therefore, the trainable model parameters can be written as

{θ̂_D, θ̂_E, θ̂_B, θ̂_T} = arg min_{θ_D, θ_E, θ_B, θ_T} L_tts(Ŷ, Y).   (8)

Previous work has confirmed that multilingual BERT has high cross-lingual transferability for various NLP tasks [Wu and Dredze, 2019]. This scheme corresponds to a simple fine-tuning of BERT [Wu and Dredze, 2019], which updates all the parameters during training for the downstream tasks4.

Freezing language-aware embedding layer: We freeze the bottleneck layer and the token embedding layer along with the language embedding, updating only the encoder and decoder. The training process can be written as

{θ̂_D, θ̂_E} = arg min_{θ_D, θ_E} L_tts(Ŷ, Y).   (9)

In contrast to the scheme represented in Eq. (8), the scheme in Eq. (9) preserves the parameters of the language-aware embedding layer to facilitate cross-lingual transfer. In the evaluation, we use the scheme formulated in Eq. (9), except for the ablation study in § 3.4.

4 We freeze the language embedding layer to address the mismatch between the language embeddings of seen and unseen languages.
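As a concrete illustration of the two schemes, the parameter freezing can be expressed as follows (a minimal PyTorch sketch; the module names token_emb, lang_emb, bottleneck, encoder, and decoder are hypothetical stand-ins for the components carrying θ_T, θ_L, θ_B, θ_E, and θ_D, not the released implementation):

import torch.nn as nn

def configure_trainable(model: nn.Module, freeze_language_aware_embedding: bool) -> None:
    # Start from a fully trainable model and then freeze selected components.
    for p in model.parameters():
        p.requires_grad_(True)
    # Both schemes freeze the language embedding (theta_L); cf. Eq. (8).
    for p in model.lang_emb.parameters():
        p.requires_grad_(False)
    if freeze_language_aware_embedding:
        # Eq. (9): additionally freeze the token embedding (theta_T) and the bottleneck
        # (theta_B), so that only the encoder and decoder are updated.
        for module in (model.token_emb, model.bottleneck):
            for p in module.parameters():
                p.requires_grad_(False)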
2021], M-AILABS [Munich Artificial Intelligence Labora-
2.3 Inference tories GmbH, 2017], and CSS10 [Park and Mulc, 2019], re-
sulting in a total of about 2.8 GB of spoken text across 19
Let Lsyn denote the set of language IDs used for inference.
languages. We used CSS10 for the supervised learning de-
The text token sequence X and the language ID lsyn ∈ Lsyn
scribed in § 2.2, and we selected seven European languages
are fed to the model as in Eq. (1), and the encoder output is
as the seen languages, with Spanish as the unseen language.
predicted as in Eq. (6). Unlike Eq. (7), the speech features are
The paired data consisted of one speaker per language. It
predicted as:
should be noted that Spanish is not actually a low-resource
Ŷ = Decoder(Hout ; θD ). (10) language, but we chose to use it for evaluation purposes in
The output waveform is obtained by feeding the predicted order to 1) compare our zero-shot TTS methods with the ora-
features Ŷ to a pretrained neural vocoder. cle methods using the paired data for the target language and
Fig. 2(c) illustrates the inference process. The left and right 2) ensure a sufficient number of evaluators for the subjective
sides of the figure show the typical multilingual TTS and our evaluation. We used 5 and 100 utterances as dev and test sets,
zero-shot TTS. Previous work [Li et al., 2019a] has typically respectively, with the remaining data used for training.
assumed seen languages, and the inference is performed with
Training Details
the language IDs Lseen ⊂ Lpaired . However, it is challenging
to perform TTS for unseen languages Lunseen ∩ Lpaired = ∅. The sampling rate was set to 16 kHz. An 80-dimension
While other work [Saeki et al., 2022b] has built a massively of mel filter bank, 1024 samples of FFT length, and 256
multilingual TTS model that even achieves zero-shot TTS samples of frame shit were used for speech analysis. For
from ASR data, it uses paired data for the target languages. the pretraining described in § 2.1, we trained the model for
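The inference path of Eqs. (6) and (10) can be sketched as follows (module and function names such as token_emb, spk_proj, and decoder.inference are hypothetical stand-ins; this is not the actual ESPnet2-TTS API):

import torch

@torch.no_grad()
def synthesize(model, vocoder, tokens, lang_id, xvector):
    z = model.token_emb(tokens)                       # unmasked token embeddings Z
    e_l = model.lang_emb(lang_id)                     # language embedding; the ID may be unseen in the paired data
    h_out = model.encoder(model.bottleneck(z + e_l))  # Eq. (6)
    h_out = h_out + model.spk_proj(xvector)           # add the projected x-vector to the encoder output (see § 2.4)
    feats = model.decoder.inference(h_out)            # autoregressive decoding without teacher forcing, Eq. (10)
    return vocoder(feats)                             # e.g., a pretrained HiFi-GAN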
2.4 Model Architecture
Our model is an autoregressive TTS model based on Transformer TTS [Li et al., 2019b], which has also been used in previous work on byte-based multilingual TTS [He et al., 2021]. During the supervised learning described in § 2.2 and the inference described in § 2.3, we use an x-vector [Snyder et al., 2018] as the speaker embedding and add it to the encoder output through a projection layer. During supervised learning, we use the average x-vector computed from the training data. For evaluation purposes, we perform zero-shot synthesis by feeding the model the average x-vector computed from the test data of the target language. Note that we also conduct an evaluation with x-vectors from seen languages.
For the bottleneck layer with θ_B, we use a residual network consisting of Layer Normalization [Ba et al., 2016], a down projection, ReLU [Nair and Hinton, 2010], and an up projection with a residual connection, as used in previous work on language adaptation [Bapna and Firat, 2019].
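A minimal sketch of this residual bottleneck layer is given below (PyTorch; the dimensions 512 and 256 follow the settings reported later in § 3.1, while everything else is an assumption rather than the released implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBottleneck(nn.Module):
    """LayerNorm -> down projection -> ReLU -> up projection, with a residual connection."""
    def __init__(self, dim: int = 512, hidden: int = 256):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, hidden)
        self.up = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim), the sum of token and language embeddings (e.g., Z^m + e_l)
        return x + self.up(F.relu(self.down(self.norm(x))))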
3 Experimental Evaluations
3.1 Experimental Setting
Dataset
We carried out all the evaluations with publicly available datasets. Table 1 shows the sizes of the data for each language. For the unsupervised text pretraining described in § 2.1, we used transcripts from VoxPopuli [Wang et al., 2021], M-AILABS [Munich Artificial Intelligence Laboratories GmbH, 2017], and CSS10 [Park and Mulc, 2019], resulting in a total of about 2.8 GB of spoken text across 19 languages. We used CSS10 for the supervised learning described in § 2.2, and we selected seven European languages as the seen languages, with Spanish as the unseen language. The paired data consisted of one speaker per language. It should be noted that Spanish is not actually a low-resource language, but we chose it for evaluation purposes in order to 1) compare our zero-shot TTS methods with oracle methods that use the paired data of the target language and 2) ensure a sufficient number of evaluators for the subjective evaluation. We used 5 and 100 utterances as the dev and test sets, respectively, with the remaining data used for training.

Languages     Code   Text-only data   Paired data (text)   Paired data (audio)
Seen languages for evaluation (L_seen)
German        de     359MB            0.73MB               16.13h
French        fr     372MB            0.94MB               19.15h
Dutch         nl     336MB            0.75MB               14.10h
Finnish       fi     308MB            0.47MB               21.36h
Hungarian     hu     104MB            0.51MB               10.53h
Russian       ru     4.9MB            1.5MB                10.00h
Greek         el     0.39MB           0.39MB               4.13h
Unseen language for evaluation (L_unseen)
Spanish       es     345MB            0.0MB (1.2MB)        0.00h (23.81h)
Languages not included in CSS10 (text-only)
English       en     338MB            -                    -
Estonian      et     87MB             -                    -
Croatian      hr     2.0MB            -                    -
Italian       it     334MB            -                    -
Lithuanian    lt     89MB             -                    -
Polish        pl     102MB            -                    -
Romanian      ro     67MB             -                    -
Slovak        sk     94MB             -                    -
Slovenian     sl     81MB             -                    -

Table 1: Amount of text-only and paired data for each language. Parentheses indicate the amount of original data in CSS10.

Training Details
The sampling rate was set to 16 kHz. An 80-dimensional mel filter bank, an FFT length of 1024 samples, and a frame shift of 256 samples were used for speech analysis. For the pretraining described in § 2.1, we trained the model for 1.2M iterations using the Noam optimizer [Vaswani et al., 2017] with the learning rate and warm-up steps set to 1.0 and 10000, respectively. For the TTS model described in § 2.4, we used a 6-block Transformer encoder [Vaswani et al., 2017] and a 6-block Transformer decoder, with a postnet consisting of five convolutional layers with a kernel size of five. The attention dimension and the number of attention heads were set to 512 and 8, respectively. For the bottleneck layer described in § 2.4, we set the hidden dimension after the down projection to 256. The PredictionNet in Eq. (4) consisted of a linear layer, a GELU activation function [Hendrycks and Gimpel, 2016], Layer Normalization, and a linear layer with a hidden dimension of 512. We also used guided attention loss [Tachibana et al., 2018] to improve the training efficiency. For the supervised learning described in § 2.2, we trained the models for 2.47M iterations (200 epochs). The Noam optimizer was used with a warm-up of 50000 steps. For the neural vocoder, we trained HiFi-GAN [Kong et al., 2020] for 2M iterations with LibriTTS [Zen et al., 2019], VCTK [Veaux et al., 2017], and CSS10. For the x-vectors described in § 2.4, we used a model trained on VoxCeleb1 and VoxCeleb2 [Nagrani et al., 2017] published in SpeechBrain [Ravanelli et al., 2021]. We used ESPnet2-TTS [Watanabe et al., 2018; Hayashi et al., 2021] for the implementation.
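For reference, the Noam learning-rate schedule mentioned above follows the standard formulation of [Vaswani et al., 2017]; a sketch is shown below, where factor corresponds to the learning rate of 1.0 and warmup to the 10000 (pretraining) or 50000 (TTS training) warm-up steps. The exact scaling in the toolkit implementation may differ slightly.

def noam_lr(step: int, d_model: int = 512, factor: float = 1.0, warmup: int = 10000) -> float:
    # The rate increases linearly for `warmup` steps and then decays proportionally to step^-0.5.
    step = max(step, 1)
    return factor * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)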
Baselines
We developed baseline models without the pretraining.
Seen language. Monolingual: We trained a model for each language independently. Our preliminary study found that Transformer TTS was unstable5 and could not synthesize intelligible speech in the monolingual condition due to the lack of training data. Therefore, we used Tacotron2 [Shen et al., 2018] only for the monolingual models, as in the original paper of the dataset [Park and Mulc, 2019]. Multilingual w/o LIDs: We trained a multilingual Transformer TTS model using the paired data shown in Table 1 without language IDs (LIDs). Multilingual w/ LIDs: We trained a multilingual Transformer TTS model with the paired data of the unseen language; it also used the language IDs.
Unseen language. We compared Fully zero-shot TTS and Text-seen zero-shot TTS as defined in § 2.3. In Oracle, we used the Monolingual and Multilingual w/ LIDs models, which used the paired data of the unseen language. In Fully zero-shot TTS, we used Multilingual w/o LIDs to synthesize speech from text tokens in the unseen language. This method corresponds to the conventional multilingual TTS model using bytes [He et al., 2021] or IPA symbols [Staib et al., 2020].

5 The original paper [Li et al., 2019b] also reports the instability.

Evaluation Metrics
To objectively measure the synthetic speech quality, we used mel cepstral distortion (MCD) [Fukada et al., 1992] with the mel cepstrum dimension set to 25. We also evaluated the intelligibility using CERs computed with a multilingual ASR model [Radford et al., 2022]; we used a pretrained large model that is publicly available6. To evaluate the naturalness, we carried out listening tests to calculate five-scale mean opinion scores (MOS) of the synthesized speech for each method. Forty native speakers were recruited through Amazon Mechanical Turk [Paolacci et al., 2010] for each of the tests. Furthermore, we leveraged a publicly available automatic MOS (AMOS) prediction model [Saeki et al., 2022a] to evaluate the naturalness. Note that the model was trained on English and Chinese datasets, but previous work [Seki et al., 2022] has reported that it also shows a correlation coefficient higher than 0.8 for another language (Japanese).

6 https://github.com/openai/whisper
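As an illustration of the intelligibility metric, the CER can be computed by transcribing the synthesized speech with the multilingual Whisper model and scoring the transcript against the input text; a minimal sketch is shown below, assuming the openai-whisper and jiwer Python packages and omitting text normalization details.

import whisper
import jiwer

asr_model = whisper.load_model("large")

def compute_cer(wav_path: str, reference_text: str, language: str) -> float:
    # Transcribe the synthesized waveform in the target language and compare it with the input text.
    hypothesis = asr_model.transcribe(wav_path, language=language)["text"]
    return jiwer.cer(reference_text, hypothesis)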
3.2 Evaluation Results on Seen Languages
We evaluated our framework on the seen languages included in the paired data, as defined in § 2.3. Table 2 lists the results in MCD and CER; lower values are better for both metrics. As we can see, the byte-based and IPA-based models with the proposed multilingual pretraining performed the best across all languages and metrics. Among the baselines, byte-based monolingual and multilingual models tended to have higher MCD and CER than the IPA-based models and failed to synthesize intelligible speech in some languages. For example, the baseline byte-based models showed high CER values for French, which has a deep orthography, meaning that a single character has different pronunciations depending on the context. We observed that our method improved the byte-based models, and they outperformed the IPA-based baseline models for all the metrics and languages. It is worth noting that the proposed byte-based models even outperformed the proposed IPA-based models except for el and ru. These results suggest that our framework is effective in building a TTS model for languages without G2P modules.

Method                                    de           fr           ru           fi           hu           nl           el
                                          MCD   CER    MCD   CER    MCD   CER    MCD   CER    MCD   CER    MCD   CER    MCD   CER
Natural                                   -     2.75   -     4.52   -     2.12   -     4.73   -     4.86   -     6.22   -     7.14
Baseline (Monolingual)
Bytes monolingual                         7.70  8.61   11.76 91.82  11.43 >100   8.33  56.03  10.22 93.05  7.49  15.33  10.20 85.98
IPA monolingual                           7.38  4.07   8.96  17.86  11.89 25.30  7.23  27.62  7.59  24.62  7.80  19.20  8.16  21.79
Baseline (Multilingual)
Bytes multilingual w/o LIDs               7.68  37.46  8.71  41.35  9.38  45.92  6.26  29.19  6.48  33.82  8.46  46.33  7.64  36.24
Bytes multilingual w/ LIDs                6.51  13.19  10.84 55.79  12.89 >100   6.78  27.22  9.09  42.97  8.47  39.37  7.25  23.56
IPA multilingual w/o LIDs                 6.31  10.64  7.44  20.86  8.10  35.32  5.53  19.56  5.59  14.03  7.76  34.49  6.90  19.33
IPA multilingual w/ LIDs                  6.16  9.76   6.88  14.97  7.63  23.54  5.17  10.63  5.28  9.11   6.95  19.48  6.90  16.97
Proposed (Unsupervised text pretraining)
Bytes multilingual                        5.65  3.79   6.48  7.15   7.38  10.62  4.99  5.28   5.01  6.05   6.52  13.74  6.57  11.75
IPA multilingual                          5.88  5.52   6.61  7.72   7.25  15.85  5.18  8.62   5.30  7.37   7.00  14.42  6.53  11.06

Table 2: Evaluation results for seen languages. Bold indicates best scores in baseline and proposed methods.

3.3 Evaluation Results on Unseen Language
We evaluated our method on zero-shot TTS for the unseen language defined in § 2.3. As described in § 2.4, we first used the x-vector from the es speaker to compute the MCD. Table 3 lists the results. The baseline models showed CERs of over 40% and MCDs of over 10.0. However, our proposed text pretraining improved the metrics, resulting in CERs of less than half of those values for both the byte- and IPA-based methods. Also, in contrast to the results for the seen languages, the IPA-based model outperformed the byte-based one in terms of CER. Compared with the oracle case that uses the paired data of the unseen language, our proposed zero-shot TTS showed higher MCD and CER, but it achieved only a 1% difference in CER compared to the oracle byte-based monolingual model. These results demonstrate the effectiveness of our method in achieving intelligible zero-shot TTS for the unseen language.
To investigate the case where the target speaker information is completely unavailable, we also used the x-vector from a seen language. We chose the fr speaker because es and fr are both categorized as Western Romance in Glottolog [Hammarström et al., 2021]. Table 3 lists the results. Note that this case does not have MCD results, since a different speaker than the ground-truth speech was used. We can see that the unsupervised text pretraining also improved the zero-shot performance when using the x-vector from the fr speaker. In the proposed byte-based model, the cross-lingual x-vector showed the lower CER. This might be because the es x-vector was not present in the training data, whereas the fr x-vector was.

Method                                es x-vector         fr x-vector
                                      MCD    CER          CER
Natural                               -      2.71         2.71
Oracle
Bytes monolingual                     8.65   10.70        -
IPA monolingual                       8.47   5.28         -
IPA multilingual                      6.20   5.32         6.99
Baseline (Fully zero-shot TTS)
Bytes multilingual                    11.22  64.07        66.45
IPA multilingual                      10.75  44.75        44.37
Proposed (Text-seen zero-shot TTS)
Bytes multilingual                    9.05   18.27        13.74
IPA multilingual                      9.44   11.69        13.33

Table 3: Evaluation results for unseen language.
3.4 Ablation Study
To further evaluate our method, we conducted several ablation studies. Table 4 lists the results. Bytes multilingual represents the byte-based proposed method in the evaluations of § 3.2 and § 3.3. Note that it used the frozen language-aware embedding layer as formulated in Eq. (9). Some additional studies of our method are also presented in the Appendix.

Method                                      de           fr           ru           fi           es (unseen)   Avg.
                                            MCD   CER    MCD   CER    MCD   CER    MCD   CER    MCD   CER     MCD   CER
Bytes multilingual                          5.65  3.79   6.48  7.15   7.38  10.62  4.99  5.28   9.05  18.27   6.46  9.58
W/o bottleneck layer                        6.06  5.01   7.15  9.09   7.71  28.52  5.33  6.47   10.26 24.01   6.99  13.74
W/o language ID                             6.07  5.09   7.09  9.99   7.77  22.58  5.23  6.99   10.45 32.70   6.96  14.06
W/o initializing encoder                    5.59  3.75   6.52  9.31   7.12  16.47  4.86  5.03   9.02  21.91   6.42  11.85
Updating language-aware embedding layer     6.05  6.22   6.75  6.93   7.46  11.42  5.16  8.00   9.48  17.21   6.75  10.62

Table 4: Ablation studies on training and model configurations. Bold indicates best metrics on average (Avg.).

In W/o bottleneck layer, we excluded the bottleneck layer and simply added the token and language embeddings to obtain the encoder input in Eq. (2). We found that removing the bottleneck layer led to a performance drop in all the languages and metrics, with an average increase of 0.53 in MCD and 4.16% in CER. The largest increase was observed in the unseen language, with an increase of 1.21 in MCD. This suggests that the bottleneck layer, which projects the token and language embeddings into the hidden input text representation with nonlinear dimensionality reduction, is effective in improving the generalization for zero-shot TTS.
We also evaluated the effect of including language IDs in the proposed method by comparing it with a version that excluded language IDs, referred to as W/o language ID. It corresponds to a simple multilingual BERT pretraining [Wu and Dredze, 2019] that uses only text tokens across different languages. We observed that the use of language IDs led to an average improvement of 0.5 in MCD and 4.48% in CER, indicating the effectiveness of our approach in using language IDs.
In W/o initializing encoder, we did not initialize the encoder θ_E before the supervised learning described in § 2.2. Instead, we only initialized the parameters θ_T, θ_L, and θ_B with the parameters pretrained in § 2.1. Through this evaluation, we investigated whether the performance gain of our method results from the initialization of the language-aware embedding layer or of the encoder. We observed that W/o initializing encoder resulted in an improvement of 0.04 in MCD and only a 2.27% increase in CER on average, suggesting that our method benefits more from the pretraining of the language-aware embedding layer than from that of the encoder.
In Updating language-aware embedding layer, we updated the language-aware embedding layer during supervised learning, as formulated in Eq. (8). We observed that freezing the language-aware embedding layer led to better performance for most languages and metrics, resulting in an average difference of 0.29 in MCD and 1.04% in CER.
3.5 Dependency on Unseen Languages
We conducted evaluations on zero-shot TTS for different unseen languages. The eight European languages included in the paired data are composed of the Indo-European and Uralic language families defined in Glottolog [Hammarström et al., 2021]. In this evaluation, we selected de and hu, one from each of the families. During supervised learning in § 2.2, we excluded the paired data for each of de and hu and instead included the paired data for es. Table 5 lists the results. We chose the IPA-based baseline method, which had shown better results in § 3.3. We observed that the pretraining improved the CER by around 10% and the MCD by around 0.3 for de. However, the improvement in CER for hu was limited to 2%, while the MCD was improved by around 0.5. These results suggest that the performance of our zero-shot TTS is language dependent, as observed in previous work on cross-lingual transfer for NLP tasks [Wu and Dredze, 2019].

Method                   de             hu
                         MCD    CER     MCD    CER
Natural                  -      2.75    -      2.12
Oracle
IPA monolingual          7.38   4.07    7.59   24.62
IPA multilingual         6.16   9.76    5.28   9.11

Table 5: Analysis on different unseen languages.

Fig. 3 visualizes the token embeddings Z and the encoder inputs H_in averaged over each utterance. We used t-distributed stochastic neighbor embedding (t-SNE) [der Maaten and Hinton, 2008]. We observed overlaps in the token embeddings for (es, fr) and (de, nl), which are classified as Western Romance and West Germanic in Glottolog, respectively. The encoder inputs are separated in the embedding space for each language. The results in Table 5 and the visualization suggest that cross-lingual transfer works better when similar languages sharing the token embedding space are present during supervised learning. However, for languages with distinct token and language embeddings, the cross-lingual transferability might be limited. We leave further analysis of language dependencies as a topic for future research.

Figure 3: Visualization of token and language embedding: (a) token embedding Z and (b) encoder inputs H_in. Pairs of similar languages (es–fr and de–nl) are overlapping in the token embedding space, while the output of the bottleneck layer separates them.
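A sketch of this visualization procedure is given below (scikit-learn and matplotlib; the array shapes and t-SNE settings are assumptions):

import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

def plot_utterance_embeddings(vectors: np.ndarray, languages: list) -> None:
    # vectors: (num_utterances, dim), each row an utterance-averaged token embedding Z
    # (or encoder input H_in); languages: one language code per utterance.
    points = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(vectors)
    for lang in sorted(set(languages)):
        idx = [i for i, l in enumerate(languages) if l == lang]
        plt.scatter(points[idx, 0], points[idx, 1], s=5, label=lang)
    plt.legend()
    plt.show()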
3.6 Subjective Evaluations on Naturalness
We conducted evaluations on naturalness as described in § 3.1. Fig. 4 shows the results for the seen languages. Note that we conducted the listening tests for de and fr. For each language, one of the proposed methods showed the highest MOS, while we did not observe any significant difference between the proposed methods and the best baseline method, which was the IPA-based multilingual model with LIDs. To further validate our results, we also evaluated the naturalness with an AMOS prediction model, as shown in Fig. 4. We observed that one of the proposed methods showed the highest score in all the languages. On average, the byte-based and IPA-based proposed models scored 2.89 and 2.84, respectively, while the best baseline method obtained 2.83 (footnote 7). Additionally, we observed that the byte-based proposed model often scored higher than the IPA-based proposed model, which is consistent with the results in Table 2.

Figure 4: MOS and AMOS results for seen languages (MOS for de and fr; AMOS for de, fr, ru, fi, hu, nl, el, and the average).

7 The AMOS tended to be lower than the MOS. While the MOS prediction model has a high correlation, it may produce errors in predicting absolute values, as reported in previous work [Saeki et al., 2022a]. The relative relationships are more reliable in the AMOS.

Fig. 5 shows the results for the unseen language. The oracle methods had the highest MOS of 3.76 and 3.96, and the baseline zero-shot method had the lowest MOS of 3.29. The proposed methods outperformed the baseline method, and the byte- and IPA-based models had MOS of 3.44 and 3.32, respectively. The AMOS results were consistent with the listening test results, with the proposed zero-shot TTS methods outperforming the baseline method. In this evaluation, the proposed byte-based model scored 3.21 on AMOS, while the oracle IPA-based model scored 3.20. To further validate the results, we conducted a preference AB test on naturalness with 25 raters. As shown in Fig. 5, our byte-based model significantly outperformed the baseline IPA-based model.

Figure 5: MOS, AMOS, and AB test results for unseen language. Error bars in MOS results represent 95% confidence intervals. (Compared systems: Natural, Oracle (IPA monolingual), Oracle (IPA multilingual), Baseline (IPA multilingual), Proposed (Bytes multilingual), and Proposed (IPA multilingual).)

4 Related Work
Multilingual TTS: While previous work on multilingual TTS has primarily focused on resource-rich languages [Zen et al., 2012; Li and Zen, 2016], there is growing interest in developing TTS models for low-resource languages. Several studies have explored input tokens shared across languages, such as bytes [Li et al., 2019a; He et al., 2021], IPA symbols [Gutkin, 2017], and articulatory features [Lux and Vu, 2022], to transfer knowledge from resource-rich to low-resource languages. Grapheme tokens can eliminate the per-language G2P knowledge, and previous work has built a byte-based TTS model for around 40 languages [He et al., 2021]. There has also been work using phonological features derived from IPA to achieve zero-shot TTS [Staib et al., 2020]. Our framework achieves zero-shot cross-lingual transfer with bytes by leveraging multilingual text pretraining. There have been studies on using untranscribed speech data for low-resource scenarios by leveraging a VQ-VAE [Zhang and Lin, 2020] or an ASR model [Ren et al., 2019; Ni et al., 2022]. Other work [Saeki et al., 2022b] has trained a massively multilingual TTS model using paired TTS, paired ASR, unpaired speech, and unpaired text data. While it also performs text-only training as in our work, it still uses the paired speech-text data of the target languages. Our framework is simple and scalable, while pioneering a novel paradigm with a zero-shot TTS approach that relies only on text data.

Cross-lingual representation learning for NLP: There have been studies on learning cross-lingual representations that can be applied to various NLP tasks in different languages [Gouws et al., 2015; Ruder et al., 2019]. Recent work has highlighted the strong cross-lingual transferability of multilingual BERT [Devlin et al., 2019], which has been observed to perform surprisingly well when transferred to other languages [Wu and Dredze, 2019; Conneau and Lample, 2019]. Building on this, our work leverages multilingual MLM pretraining for TTS, which improves byte-based TTS models without G2P knowledge and achieves zero-shot TTS.

Language model pretraining for TTS: Previous research has explored self-supervised text pretraining techniques for TTS. BERT models have been used to extract contextual embeddings and enhance the prosody of TTS [Hayashi et al., 2019; Xu et al., 2021]. Other studies have used phonemes jointly with graphemes [Jia et al., 2021] or sub-phonemes [Zhang et al., 2022] as the inputs of the MLM pretraining. Our work proposes multilingual MLM pretraining for TTS using text tokens shared across languages, rather than focusing on monolingual pretraining.

5 Conclusions
We presented a multilingual TTS framework that leverages unsupervised text pretraining. Our framework achieved highly intelligible zero-shot TTS for an unseen language, resulting in a CER of less than 12%. It also improved the TTS for seen languages, with byte-based models without G2P modules outperforming the IPA-based baselines. Our ablation studies provided additional insights, including the effectiveness of the frozen language embedding layer.

Limitations and future work: Our proposed framework has limitations. A performance gap remains between the oracle models and our zero-shot TTS models in terms of intelligibility, speech quality, and naturalness, as seen in the evaluations in § 3.3 and § 3.6. Further studies are needed to improve our zero-shot TTS. Our framework also has a limitation with respect to language dependency, as the results in § 3.5 suggest that this dependency is caused by the presence of similar languages during supervised learning. Our future work will focus on studying this language dependency further and developing a method that performs better for various languages.
Acknowledgments
Part of this work was supported by JSPS KAKENHI Grant Numbers 21H05054, 22H03639, and 22J12040. This work used the Bridges system [Nystrom et al., 2015], which is supported by NSF award number ACI-1445606, at the Pittsburgh Supercomputing Center. We would like to thank the research teams at Google through the internship of the first author for providing various insights on this topic.

References
[Ba et al., 2016] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
[Bañón et al., 2020] M. Bañón, P. Chen, B. Haddow, K. Heafield, H. Hoang, M. Esplà-Gomis, M. L. Forcada, A. Kamran, F. Kirefu, P. Koehn, et al. ParaCrawl: Web-scale acquisition of parallel corpora. In Proc. ACL, pages 4555–4567, 2020.
[Bapna and Firat, 2019] A. Bapna and O. Firat. Simple, scalable adaptation for neural machine translation. In Proc. EMNLP-IJCNLP, pages 1538–1548, 2019.
[Conneau and Lample, 2019] A. Conneau and G. Lample. Cross-lingual language model pretraining. In Proc. NeurIPS, pages 7059–7069, 2019.
[der Maaten and Hinton, 2008] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. JMLR, 9(11):2579–2605, 2008.
[Devlin et al., 2019] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proc. NAACL, pages 4171–4186, 2019.
[Ebrahimi and Kann, 2021] A. Ebrahimi and K. Kann. How to adapt your pretrained multilingual model to 1600 languages. In Proc. ACL-IJCNLP, pages 4555–4567, 2021.
[Fukada et al., 1992] T. Fukada, K. Tokuda, T. Kobayashi, and S. Imai. An adaptive algorithm for mel-cepstral analysis of speech. In Proc. ICASSP, pages 137–140, 1992.
[Gouws et al., 2015] S. Gouws, Y. Bengio, and G. Corrado. BilBOWA: Fast bilingual distributed representations without word alignments. In Proc. ICML, pages 748–756, 2015.
[Gutkin, 2017] A. Gutkin. Uniform multilingual multi-speaker acoustic model for statistical parametric speech synthesis of low-resourced languages. In Proc. Interspeech, pages 2183–2187, 2017.
[Hammarström et al., 2021] H. Hammarström, R. Forkel, M. Haspelmath, and S. Bank. Glottolog 4.5. Max Planck Institute for the Science of Human History, 2021.
[Hayashi et al., 2019] T. Hayashi, S. Watanabe, T. Toda, K. Takeda, S. Toshniwal, and K. Livescu. Pre-trained text embeddings for enhanced text-to-speech synthesis. In Proc. Interspeech, pages 4430–4434, 2019.
[Hayashi et al., 2021] T. Hayashi, R. Yamamoto, T. Yoshimura, P. Wu, J. Shi, T. Saeki, Y. Ju, Y. Yasuda, S. Takamichi, and S. Watanabe. ESPnet2-TTS: Extending the edge of TTS research. arXiv preprint arXiv:2110.07840, 2021.
[He et al., 2021] M. He, J. Yang, L. He, and F. K. Soong. Multilingual Byte2Speech models for scalable low-resource speech synthesis. arXiv preprint arXiv:2103.03541, 2021.
[Hendrycks and Gimpel, 2016] D. Hendrycks and K. Gimpel. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415, 2016.
[Jia et al., 2021] Y. Jia, H. Zen, J. Shen, Y. Zhang, and Y. Wu. PnG BERT: Augmented BERT on phonemes and graphemes for neural TTS. arXiv preprint arXiv:2103.15060, 2021.
[Jr, 2005] R. G. Gordon Jr. Ethnologue, languages of the world. https://www.ethnologue.com/, 2005. Accessed: 2023-05-27.
[Kim et al., 2021] J. Kim, J. Kong, and J. Son. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In Proc. ICML, pages 5530–5540, 2021.
[Kong et al., 2020] J. Kong, J. Kim, and J. Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Proc. NeurIPS, 33:17022–17033, 2020.
[Li and Zen, 2016] B. Li and H. Zen. Multi-language multi-speaker acoustic modeling for LSTM-RNN based statistical parametric speech synthesis. In Proc. Interspeech, pages 2468–2472, 2016.
[Li et al., 2019a] B. Li, Y. Zhang, T. Sainath, Y. Wu, and W. Chan. Bytes are all you need: End-to-end multilingual speech recognition and synthesis with bytes. In Proc. ICASSP, pages 5621–5625, 2019.
[Li et al., 2019b] N. Li, S. Liu, Y. Liu, S. Zhao, and M. Liu. Neural speech synthesis with Transformer network. In Proc. AAAI, pages 6706–6713, 2019.
[Li et al., 2022] X. Li, F. Metze, D. R. Mortensen, A. W. Black, and S. Watanabe. ASR2K: Speech recognition for around 2000 languages without audio. arXiv preprint arXiv:2209.02842, 2022.
[Lux and Vu, 2022] F. Lux and T. Vu. Language-agnostic meta-learning for low-resource text-to-speech with articulatory features. In Proc. ACL, pages 6858–6868, 2022.
[Munich Artificial Intelligence Laboratories GmbH, 2017] Munich Artificial Intelligence Laboratories GmbH. The M-AILABS speech dataset. https://www.caito.de/2019/01/the-m-ailabs-speech-dataset/, 2017. Accessed: 2023-05-27.
[Nagrani et al., 2017] A. Nagrani, J. S. Chung, and A. Zisserman. VoxCeleb: A large-scale speaker identification dataset. In Proc. Interspeech, pages 2616–2620, 2017.
[Nair and Hinton, 2010] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proc. ICML, 2010.
[Ni et al., 2022] J. Ni, L. Wang, H. Gao, K. Qian, Y. Zhang, S. Chang, and M. Hasegawa-Johnson. Unsupervised text-to-speech synthesis by unsupervised automatic speech recognition. In Proc. Interspeech, pages 461–465, 2022.
[Nystrom et al., 2015] N. A. Nystrom, M. J. Levine, R. Z. Roskies, and J. Ray Scott. Bridges: a uniquely flexible HPC resource for new communities and data analytics. In Proc. XSEDE, pages 1–8, 2015.
[Paolacci et al., 2010] G. Paolacci, J. Chandler, and P. G. Ipeirotis. Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5):411–419, 2010.
[Park and Mulc, 2019] K. Park and T. Mulc. CSS10: A collection of single speaker speech datasets for 10 languages. Proc. Interspeech, pages 1566–1570, 2019.
[Pires et al., 2019] T. Pires, E. Schlinger, and D. Garrette. How multilingual is multilingual BERT? In Proc. ACL, pages 4996–5001, 2019.
[Prakash et al., 2019] A. Prakash, A. L. Thomas, S. Umesh, and H. A. Murthy. Building multilingual end-to-end speech synthesisers for Indian languages. In Proc. SSW, pages 194–199, 2019.
[Radford et al., 2022] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356, 2022.
[Ravanelli et al., 2021] M. Ravanelli, T. Parcollet, P. Plantinga, A. Rouhe, S. Cornell, L. Lugosch, C. Subakan, N. Dawalatabad, A. Heba, J. Zhong, et al. SpeechBrain: A general-purpose speech toolkit. arXiv preprint arXiv:2106.04624, 2021.
[Ren et al., 2019] Y. Ren, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu. Almost unsupervised text to speech and automatic speech recognition. In Proc. ICML, pages 5410–5419, 2019.
[Ruder et al., 2019] S. Ruder, I. Vulić, and A. Søgaard. A survey of cross-lingual word embedding models. JAIR, 65:569–631, 2019.
[Saeki et al., 2022a] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari. UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022. In Proc. Interspeech, pages 4521–4525, 2022.
[Saeki et al., 2022b] T. Saeki, H. Zen, Z. Chen, N. Morioka, G. Wang, Y. Zhang, A. Bapna, A. Rosenberg, and B. Ramabhadran. Virtuoso: Massive multilingual speech-text joint semi-supervised learning for text-to-speech. arXiv preprint arXiv:2210.15447, 2022.
[Seki et al., 2022] K. Seki, S. Takamichi, T. Saeki, and H. Saruwatari. Text-to-speech synthesis from dark data with evaluation-in-the-loop data selection. arXiv preprint arXiv:2210.14850, 2022.
[Shen et al., 2018] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, RJ Skerrv-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In Proc. ICASSP, pages 4779–4783, 2018.
[Snyder et al., 2018] D. Snyder, D. Garcia-Romero, G. Sell, D. Povey, and S. Khudanpur. X-vectors: Robust DNN embeddings for speaker recognition. In Proc. ICASSP, pages 5329–5333, 2018.
[Staib et al., 2020] M. Staib, T. H. Teh, A. Torresquintero, D. S. R. Mohan, L. Foglianti, R. Lenain, and J. Gao. Phonological features for 0-shot multilingual speech synthesis. In Proc. Interspeech, pages 2942–2946, 2020.
[Tachibana et al., 2018] H. Tachibana, K. Uenoyama, and S. Aihara. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In Proc. ICASSP, pages 4784–4788, 2018.
[Vaswani et al., 2017] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin. Attention is all you need. In Proc. NeurIPS, volume 30, 2017.
[Veaux et al., 2017] C. Veaux, J. Yamagishi, K. MacDonald, et al. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit. University of Edinburgh. The Centre for Speech Technology Research (CSTR), 2017.
[Wang et al., 2021] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Proc. ACL, pages 993–1003, 2021.
[Watanabe et al., 2018] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N.-E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, et al. ESPnet: End-to-end speech processing toolkit. Proc. Interspeech, pages 2207–2211, 2018.
[Wu and Dredze, 2019] S. Wu and M. Dredze. Beto, Bentz, Becas: The surprising cross-lingual effectiveness of BERT. In Proc. EMNLP-IJCNLP, pages 833–844, 2019.
[Xu et al., 2021] G. Xu, W. Song, Z. Zhang, C. Zhang, X. He, and B. Zhou. Improving prosody modelling with cross-utterance BERT embeddings for end-to-end speech synthesis. In Proc. ICASSP, pages 6079–6083, 2021.
[Zen et al., 2012] H. Zen, N. Braunschweiler, S. Buchholz, M. JF Gales, K. Knill, S. Krstulovic, and J. Latorre. Statistical parametric speech synthesis based on speaker and language factorization. TASLP, 20(6):1713–1724, 2012.
[Zen et al., 2019] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu. LibriTTS: A corpus derived from LibriSpeech for text-to-speech. In Proc. Interspeech, pages 1526–1530, 2019.
[Zhang and Lin, 2020] H. Zhang and Y. Lin. Unsupervised learning for sequence-to-sequence text-to-speech for low-resource languages. Proc. Interspeech, pages 3161–3165, 2020.
[Zhang et al., 2022] G. Zhang, K. Song, X. Tan, D. Tan, Y. Yan, Y. Liu, G. Wang, W. Zhou, T. Qin, T. Lee, et al. Mixed-Phoneme BERT: Improving BERT with mixed phoneme and sup-phoneme representations for text to speech. In Proc. Interspeech, pages 456–460, 2022.
A Text data for pretraining
We investigated the effect of different types of text data used in pretraining D_text on the performance of our method. As described in § 3.1, we used spoken texts from VoxPopuli, M-AILABS, and CSS10 in the evaluations presented in § 3. However, we can acquire a large amount of text data from datasets designed for NLP or from web-crawled text resources. These data typically consist of written texts and cover a wide range of domains. To investigate the effectiveness of using written texts for our multilingual TTS, we used ParaCrawl [Bañón et al., 2020], a web-crawled text dataset built for machine translation, for D_text during the unsupervised pretraining described in § 2.1. For the investigation, we randomly sampled the same amount of text as in Table 1 for each language. We then trained our model in three different cases: 1) Spoken Text: only using the spoken text for pretraining, as in the previous evaluations; 2) Written Text: only using the text data from ParaCrawl; and 3) Spoken+Written Text: combining the text data of Spoken Text and Written Text. We used the byte-based proposed model presented in § 3.2 and § 3.3. Table 6 lists the results.
We observed that Spoken Text outperformed Written Text in all the metrics and languages, resulting in an average difference of 0.3 in MCD and 2.94% in CER. These results demonstrate the effectiveness of using spoken text for pretraining. Spoken+Written Text showed on average 0.11 lower MCD and 0.46% higher CER compared to Spoken Text. However, for the unseen language, Spoken+Written Text outperformed Spoken Text in both MCD and CER. These results suggest that adding written text data can improve the generalization of our TTS models for zero-shot scenarios.

Method                   de           fr           ru           fi           es (unseen)   Avg.
                         MCD   CER    MCD   CER    MCD   CER    MCD   CER    MCD   CER     MCD   CER
Spoken Text              5.65  3.79   6.48  7.15   7.38  10.62  4.99  5.28   9.05  18.27   6.46  9.58
Written Text             5.81  4.55   6.94  9.10   7.61  21.24  5.22  12.73  9.50  18.44   6.76  12.52
Spoken+Written Text      5.54  3.72   6.34  7.51   7.07  15.33  4.96  5.44   8.82  17.48   6.35  10.04

Table 6: Evaluation results with different types of text data used for pretraining (byte-based proposed model).

B Architecture of bottleneck layer
As described in § 2.4, we used the residual layer for our bottleneck layer, and we demonstrated the effectiveness of the residual bottleneck layer for both seen and unseen languages in the evaluation presented in § 3.4. In this section, we explored an alternative architecture for the bottleneck layer. We conducted experiments using a single-layer Transformer encoder as the bottleneck layer (referred to as Transformer encoder), comparing it with the original residual layer detailed in § 2.4 (referred to as Residual layer). Table 7 lists the results.
For the seen languages, the superior performance between the residual layer and the Transformer encoder varied, depending on the specific language and evaluation metrics. However, for the unseen language, the Transformer encoder showed higher performance, achieving an improvement of 4.12 in CER. Looking at the average scores across all languages, the Transformer encoder had a slightly lower MCD, while the CER was reduced by 0.39 when using the residual layer. These results suggest that the use of a deeper layer can improve the generalizability of the proposed model. Nevertheless, the overall performance of both models remains comparable in terms of average metrics.

Method                   de           fr           ru           fi           es (unseen)   Avg.
                         MCD   CER    MCD   CER    MCD   CER    MCD   CER    MCD   CER     MCD   CER
Residual layer           5.65  3.79   6.48  7.15   7.38  10.62  4.99  5.28   9.05  18.27   6.46  9.58
Transformer encoder      5.77  4.63   6.36  6.61   7.17  11.25  4.90  6.50   8.89  14.15   6.44  9.97

Table 7: Evaluation results with different bottleneck layer architectures.
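For reference, this alternative can be instantiated as a single Transformer encoder layer in place of the residual layer (a PyTorch sketch; the width and head count mirror the main architecture in § 3.1 but are assumptions for this variant):

import torch.nn as nn

transformer_bottleneck = nn.TransformerEncoderLayer(
    d_model=512, nhead=8, dim_feedforward=2048, batch_first=True
)
# Drop-in usage: h_in = transformer_bottleneck(z + e_l), replacing the residual layer of § 2.4.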
C Effect of excluding some languages from paired data
In this section, we deliberately excluded several languages from the paired data used for the supervised learning described in § 2.2 in order to study their impact. As shown in Table 1, the paired data originally included the languages de, fr, nl, fi, hu, ru, and el. As a comparison case, we first excluded fr, which belongs to the Italic languages, as does es, according to Glottolog [Hammarström et al., 2021]. We also removed de and nl, which belong to the Germanic languages, from the paired data. Consequently, in the comparison case, supervised learning was performed only with fi, ru, hu, and el. Table 8 lists the results. Original corresponds to the case shown in Table 1, while Excluded denotes the case where only fi, ru, hu, and el were used.
The three languages listed on the left-hand side of Table 8 (ru, hu, fi) represent the seen languages in both cases. Interestingly, the Original scenario generally outperformed the Excluded scenario for these languages. These results indicate that, in the context of multilingual TTS training, performance can potentially be improved by including a wider variety of languages rather than restricting training to similar languages.

Method                                  ru            hu            fi            de            fr            es
                                        MCD    CER    MCD    CER    MCD    CER    MCD    CER    MCD    CER    MCD    CER
Original (de, fr, nl, fi, hu, ru, el)   7.38   10.62  5.01   6.05   4.99   5.28   5.65   3.79   6.48   7.15   9.05   18.27
Excluded (fi, hu, ru, el)               7.00   11.11  5.32   6.92   4.98   5.46   10.39  34.11  10.90  49.65  10.00  24.80

Table 8: Evaluation results with the original paired-data setting (Original) and the setting excluding de, fr, and nl (Excluded).
both the baseline and the proposed methods. The top part of Fig. 6 represents results from the fifth layer, while the bottom part corresponds to results from the sixth layer. The left side of the figure shows the results of the baseline method, while the right side shows the results of the proposed method.
We observe that the fifth layer of the baseline model shows a discontinuity in the attention map, which leads to instability of the linguistic content. Conversely, in the fifth layer of the proposed model, the attention map is significantly more continuous than in the baseline method. These results suggest that our unsupervised text pretraining can improve cross-attention in the absence of paired speech-text data. The results are also reflected in the intelligibility difference between the baseline and the proposed methods presented in § 3.3.