

MULTILINGUAL TEXT-TO-SPEECH TRAINING USING CROSS LANGUAGE VOICE CONVERSION AND SELF-SUPERVISED LEARNING OF SPEECH REPRESENTATIONS

Jilong Wu¹, Adam Polyak¹, Yaniv Taigman¹, Jason Fong², Prabhav Agrawal¹, Qing He¹

¹Facebook AI    ²The University of Edinburgh

Correspondence to Jilong Wu: [email protected]

ABSTRACT

State-of-the-art text-to-speech (TTS) models can generate high-fidelity monolingual speech, but it is still challenging to synthesize multilingual speech from the same speaker. One major hurdle is training data: it is hard to find speakers who have native proficiency in several languages. One way of mitigating this issue is to generate a polyglot corpus through voice conversion. In this paper, we train such a multilingual TTS system with a novel cross-lingual voice conversion model trained on speaker-invariant features extracted from a speech representation model that is pre-trained on 53 languages through self-supervised learning [1]. To further improve the speaker identity shift, we also adopt a speaker similarity loss term during training. We then use this model to convert multilingual, multi-speaker speech data to the voice of the target speaker. By augmenting data from 4 other languages, we train a multilingual TTS system for a native monolingual English speaker that speaks 5 languages (English, French, German, Italian and Spanish). Our system achieves an improved mean opinion score (MOS) compared with the multi-speaker baseline for all languages, specifically 3.74 vs 3.62 for Spanish, 3.11 vs 2.71 for German, 3.47 vs 2.84 for Italian, and 2.72 vs 2.41 for French.

Index Terms— multilingual text-to-speech, transfer learning, self-supervised learning, voice conversion.

1. INTRODUCTION

Recent advances in speech synthesis have seen neural network-based models such as WaveRNN [2], Tacotron2 [3] and Transformer-TTS [4] synthesize natural and intelligible speech with high audio quality. Typically, these models are trained on a monolingual text-audio corpus. However, synthesizing speech from multilingual input data is still a challenging task. A natural way to train such a multilingual TTS system is to use training data from either a bilingual or a polyglot speaker. For example, this method [5] presents a Chinese-English TTS system that is trained with recordings from bilingual speakers. However, it is hard to find multilingual speakers with native proficiency in several different languages. To address this problem, one category of approaches focuses on disentangling speaker and language information from a monolingual speech corpus. For instance, this work [6] builds a multilingual TTS system through cross-lingual voice cloning, incorporating an adversarial term in the loss function to enable the model to disentangle the speaker representation. Another paradigm for tackling the same problem of insufficient speaker-language pairs is to explicitly generate that data using cross-lingual voice conversion.

In this paper, we propose a multilingual TTS system trained with a polyglot speech corpus generated by a novel cross-lingual voice conversion model. To train the voice conversion model, we first extract speaker-invariant features from a speech representation model pre-trained through self-supervised learning, leveraging a large amount of multilingual speech data [1]. Concatenated with speaker and pitch features, these extracted features are fed into a HiFiGAN [7] based neural vocoder to generate speech for the target speaker. We also apply multi-task learning with a loss on speaker embeddings to ensure speaker identity similarity during training. Details of the proposed training and inference are discussed in Section 3. Then, to build a multilingual TTS voice, we start with a set of monolingual TTS voices from multiple languages (e.g. Italian, French, Spanish, German). We use the proposed voice conversion model to convert the speaker identities of these non-English languages to the English speaker. Hence, we generate a synthetic dataset containing multilingual data consisting of Italian, French, Spanish and German for an English speaker. The synthetic dataset, combined with the English speaker's original data, is then used to train a single-speaker multilingual TTS system with the same model architecture as this work [8]. The whole system is shown in Figure 1.

Fig. 1. Multilingual TTS system. Audio data is colored green and input text is colored yellow. The proposed HiFiGAN-VC and multilingual TTS models (for the target speaker) are colored blue.

2. RELATED WORK

There have been several studies using cross-lingual voice conversion [9] to build a polyglot speech corpus for a multilingual TTS system [10, 11]. Among recent works, many papers have shown good performance using phonetic posteriorgrams (PPG) as speaker-invariant features for the voice conversion model [12, 13]. However, a PPG-based system needs to go through two stages in order to obtain the desired input features before training the audio synthesizer. It first needs to extract PPG features from the target speech using a pre-trained Automatic Speech Recognition (ASR) model. Then it needs to train another model to map the target speech's PPG to the source speaker's acoustic features (for example, mel spectrograms). In this paper, instead of using PPG, we use self-supervised speech representations extracted from XLSR-53 [1], a wav2vec 2.0-based [14] pre-trained cross-lingual model. Compared with the PPG-based approach, our model can train the speech synthesizer directly with features extracted from the cross-lingual representation model, without training any extra model for feature mapping. Previous work [15, 16, 17] showed that such features enable high-quality audio generation.

3. METHOD

3.1. Multilingual Voice Conversion

The proposed voice conversion model, HiFiGAN-VC, is based on a generative adversarial network architecture similar to [16, 17, 18] and a HiFiGAN neural vocoder [7]. It is conditioned on speaker-invariant content representations, spectral features, and a learned speaker embedding for speaker identity to generate raw speech waveforms. Once trained, voice conversion can be achieved by changing the speaker identity representation to that of the desired speaker. HiFiGAN-VC is illustrated in Figure 2.

We denote the domain of audio samples by X ⊂ R. The representation for a raw signal is therefore a sequence of samples x = (x_1, ..., x_T), where x_t ∈ X for all 1 ≤ t ≤ T.

Input Features. Given an input x, we use the XLSR-53 cross-lingual speech representation [1] as the content representation. The representation is extracted via a wav2vec 2.0 [14] encoder, E_w2v, trained on raw speech waveforms in multiple languages. Speaker identity is controlled via a speaker embedding v as an additional input. The speaker embeddings are learned during training and stored in a lookup table. Finally, we use the fundamental frequency, F_0(x), as a prosodic representation, extracting pitch contours with YAAPT [19]. The F_0 values are upsampled and concatenated to the content representation, E_w2v(x). Then, the speaker embedding v is repeated and concatenated to each time-step of the former concatenation. To summarize, the encoding is given as E(x) = [E_w2v(x), F_0(x), v].
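As a concrete illustration of this feature assembly, the following PyTorch sketch concatenates pre-computed XLSR-53 features, an F_0 contour resampled to the content frame rate, and a repeated speaker embedding; the tensor shapes, frame rates and linear interpolation are assumptions for the example, not details taken from the paper.

import torch
import torch.nn.functional as F

def build_encoding(w2v_feats, f0, speaker_emb):
    """Return E(x) = [E_w2v(x), F0(x), v] as one frame-level tensor.

    w2v_feats:   (T, C) frame-level content features from the XLSR-53 encoder
    f0:          (T_f0,) pitch contour (e.g. from YAAPT), possibly at another rate
    speaker_emb: (D,) learned speaker embedding v from the lookup table
    """
    T = w2v_feats.shape[0]
    # Resample the F0 contour to the content frame rate.
    f0_up = F.interpolate(f0.view(1, 1, -1), size=T, mode="linear",
                          align_corners=False).view(T, 1)
    # Repeat the speaker embedding at every time-step.
    spk = speaker_emb.unsqueeze(0).expand(T, -1)
    return torch.cat([w2v_feats, f0_up, spk], dim=-1)

# Toy usage with random tensors standing in for real features.
enc = build_encoding(torch.randn(200, 1024), torch.rand(100), torch.randn(64))
print(enc.shape)  # torch.Size([200, 1089])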
Decoder. A neural speech synthesizer is used to generate audio output from the extracted representation described above. We use an adapted HiFiGAN [7] architecture as the speech synthesizer. HiFiGAN is reported to produce good audio quality with fast inference speed, and fast inference enables us to efficiently generate large amounts of multilingual data in our data augmentation stage.

HiFiGAN uses a generative adversarial network (GAN) setup to produce raw waveforms. In our work we denote the HiFiGAN generator as G, a fully convolutional neural network that takes the encoding described above as input and outputs raw waveforms. The discriminator network D is composed of two discriminators: (i) a multi-period discriminator (MPD) to handle periodic signals, and (ii) a multi-scale discriminator (MSD) to handle long-term dependencies. Both the MPD and the MSD consist of multiple sub-discriminators, D_j, operating at different periods and scales respectively. Specifically, the MPD employs five period discriminators with period hops of [2, 3, 5, 7, 11], and the MSD's three scale discriminators operate at the original input scale, a ×2 downsampled scale, and a ×4 downsampled scale. Each sub-discriminator D_j minimizes the following loss functions,

L_{adv}(D_j, G) = \sum_x \| 1 - D_j(\hat{x}) \|_2^2 ,
L_D(D_j, G) = \sum_x \left[ \| 1 - D_j(x) \|_2^2 + \| D_j(\hat{x}) \|_2^2 \right] ,    (1)

where \hat{x} = G(E(x)) is the generator output.
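For clarity, a minimal PyTorch sketch of the least-squares GAN terms in Eq. (1) for a single sub-discriminator; it averages over a batch rather than summing over utterances, and assumes D_j returns a tensor of real-valued scores.

import torch

def adv_loss(dj_fake):
    # L_adv(D_j, G): push D_j(x_hat) toward 1 so the generator fools D_j.
    return torch.mean((1.0 - dj_fake) ** 2)

def disc_loss(dj_real, dj_fake):
    # L_D(D_j, G): push D_j(x) toward 1 for real audio and D_j(x_hat) toward 0.
    return torch.mean((1.0 - dj_real) ** 2) + torch.mean(dj_fake ** 2)

# Toy usage with scores for a batch of real clips x and converted clips x_hat.
dj_real, dj_fake = torch.rand(4, 1), torch.rand(4, 1)
print(adv_loss(dj_fake).item(), disc_loss(dj_real, dj_fake).item())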
Additionally, three more terms are added to the loss function. The first one is a feature-matching loss [20], which measures the distance between the discriminator activations of a real signal and those of the generator output,

L_{fm}(D_j, G) = \sum_x \sum_{i=1}^{R} \frac{1}{M_i} \| \psi_i(x) - \psi_i(\hat{x}) \|_1 ,    (2)

where \psi_i is an operator that extracts the activations of the discriminator's i-th layer, M_i is the number of features in layer i, and R is the total number of layers in D_j.
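A short sketch of Eq. (2), assuming each sub-discriminator also returns its per-layer activations; since the L1 norm is divided by the number of features M_i, each layer term reduces to a mean absolute difference.

import torch

def feature_matching_loss(real_feats, fake_feats):
    # real_feats / fake_feats: lists of per-layer activations psi_i(x), psi_i(x_hat)
    # from one sub-discriminator D_j.
    loss = 0.0
    for psi_real, psi_fake in zip(real_feats, fake_feats):
        loss = loss + torch.mean(torch.abs(psi_real - psi_fake))
    return loss

# Toy usage: three layers of activations with arbitrary shapes.
real = [torch.randn(4, 16, 100), torch.randn(4, 32, 50), torch.randn(4, 64, 25)]
fake = [r + 0.1 * torch.randn_like(r) for r in real]
print(feature_matching_loss(real, fake).item())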

Fig. 2. HiFiGAN-VC. The HiFiGAN model, speaker embedder model and wav2vec 2.0 pretrained model are colored blue. Input features of HiFiGAN-VC are colored yellow.

The second term is a reconstruction term computed between the mel-spectrogram of an input signal and that of the generated signal,

L_{recon}(G) = \sum_x \| \phi(x) - \phi(\hat{x}) \|_1 ,    (3)

where \phi is a spectral operator which computes the mel-spectrogram.

The third term is a speaker similarity term computed between the input signal and a signal converted to the identity of the input signal,

L_{spkr}(G) = \sum_x \left[ 1 - \cos(E_{id}(x), E_{id}(\hat{y})) \right] ,    (4)

where x is an utterance spoken by a speaker with embedding v, \hat{y} = G(E_w2v(y), F_0(y), v) is the conversion of a sample y to the speaker with embedding v, E_id is a speaker embedding network [21], and cos is the cosine similarity between two vectors.
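These two terms can be sketched as follows, assuming a mel-spectrogram operator from torchaudio as \phi and some pre-trained speaker embedding network standing in for E_id; the spectrogram parameters are illustrative, and the losses are batch means rather than sums.

import torch
import torchaudio

# phi: mel-spectrogram operator (parameter values are assumptions, not the paper's).
mel = torchaudio.transforms.MelSpectrogram(sample_rate=24000, n_fft=1024,
                                            hop_length=256, n_mels=80)

def recon_loss(x, x_hat):
    # L_recon of Eq. (3): L1 distance between mel-spectrograms of input and output.
    return torch.mean(torch.abs(mel(x) - mel(x_hat)))

def speaker_loss(x, y_hat, speaker_encoder):
    # L_spkr of Eq. (4): 1 - cosine similarity between the speaker embeddings
    # E_id(x) and E_id(y_hat), where y_hat was converted to the speaker of x.
    e_x, e_y = speaker_encoder(x), speaker_encoder(y_hat)
    return 1.0 - torch.nn.functional.cosine_similarity(e_x, e_y, dim=-1).mean()

# Toy usage with a dummy "speaker encoder" that just averages mel frames.
dummy_encoder = lambda w: mel(w).mean(dim=-1)
x, y_hat = torch.randn(1, 24000), torch.randn(1, 24000)
print(recon_loss(x, y_hat).item(), speaker_loss(x, y_hat, dummy_encoder).item())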
To summarize, the discriminator D and the generator G minimize the following terms, respectively:

L_G^{multi}(D, G) = \sum_{j=1}^{J} \left[ L_{adv}(G, D_j) + \lambda_{fm} L_{fm}(G, D_j) \right] + \lambda_r L_{recon}(G) + \lambda_{spkr} L_{spkr}(G) ,
L_D^{multi}(D, G) = \sum_{j=1}^{J} L_D(G, D_j) ,    (5)

where D_j is the j-th sub-discriminator of D. We set \lambda_{fm} = 2, \lambda_r = 45, and \lambda_{spkr} = 1.
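Putting Eq. (5) together, a compact sketch of the two training objectives; the lambda weights are the values stated above, while the per-term losses are assumed to arrive as already-computed scalars (for example from the sketches earlier in this section).

import torch

LAMBDA_FM, LAMBDA_R, LAMBDA_SPKR = 2.0, 45.0, 1.0  # weights reported in the paper

def generator_objective(adv_terms, fm_terms, l_recon, l_spkr):
    # L^multi_G: per-sub-discriminator adversarial + feature-matching terms,
    # plus weighted reconstruction and speaker-similarity terms.
    gan_part = sum(a + LAMBDA_FM * f for a, f in zip(adv_terms, fm_terms))
    return gan_part + LAMBDA_R * l_recon + LAMBDA_SPKR * l_spkr

def discriminator_objective(d_terms):
    # L^multi_D: sum of L_D over all sub-discriminators D_j.
    return sum(d_terms)

# Toy usage with J = 8 sub-discriminators (5 MPD periods + 3 MSD scales).
J = 8
adv = [torch.rand(()) for _ in range(J)]
fm = [torch.rand(()) for _ in range(J)]
print(generator_objective(adv, fm, torch.rand(()), torch.rand(())).item(),
      discriminator_objective([torch.rand(()) for _ in range(J)]).item())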
Data Augmentation. With a trained voice conversion model for the target speaker, we convert our multilingual multi-speaker speech dataset to the target speaker's voice. Formally, we denote the domain of transcription samples by S ⊂ R. The representation for a transcription is therefore a sequence of samples s = (s_1, ..., s_T), where s_t ∈ S for all 1 ≤ t ≤ T. Similar to raw audio signals, the length of the transcriptions varies across samples, so the number of input samples in the sequence, T, is not fixed. A TTS training dataset is composed of n paired examples of text and audio, S = {(s_i, x_i)}, i = 1, ..., n, where s_i is the text transcription of audio x_i. Given the desired speaker embedding v, we generate a single-speaker training dataset, S̃ = {(s_i, \hat{x}_i) | \hat{x}_i = G(E_w2v(x_i), F_0(x_i), v)}, i = 1, ..., n.
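As a sketch of this augmentation step (the encoder, generator and corpus objects below are stand-ins; only the overall recipe follows the paper), every utterance of the multilingual multi-speaker corpus is converted with the target speaker's embedding:

import torch

def convert_corpus(corpus, target_speaker_emb, encode, generator):
    # Build S~ = {(s_i, x_hat_i)}: keep each transcript s_i, replace the audio
    # with x_hat_i = G(E(x_i)) conditioned on the target speaker embedding v.
    converted = []
    with torch.no_grad():  # inference only, no gradients needed
        for transcript, wav in corpus:
            x_hat = generator(encode(wav, target_speaker_emb))
            converted.append((transcript, x_hat))
    return converted

# Toy usage with identity stand-ins for the encoder and generator.
toy_corpus = [("bonjour", torch.randn(24000)), ("ciao", torch.randn(24000))]
tilde_S = convert_corpus(toy_corpus, torch.randn(64),
                         encode=lambda w, v: w, generator=lambda e: e)
print(len(tilde_S))  # 2 converted (transcript, waveform) pairs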
3.2. Multilingual TTS System

Our TTS system uses the same system described in [8], which combines multi-rate attention based acoustic and prosody models with an adapted WaveRNN-based neural vocoder. The linguistic frontend adopts the suprasegmental International Phonetic Alphabet (IPA) representation [22]. The linguistic features are rolled out based on the phone-level duration predictions from the prosody model; they are then up-sampled by repetition to generate frame-level features, which are used by the acoustic model. Additionally, other categories of input features are extracted: sentence-level features such as language ID and speaker ID; phrase-level features such as intonation; word-level features such as word embeddings; syllable-level features such as syllable type; and phone-level features such as articulation diacritics. These features are consumed by the acoustic model to predict pitch features, periodicity features and mel-frequency cepstrum features (MFCC). Both the acoustic and prosody models are trained using the Mean Square Error (MSE) loss function as described in [8]. For the last step, we use an adapted version of the WaveRNN neural vocoder to condition on the predicted acoustic features and synthesize speech for the target speaker.
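The roll-out by repetition can be illustrated with a one-liner (a sketch under the assumption that durations are integer frame counts per phone; the actual frontend in [8] is more involved):

import torch

def upsample_by_duration(phone_feats, durations):
    # Repeat each phone-level feature vector by its predicted duration (in frames)
    # to obtain frame-level features for the acoustic model.
    return torch.repeat_interleave(phone_feats, durations, dim=0)

# Toy usage: 3 phones with durations of 2, 5 and 3 frames and 4 features each.
feats = torch.arange(12.0).reshape(3, 4)
frames = upsample_by_duration(feats, torch.tensor([2, 5, 3]))
print(frames.shape)  # torch.Size([10, 4])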
4. EXPERIMENTS

4.1. Datasets

Both the HiFiGAN-VC and TTS models are trained with our internal dataset, which was recorded in a voice production studio by contracted professional voice talents. Each voice talent speaks only one language. Our voice data come from voice talents speaking English, German, Italian, French and Spanish. We use 24 kHz recorded audio from 8 speakers: 3 English speakers (32 hours, 13 hours, 13 hours), 1 French speaker (28.5 hours), 1 German speaker (8.5 hours), 1 Spanish speaker (23.7 hours) and 3 Italian speakers (9.8 hours, 9.7 hours, 12.3 hours).

In this work, we train a multilingual TTS system using one of the English speakers' voices that is able to synthesize audio in 4 other languages: Spanish, German, French, and Italian. In our internal dataset, we only have the target speaker's data in English. To obtain data for the other 4 languages, we first train a HiFiGAN-VC with internal data from the 8 speakers covering the 4 targeted languages, as described above.

Then we use the audio data from 4 speakers who are monolingual native speakers of German, Spanish, Italian and French respectively, and convert their voices to the identity of our target English speaker. We now have a set of data in the target speaker's voice obtained through voice conversion in 4 languages, Spanish (8 hours), French (4 hours), German (7 hours) and Italian (9 hours), plus the data from paid recordings in English (32 hours).

Table 1. Speaker identity Equal-Error-Rate (%)

Models       ES    DE    IT    FR
Base multi   44.0  29.3  60.0  40.5
Base mono     2.0  11.1   9.5   4.0
Proposed      2.0  14.5  18.3  19.0

Fig. 3. MOS results described in the subjective evaluation section. 95% confidence intervals are depicted as black lines. For each language, from left to right: Base vc (colored blue), Base multi (colored orange), Base mono (colored grey), Proposed model (colored yellow). Languages evaluated are Spanish (ES), German (DE), Italian (IT) and French (FR).
4.2. Evaluation

We evaluate the multilingual TTS system against three baselines. The first baseline is a TTS system trained only with the target speaker's recorded monolingual data (Base mono). The second baseline (Base multi) is a multilingual multi-speaker TTS system trained with the 8 speakers' data described in Datasets. Its training data has the same set of language-speaker pairs as those used in the proposed voice conversion model; this baseline relies on transfer learning to enable the target English speaker to speak multiple languages. A third baseline (Base vc) is set up as follows: we first train separate monolingual TTS models for each target language, then use those TTS models to synthesize multilingual audio samples in the native speakers' voices of each language, and finally use the proposed voice conversion model to convert those audio samples to the target English speaker's voice. This baseline is expected to deliver high quality because each TTS model is trained in its native language. However, it is undesirable for practical applications because running a TTS model followed by a voice conversion model at inference time is computationally expensive. All the TTS systems above use the same spectrum, prosody, and neural vocoder models.
4.2.1. Subjective evaluation

For evaluation, we conducted a mean opinion score (MOS) study which used test set utterances excluded from the training data. We used a crowdsourcing platform which recruited the following number of participants for each test set language: 277 for Spanish, 39 for French, 290 for Italian, and 39 for German. Each rater rated 30 randomly assigned samples. For each MOS test, the participants rated each sample between 1 and 5 (1: bad, 5: excellent). We asked participants to rate the audio samples in terms of overall naturalness and intelligibility. As shown in Figure 3, for each language our proposed model achieves higher MOS than the multi-speaker and mono-speaker baselines. Even compared with Base vc, whose TTS models are trained with native speakers, the scores are comparable. A selection of samples can be found on the webpage of this paper¹.
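For reference, a small sketch of how a per-language MOS and the 95% confidence interval bars of Figure 3 can be computed from raw ratings; the normal-approximation interval is an assumption, since the paper does not state its aggregation procedure.

import math

def mos_with_ci(ratings, z=1.96):
    # Mean opinion score and a normal-approximation 95% confidence half-width.
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)
    return mean, z * math.sqrt(var / n)

# Toy usage with a handful of 1-5 ratings for one system/language pair.
print(mos_with_ci([4, 3, 5, 4, 4, 3, 5, 4]))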
4.2.2. Objective evaluation

Besides the MOS studies, we also conducted an objective evaluation of speaker similarity. We enrolled the 5 speakers from Data Augmentation with a trained speaker embedding network [21] and calculated the Equal-Error-Rate (EER) based on the speaker embeddings' multi-class similarity matrix. The results are shown in Table 1. There is a noticeable improvement compared with Base multi, which is trained with multi-speaker data, and the proposed model also narrows the gap to the TTS model (Base mono) trained purely on recordings of the target speaker.
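An EER can be computed from the same-speaker and cross-speaker entries of such a similarity matrix; the simple threshold sweep below is a common recipe and an assumption, not necessarily the exact procedure used here.

def equal_error_rate(genuine_scores, impostor_scores):
    # Sweep thresholds and return the point where the false-acceptance rate (FAR)
    # and the false-rejection rate (FRR) are closest; scores are similarities.
    best_gap, eer = 2.0, 0.0
    for thr in sorted(set(genuine_scores + impostor_scores)):
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        frr = sum(s < thr for s in genuine_scores) / len(genuine_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

# Toy usage: same-speaker pairs should score higher than cross-speaker pairs.
print(equal_error_rate([0.9, 0.8, 0.85, 0.7], [0.2, 0.4, 0.5, 0.75]))  # 0.25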
5. CONCLUSIONS

In conclusion, we propose a new way to train a multilingual TTS system with data augmentation through voice conversion. The voice conversion model is trained with wav2vec 2.0 representations and uses HiFiGAN as the audio synthesizer. Our MOS study shows that the proposed multilingual model achieves better quality than the multi-speaker or single-speaker baselines, and comparable quality to baseline TTS models trained on native speakers' datasets. For future work, we plan to explore other audio synthesizer architectures to further improve the quality of the voice conversion model.

¹https://multilingual-tts-data-aug.github.io/multilingual_tts_data_aug/

6. REFERENCES

[1] Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979, 2020.

[2] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In International Conference on Machine Learning, pages 2410-2419. PMLR, 2018.

[3] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779-4783. IEEE, 2018.

[4] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with Transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6706-6713, 2019.

[5] Huaiping Ming, Yanfeng Lu, Zhengchen Zhang, and Minghui Dong. A light-weight method of building an LSTM-RNN-based bilingual TTS system. In 2017 International Conference on Asian Language Processing (IALP), pages 201-205. IEEE, 2017.

[6] Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. arXiv preprint arXiv:1907.04448, 2019.

[7] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. arXiv preprint arXiv:2010.05646, 2020.

[8] Qing He, Zhiping Xiu, Thilo Koehler, and Jilong Wu. Multi-rate attention architecture for fast streamable text-to-speech spectrum modeling. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5689-5693. IEEE, 2021.

[9] Sai Sirisha Rallabandi and Suryakanth V Gangashetty. An approach to cross-lingual voice conversion. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1-7, 2019.

[10] P Vijayalakshmi, B Ramani, MP Actlin Jeeva, and T Nagarajan. A multilingual to polyglot speech synthesizer for Indian languages using a voice-converted polyglot speech corpus. Circuits, Systems, and Signal Processing, 37(5):2142-2163, 2018.

[11] B Ramani, MP Actlin Jeeva, P Vijayalakshmi, and T Nagarajan. Cross-lingual voice conversion-based polyglot speech synthesizer for Indian languages. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[12] Qinghua Sun and Kenji Nagamatsu. Building multi lingual TTS using cross lingual voice conversion. arXiv preprint arXiv:2012.14039, 2020.

[13] Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, and Bin Ma. Towards natural bilingual and code-switched speech synthesis based on mix of monolingual recordings and cross-lingual voice conversion. arXiv preprint arXiv:2010.08136, 2020.

[14] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.

[15] Adam Polyak, Lior Wolf, and Yaniv Taigman. TTS skins: Speaker conversion via ASR. INTERSPEECH, 2020.

[16] Adam Polyak et al. Unsupervised cross-domain singing voice conversion. In INTERSPEECH, 2020.

[17] Adam Polyak et al. High fidelity speech regeneration with application to speech enhancement. ICASSP, 2021.

[18] Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. Speech resynthesis from discrete disentangled self-supervised representations. INTERSPEECH, 2021.

[19] K. Kasi and S. A. Zahorian. Yet another algorithm for pitch tracking. ICASSP, 2002.

[20] Anders Boesen Lindbo Larsen et al. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.

[21] Li Wan et al. Generalized end-to-end loss for speaker verification. ICASSP, 2018.

[22] International Phonetic Association. The International Phonetic Alphabet, 2015. https://www.internationalphoneticassociation.org/sites/default/files/IPA_Kiel_2015.pdf.
