Multilingual Text-To-Speech Training Using Cross Language Voice Conversion and Self-Supervised Learning of Speech Representations
Jilong Wu¹, Adam Polyak¹, Yaniv Taigman¹, Jason Fong², Prabhav Agrawal¹, Qing He¹

¹Facebook AI, ²The University of Edinburgh
Fig. 1. Multilingual TTS system. Audio data is colored green and input text is colored yellow. The proposed HiFiGAN-VC and multilingual TTS models (for the target speaker) are colored blue.

…lingual TTS system [10, 11]. Among recent works, many papers have shown good performance using the phonetic posteriorgram (PPG) as a speaker-invariant feature for the voice conversion model [12, 13]. However, a PPG-based system must go through two stages to obtain suitable input features before the audio synthesizer can be trained: it first extracts PPG features from the target speech using a pre-trained Automatic Speech Recognition (ASR) model, and it then trains another model to map the target speech's PPG to the source speaker's acoustic features (for example, mel spectrograms). In this paper, instead of using PPG, we use self-supervised speech representations extracted from XLSR-53 [1], a wav2vec 2.0-based [14] pre-trained cross-lingual model. Compared with the PPG-based approach, our model trains the speech synthesizer directly on features extracted from a cross-lingual representation model, without training any extra feature-mapping model. Previous work [15, 16, 17] showed that such features enable high-quality audio generation.

3. METHOD

3.1. Multilingual Voice Conversion

The proposed voice conversion model, HiFiGAN-VC, is based on a generative adversarial network architecture similar to [16, 17, 18] and a HiFiGAN neural vocoder [7]. It is conditioned on speaker-invariant content representations, spectral features, and a learned speaker embedding for speaker identity to generate raw speech waveforms. Once trained, voice conversion is achieved by changing the speaker identity representation to that of the desired speaker. HiFiGAN-VC is illustrated in Figure 2.

We denote the domain of audio samples by X ⊂ R. The representation of a raw signal is therefore a sequence of samples x = (x_1, . . . , x_T), where x_t ∈ X for all 1 ≤ t ≤ T.

Input Features. Given an input x, we use the XLSR-53 cross-lingual speech representation [1] as the content representation. The representation is extracted via a wav2vec 2.0 [14] encoder, E_w2v, trained on raw speech waveforms in multiple languages. Speaker identity is controlled via a speaker embedding v given as an additional input; the speaker embeddings are learned during training and stored in a lookup table. Finally, we use the fundamental frequency, F0(x), as a prosodic representation, extracting pitch contours with YAAPT [19]. The F0 values are upsampled and concatenated to the content representation, E_w2v(x), and the speaker embedding v is then repeated and concatenated to every time step of the former concatenation. To summarize, the encoding is given as

    E(x) = [E_w2v(x), F0(x), v].
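To make the construction of E(x) concrete, the following sketch assembles the three streams in PyTorch. It is only a schematic under stated assumptions: wav2vec_encoder stands in for the pretrained XLSR-53 model (treated here as a frozen black box returning frame-level features), extract_f0 stands in for a YAAPT-style pitch tracker, and the embedding size and tensor shapes are illustrative choices rather than the paper's configuration.

    # Schematic of E(x) = [E_w2v(x), F0(x), v]; `wav2vec_encoder` and
    # `extract_f0` are hypothetical stand-ins for XLSR-53 and YAAPT.
    import torch
    import torch.nn.functional as F

    class VoiceConversionEncoder(torch.nn.Module):
        def __init__(self, wav2vec_encoder, extract_f0, num_speakers, spk_dim=128):
            super().__init__()
            self.wav2vec_encoder = wav2vec_encoder  # frozen cross-lingual encoder (assumed)
            self.extract_f0 = extract_f0            # pitch tracker returning an F0 contour (assumed)
            # speaker embeddings learned during training, stored in a lookup table
            self.spk_table = torch.nn.Embedding(num_speakers, spk_dim)

        def forward(self, wav, speaker_id):
            # content representation E_w2v(x): (batch, frames, channels)
            with torch.no_grad():
                content = self.wav2vec_encoder(wav)
            # prosodic representation F0(x), upsampled to the content frame rate
            f0 = self.extract_f0(wav)                              # (batch, f0_frames)
            f0 = F.interpolate(f0.unsqueeze(1), size=content.size(1),
                               mode="linear", align_corners=False)
            f0 = f0.transpose(1, 2)                                # (batch, frames, 1)
            # speaker embedding v repeated over every time step
            v = self.spk_table(speaker_id).unsqueeze(1).expand(-1, content.size(1), -1)
            # E(x) = [E_w2v(x), F0(x), v]
            return torch.cat([content, f0, v], dim=-1)

At conversion time, passing a different speaker_id swaps the speaker identity while the content and F0 streams of the input utterance are kept.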
Decoder. A neural speech synthesizer generates the audio output from the extracted representation described above. We use an adapted HiFiGAN [7] architecture as the speech synthesizer: HiFiGAN is reported to produce good audio quality with fast inference, and fast inference lets us efficiently generate large amounts of multilingual data in our data augmentation stage.

HiFiGAN uses a generative adversarial network (GAN) setup to produce raw waveforms. In our work we denote the HiFiGAN generator as G, a fully convolutional neural network that takes the encoding described above as input and outputs raw waveforms. The discriminator network D is composed of two discriminators: (i) a multi-period discriminator (MPD) to handle periodic signals, and (ii) a multi-scale discriminator (MSD) to handle long-term dependencies. Both the MPD and the MSD consist of multiple sub-discriminators, D_j, operating at different periods and scales, respectively. Specifically, the MPD employs five period discriminators with period hops of [2, 3, 5, 7, 11], and the MSD's three scale discriminators operate at the original input scale, a ×2 downsampled scale, and a ×4 downsampled scale. For each sub-discriminator D_j, the adversarial losses are

    L_adv(D_j, G) = Σ_x ||1 − D_j(x̂)||²₂ ,
    L_D(D_j, G)  = Σ_x [ ||1 − D_j(x)||²₂ + ||D_j(x̂)||²₂ ] ,        (1)

where x̂ = G(E(x)) is the generator output.
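The two terms in Eq. (1) are least-squares GAN objectives: L_adv pushes the generator output x̂ toward the "real" label for each sub-discriminator, while L_D trains the sub-discriminator to separate real from generated audio. A minimal PyTorch sketch, assuming disc_outputs_real and disc_outputs_fake are lists of per-sub-discriminator outputs D_j(x) and D_j(x̂), and using a batch mean in place of the sum over x:

    import torch

    def generator_adv_loss(disc_outputs_fake):
        # L_adv: push D_j(x_hat) toward 1 for every sub-discriminator D_j
        return sum(torch.mean((1.0 - d_fake) ** 2) for d_fake in disc_outputs_fake)

    def discriminator_loss(disc_outputs_real, disc_outputs_fake):
        # L_D: push D_j(x) toward 1 and D_j(x_hat) toward 0
        loss = 0.0
        for d_real, d_fake in zip(disc_outputs_real, disc_outputs_fake):
            loss = loss + torch.mean((1.0 - d_real) ** 2) + torch.mean(d_fake ** 2)
        return loss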
Additionally, three more terms are added to the loss function. The first is a feature-matching loss [20], which measures the distance between the discriminator activations of a real signal and those of the generator output,

    L_fm(D_j, G) = Σ_x Σ_{i=1..R} (1/M_i) ||ψ_i(x) − ψ_i(x̂)||₁ ,        (2)

where ψ_i is an operator that extracts the activations of the i-th layer of the discriminator, M_i is the number of features in layer i, and R is the total number of layers in D_j.
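A sketch of the feature-matching term in Eq. (2), assuming feats_real and feats_fake are lists of per-layer activations ψ_i(x) and ψ_i(x̂) collected from one sub-discriminator D_j; the per-layer mean absolute error plays the role of the (1/M_i) normalization of the L1 norm:

    import torch

    def feature_matching_loss(feats_real, feats_fake):
        # sum over layers i = 1..R of (1/M_i) * ||psi_i(x) - psi_i(x_hat)||_1
        loss = 0.0
        for psi_real, psi_fake in zip(feats_real, feats_fake):
            loss = loss + torch.mean(torch.abs(psi_real - psi_fake))
        return loss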
…a sequence of samples s = (s_1, . . . , s_T), where s_t ∈ S for all 1 ≤ t ≤ T. As with raw audio signals, the length of the transcription varies from sample to sample, so the number of input samples in the sequence, T, is not fixed. A TTS training dataset is composed of n paired examples of text and audio, S = {(s_i, x_i)}_{i=1..n}, where s_i is the text transcription of audio x_i. Given the desired speaker embedding v, we generate a single-speaker training dataset

    S̃ = {(s_i, x̂_i) | x̂_i = G(E_w2v(x_i), F0(x_i), v)}_{i=1..n}.
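In other words, every multilingual utterance x_i is re-synthesized in the target speaker's voice while its transcription s_i is kept unchanged. The sketch below illustrates this data-augmentation step; it reuses the hypothetical encoder from the earlier sketch and assumes generator is the trained HiFiGAN-VC generator G and target_speaker_id indexes the target speaker's embedding v, so names and shapes are illustrative only.

    import torch

    @torch.no_grad()
    def build_single_speaker_dataset(pairs, encoder, generator, target_speaker_id):
        """pairs: iterable of (s_i, x_i) with text s_i and waveform tensor x_i."""
        converted = []
        for s_i, x_i in pairs:
            spk = torch.tensor([target_speaker_id])
            e = encoder(x_i.unsqueeze(0), spk)     # E(x_i) built with the target speaker's v
            x_hat = generator(e).squeeze(0)        # x_hat_i = G(E_w2v(x_i), F0(x_i), v)
            converted.append((s_i, x_hat))
        return converted                           # the single-speaker dataset S-tilde

The resulting text/audio pairs can then be fed to a standard single-speaker TTS training pipeline.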
…we use audio data from four monolingual speakers, native in German, Spanish, Italian, and French, respectively, and convert their voices to the identity of our target English speaker. We thus obtain data in the target speaker's voice, produced by voice conversion, in four languages: Spanish (8 hours), French (4 hours), German (7 hours), and Italian (9 hours), together with data from paid recordings in English (32 hours).
…ity. As you can see from Figure 3, for each language, our pro… (audio samples: multilingual_tts_data_aug/)
6. REFERENCES

[1] Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979, 2020.

[2] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In International Conference on Machine Learning, pages 2410–2419. PMLR, 2018.

[3] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.

[4] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6706–6713, 2019.

[5] Huaiping Ming, Yanfeng Lu, Zhengchen Zhang, and Minghui Dong. A light-weight method of building an LSTM-RNN-based bilingual TTS system. In 2017 International Conference on Asian Language Processing (IALP), pages 201–205. IEEE, 2017.

[6] Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. arXiv preprint arXiv:1907.04448, 2019.

[7] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. arXiv preprint arXiv:2010.05646, 2020.

[8] Qing He, Zhiping Xiu, Thilo Koehler, and Jilong Wu. Multi-rate attention architecture for fast streamable text-to-speech spectrum modeling. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5689–5693. IEEE, 2021.

[9] Sai Sirisha Rallabandi and Suryakanth V Gangashetty. An approach to cross-lingual voice conversion. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–7, 2019.

[10] P Vijayalakshmi, B Ramani, MP Actlin Jeeva, and T Nagarajan. A multilingual to polyglot speech synthesizer for Indian languages using a voice-converted polyglot speech corpus. Circuits, Systems, and Signal Processing, 37(5):2142–2163, 2018.

[11] B Ramani, MP Actlin Jeeva, P Vijayalakshmi, and T Nagarajan. Cross-lingual voice conversion-based polyglot speech synthesizer for Indian languages. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[12] Qinghua Sun and Kenji Nagamatsu. Building multi lingual TTS using cross lingual voice conversion. arXiv preprint arXiv:2012.14039, 2020.

[13] Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, and Bin Ma. Towards natural bilingual and code-switched speech synthesis based on mix of monolingual recordings and cross-lingual voice conversion. arXiv preprint arXiv:2010.08136, 2020.

[14] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.

[15] Adam Polyak, Lior Wolf, and Yaniv Taigman. TTS skins: Speaker conversion via ASR. In INTERSPEECH, 2020.

[16] Adam Polyak et al. Unsupervised cross-domain singing voice conversion. In INTERSPEECH, 2020.

[17] Adam Polyak et al. High fidelity speech regeneration with application to speech enhancement. In ICASSP, 2021.

[18] Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. Speech resynthesis from discrete disentangled self-supervised representations. In INTERSPEECH, 2021.

[19] K. Kasi and S. A. Zahorian. Yet another algorithm for pitch tracking. In ICASSP, 2002.

[20] Anders Boesen Lindbo Larsen et al. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.

[21] Li Wan et al. Generalized end-to-end loss for speaker verification. In ICASSP, 2018.

[22] International Phonetic Association. The International Phonetic Alphabet, 2015. https://www.internationalphoneticassociation.org/sites/default/files/IPA_Kiel_2015.pdf.