Multilingual Text-To-Speech Training Using Cross Language Voice Conversion and Self-Supervised Learning of Speech Representations
Jilong Wu¹, Adam Polyak¹, Yaniv Taigman¹, Jason Fong², Prabhav Agrawal¹, Qing He¹

¹Facebook AI, ²The University of Edinburgh
Fig. 1. Multilingual TTS system. Audio data is colored green and input text is colored yellow. The proposed HiFiGAN-VC and multilingual TTS models (for the target speaker) are colored blue.

…lingual TTS system [10, 11]. Among recent works, many papers have shown good performance using the phonetic posteriorgram (PPG) as a speaker-invariant feature for the voice conversion model [12, 13]. However, a PPG-based system must go through two stages to obtain suitable input features before the audio synthesizer can be trained: it first extracts PPG features from the target speech using a pre-trained Automatic Speech Recognition (ASR) model, and it then trains another model to map the target speech's PPG to the source speaker's acoustic features (for example, mel spectrograms). In this paper, instead of using PPG, we use self-supervised speech representations extracted from XLSR-53 [1], a wav2vec 2.0-based [14] pre-trained cross-lingual model. Compared with the PPG-based approach, our model trains the speech synthesizer directly on features extracted from a cross-lingual representation model, without training any extra feature-mapping model. Previous work [15, 16, 17] showed that such features enable high-quality audio generation.

3. METHOD

3.1. Multilingual Voice Conversion

The proposed voice conversion model, HiFiGAN-VC, is based on a generative adversarial network architecture similar to [16, 17, 18] and a HiFiGAN neural vocoder [7]. It is conditioned on speaker-invariant content representations, spectral features, and a learned speaker embedding for speaker identity to generate raw speech waveforms. Once trained, voice conversion is achieved by changing the speaker identity representation to that of the desired speaker. HiFiGAN-VC is illustrated in Figure 2.

We denote the domain of audio samples by X ⊂ R. The representation of a raw signal is therefore a sequence of samples x = (x_1, . . . , x_T), where x_t ∈ X for all 1 ≤ t ≤ T.

Input Features. Given an input x, we use the XLSR-53 cross-lingual speech representation [1] as the content representation. The representation is extracted via a wav2vec 2.0 [14] encoder, E_w2v, trained on raw speech waveforms in multiple languages. Speaker identity is controlled via a speaker embedding v given as an additional input; the speaker embeddings are learned during training and stored in a lookup table. Finally, we use the fundamental frequency, F0(x), as a prosodic representation, extracting pitch contours with YAAPT [19]. The F0 values are upsampled and concatenated to the content representation, E_w2v(x), and the speaker embedding v is then repeated and concatenated to every time step of the former concatenation. To summarize, the encoding is given as

    E(x) = [E_w2v(x), F0(x), v].
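To make the construction of E(x) concrete, the following sketch assembles the three streams in PyTorch. It is only a schematic under stated assumptions: wav2vec_encoder stands in for the pretrained XLSR-53 model (treated here as a frozen black box returning frame-level features), extract_f0 stands in for a YAAPT-style pitch tracker, and the embedding size and tensor shapes are illustrative choices rather than the paper's configuration.

    # Schematic of E(x) = [E_w2v(x), F0(x), v]; `wav2vec_encoder` and
    # `extract_f0` are hypothetical stand-ins for XLSR-53 and YAAPT.
    import torch
    import torch.nn.functional as F

    class VoiceConversionEncoder(torch.nn.Module):
        def __init__(self, wav2vec_encoder, extract_f0, num_speakers, spk_dim=128):
            super().__init__()
            self.wav2vec_encoder = wav2vec_encoder  # frozen cross-lingual encoder (assumed)
            self.extract_f0 = extract_f0            # pitch tracker returning an F0 contour (assumed)
            # speaker embeddings learned during training, stored in a lookup table
            self.spk_table = torch.nn.Embedding(num_speakers, spk_dim)

        def forward(self, wav, speaker_id):
            # content representation E_w2v(x): (batch, frames, channels)
            with torch.no_grad():
                content = self.wav2vec_encoder(wav)
            # prosodic representation F0(x), upsampled to the content frame rate
            f0 = self.extract_f0(wav)                              # (batch, f0_frames)
            f0 = F.interpolate(f0.unsqueeze(1), size=content.size(1),
                               mode="linear", align_corners=False)
            f0 = f0.transpose(1, 2)                                # (batch, frames, 1)
            # speaker embedding v repeated over every time step
            v = self.spk_table(speaker_id).unsqueeze(1).expand(-1, content.size(1), -1)
            # E(x) = [E_w2v(x), F0(x), v]
            return torch.cat([content, f0, v], dim=-1)

At conversion time, passing a different speaker_id swaps the speaker identity while the content and F0 streams of the input utterance are kept.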
Decoder. A neural speech synthesizer generates the audio output from the extracted representation described above. We use an adapted HiFiGAN [7] architecture as the speech synthesizer: HiFiGAN is reported to produce good audio quality with fast inference, and fast inference lets us efficiently generate large amounts of multilingual data in our data augmentation stage.

HiFiGAN uses a generative adversarial network (GAN) setup to produce raw waveforms. In our work we denote the HiFiGAN generator as G, a fully convolutional neural network that takes the encoding described above as input and outputs raw waveforms. The discriminator network D is composed of two discriminators: (i) a multi-period discriminator (MPD) to handle periodic signals, and (ii) a multi-scale discriminator (MSD) to handle long-term dependencies. Both the MPD and the MSD consist of multiple sub-discriminators, D_j, operating at different periods and scales, respectively. Specifically, the MPD employs five period discriminators with period hops of [2, 3, 5, 7, 11], and the MSD's three scale discriminators operate at the original input scale, a ×2 downsampled scale, and a ×4 downsampled scale. For each sub-discriminator D_j, the adversarial losses are

    L_adv(D_j, G) = Σ_x ||1 − D_j(x̂)||²₂ ,
    L_D(D_j, G)  = Σ_x [ ||1 − D_j(x)||²₂ + ||D_j(x̂)||²₂ ] ,        (1)

where x̂ = G(E(x)) is the generator output.
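The two terms in Eq. (1) are least-squares GAN objectives: L_adv pushes the generator output x̂ toward the "real" label for each sub-discriminator, while L_D trains the sub-discriminator to separate real from generated audio. A minimal PyTorch sketch, assuming disc_outputs_real and disc_outputs_fake are lists of per-sub-discriminator outputs D_j(x) and D_j(x̂), and using a batch mean in place of the sum over x:

    import torch

    def generator_adv_loss(disc_outputs_fake):
        # L_adv: push D_j(x_hat) toward 1 for every sub-discriminator D_j
        return sum(torch.mean((1.0 - d_fake) ** 2) for d_fake in disc_outputs_fake)

    def discriminator_loss(disc_outputs_real, disc_outputs_fake):
        # L_D: push D_j(x) toward 1 and D_j(x_hat) toward 0
        loss = 0.0
        for d_real, d_fake in zip(disc_outputs_real, disc_outputs_fake):
            loss = loss + torch.mean((1.0 - d_real) ** 2) + torch.mean(d_fake ** 2)
        return loss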
Additionally, three more terms are added to the loss function. The first is a feature-matching loss [20], which measures the distance between the discriminator activations of a real signal and those of the generator output,

    L_fm(D_j, G) = Σ_x Σ_{i=1..R} (1/M_i) ||ψ_i(x) − ψ_i(x̂)||₁ ,        (2)

where ψ_i is an operator that extracts the activations of the i-th layer of the discriminator, M_i is the number of features in layer i, and R is the total number of layers in D_j.
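A sketch of the feature-matching term in Eq. (2), assuming feats_real and feats_fake are lists of per-layer activations ψ_i(x) and ψ_i(x̂) collected from one sub-discriminator D_j; the per-layer mean absolute error plays the role of the (1/M_i) normalization of the L1 norm:

    import torch

    def feature_matching_loss(feats_real, feats_fake):
        # sum over layers i = 1..R of (1/M_i) * ||psi_i(x) - psi_i(x_hat)||_1
        loss = 0.0
        for psi_real, psi_fake in zip(feats_real, feats_fake):
            loss = loss + torch.mean(torch.abs(psi_real - psi_fake))
        return loss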
…a sequence of samples s = (s_1, . . . , s_T), where s_t ∈ S for all 1 ≤ t ≤ T. As with raw audio signals, the length of the transcription varies from sample to sample, so the number of input samples in the sequence, T, is not fixed. A TTS training dataset is composed of n paired examples of text and audio, S = {(s_i, x_i)}_{i=1..n}, where s_i is the text transcription of audio x_i. Given the desired speaker embedding v, we generate a single-speaker training dataset

    S̃ = {(s_i, x̂_i) | x̂_i = G(E_w2v(x_i), F0(x_i), v)}_{i=1..n}.
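In other words, every multilingual utterance x_i is re-synthesized in the target speaker's voice while its transcription s_i is kept unchanged. The sketch below illustrates this data-augmentation step; it reuses the hypothetical encoder from the earlier sketch and assumes generator is the trained HiFiGAN-VC generator G and target_speaker_id indexes the target speaker's embedding v, so names and shapes are illustrative only.

    import torch

    @torch.no_grad()
    def build_single_speaker_dataset(pairs, encoder, generator, target_speaker_id):
        """pairs: iterable of (s_i, x_i) with text s_i and waveform tensor x_i."""
        converted = []
        for s_i, x_i in pairs:
            spk = torch.tensor([target_speaker_id])
            e = encoder(x_i.unsqueeze(0), spk)     # E(x_i) built with the target speaker's v
            x_hat = generator(e).squeeze(0)        # x_hat_i = G(E_w2v(x_i), F0(x_i), v)
            converted.append((s_i, x_hat))
        return converted                           # the single-speaker dataset S-tilde

The resulting text/audio pairs can then be fed to a standard single-speaker TTS training pipeline.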
…we use audio data from four monolingual speakers, native in German, Spanish, Italian, and French, respectively, and convert their voices to the identity of our target English speaker. We thus obtain data in the target speaker's voice, produced by voice conversion, in four languages: Spanish (8 hours), French (4 hours), German (7 hours), and Italian (9 hours), together with data from paid recordings in English (32 hours).
…ity. As you can see from Figure 3, for each language, our pro… (audio samples: multilingual_tts_data_aug/)
6. REFERENCES

[1] Alexis Conneau, Alexei Baevski, Ronan Collobert, Abdelrahman Mohamed, and Michael Auli. Unsupervised cross-lingual representation learning for speech recognition. arXiv preprint arXiv:2006.13979, 2020.

[2] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aaron Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis. In International Conference on Machine Learning, pages 2410–2419. PMLR, 2018.

[3] Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE, 2018.

[4] Naihan Li, Shujie Liu, Yanqing Liu, Sheng Zhao, and Ming Liu. Neural speech synthesis with transformer network. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6706–6713, 2019.

[5] Huaiping Ming, Yanfeng Lu, Zhengchen Zhang, and Minghui Dong. A light-weight method of building an LSTM-RNN-based bilingual TTS system. In 2017 International Conference on Asian Language Processing (IALP), pages 201–205. IEEE, 2017.

[6] Yu Zhang, Ron J Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, RJ Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. arXiv preprint arXiv:1907.04448, 2019.

[7] Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. arXiv preprint arXiv:2010.05646, 2020.

[8] Qing He, Zhiping Xiu, Thilo Koehler, and Jilong Wu. Multi-rate attention architecture for fast streamable text-to-speech spectrum modeling. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5689–5693. IEEE, 2021.

[9] Sai Sirisha Rallabandi and Suryakanth V Gangashetty. An approach to cross-lingual voice conversion. In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–7, 2019.

[10] P Vijayalakshmi, B Ramani, MP Actlin Jeeva, and T Nagarajan. A multilingual to polyglot speech synthesizer for Indian languages using a voice-converted polyglot speech corpus. Circuits, Systems, and Signal Processing, 37(5):2142–2163, 2018.

[11] B Ramani, MP Actlin Jeeva, P Vijayalakshmi, and T Nagarajan. Cross-lingual voice conversion-based polyglot speech synthesizer for Indian languages. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

[12] Qinghua Sun and Kenji Nagamatsu. Building multi lingual TTS using cross lingual voice conversion. arXiv preprint arXiv:2012.14039, 2020.

[13] Shengkui Zhao, Trung Hieu Nguyen, Hao Wang, and Bin Ma. Towards natural bilingual and code-switched speech synthesis based on mix of monolingual recordings and cross-lingual voice conversion. arXiv preprint arXiv:2010.08136, 2020.

[14] Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. arXiv preprint arXiv:2006.11477, 2020.

[15] Adam Polyak, Lior Wolf, and Yaniv Taigman. TTS skins: Speaker conversion via ASR. In INTERSPEECH, 2020.

[16] Adam Polyak et al. Unsupervised cross-domain singing voice conversion. In INTERSPEECH, 2020.

[17] Adam Polyak et al. High fidelity speech regeneration with application to speech enhancement. In ICASSP, 2021.

[18] Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. Speech resynthesis from discrete disentangled self-supervised representations. In INTERSPEECH, 2021.

[19] K. Kasi and S. A. Zahorian. Yet another algorithm for pitch tracking. In ICASSP, 2002.

[20] Anders Boesen Lindbo Larsen et al. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.

[21] Li Wan et al. Generalized end-to-end loss for speaker verification. In ICASSP, 2018.

[22] International Phonetic Association. The International Phonetic Alphabet, 2015. https://www.internationalphoneticassociation.org/sites/default/files/IPA_Kiel_2015.pdf.