
ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations
Neil Shah1,2*, Saiteja Kosgi1*, Vishal Tambrahalli1, Neha Sahipjohn1, Anil Kumar Nelakanti3, Vineet Gandhi1
1 Kohli Centre on Intelligent Systems, CVIT, IIIT Hyderabad
2 TCS Research, Pune
3 Amazon, Bengaluru, India
{neilkumar.shah,saiteja.k,vishal.tambrahalli,neha.s}@research.iiit.ac.in
[email protected], [email protected], [email protected]
* Authors contributed equally to this work.

Abstract

We present ParrotTTS, a modularized text-to-speech synthesis model leveraging disentangled self-supervised speech representations. It can train a multi-speaker variant effectively using transcripts from a single speaker. ParrotTTS adapts to a new language in a low-resource setup and generalizes to languages not seen while training the self-supervised backbone. Moreover, without training on bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving the speaker-specific characteristics, e.g., synthesizing fluent Hindi speech using a French speaker's voice and accent. We present extensive results in monolingual and multi-lingual scenarios. ParrotTTS outperforms state-of-the-art multi-lingual text-to-speech (TTS) models using only a fraction of the paired data required by the latter. Speech samples from ParrotTTS and code can be found at https://parrot-tts.github.io/tts/

1 Introduction

Vocal learning forms the first phase of infants starting to talk (Locke, 1996, 1994) by simply listening to sounds/speech. It is hypothesized (Kuhl and Meltzoff, 1996) that infants listening to ambient language store perceptually derived representations of the speech sounds they hear, which in turn serve as targets for the production of speech utterances. Interestingly, in this phase, the infant has no conception of text or linguistic rules, and speech alone is considered sufficient to influence speech production (Kuhl and Meltzoff, 1996), as it is for parrots (Locke, 1994).

Our proposed ParrotTTS model follows a similar learning process. Our idea mimics this two-step approach, with the first step learning to produce sounds capturing the whole gamut of phonetic variations. This is attained by learning quantized representations of sound units in a self-supervised manner using raw audio data. The second phase builds on top of the first by learning a content mapping from text to quantized speech representations (or embeddings). Only the latter step uses paired text-speech data. The two phases are analogous to first learning to talk, followed by learning to read.

Figure 1: (a) Traditional mel-based TTS and (b) the proposed TTS model.

Figure 1 illustrates ParrotTTS, contrasting it with the traditional mel-based TTS. The SSL module includes a speech-to-embedding (STE) encoder trained on a masked prediction task to learn an embedding representation of the input raw audio (Baevski et al., 2020; Hsu et al., 2021; Van Den Oord et al., 2017). An embedding-to-speech (ETS) decoder is independently trained to invert embeddings and synthesize audio waveforms, and is additionally conditioned on speaker identity. This learning to talk is the first of the two-step training pipeline. In the subsequent learning to read step, a separate text-to-embedding (TTE) encoder is trained to generate embeddings from text (or equivalent phonetic) inputs. This step requires labeled speech with aligned transcriptions.
ParrotTTS offers several advantages over the traditional mel-based neural TTS models (Ren et al., 2020; Wang et al., 2017). For instance, (a) quantized speech embeddings have lower variance than Mel frames, reducing the complexity of training the TTE; (b) direct waveform prediction bypasses potential vocoder generalization issues (Kim et al., 2021); and (c) the reduced complexity helps stabilize training of the TTE encoder for either the autoregressive or the non-autoregressive choice. For example, we observe at least eight-fold faster convergence in training iterations of our TTE module compared to that of (Ren et al., 2020) and (Wang et al., 2017).

While our work closely relates to recent works (Du et al., 2022; Wang et al., 2023; Siuzdak et al., 2022) utilizing self-supervised representations for TTS synthesis, our focus differs: we aim to achieve a unified multi-speaker, multi-lingual TTS system in low-resource scenarios (Xu et al., 2020). In our work, low-resource refers to the scarcity of paired TTS data. The key distinctions of our model compared to recent efforts are:

• Unlike contemporary efforts concentrated on large-scale training (Wang et al., 2023), we focus on low-resource adaptation.
• We employ disentangled self-supervised representations (Polyak et al., 2021) paired with an independently trained STE. This allows us to train multi-speaker TTS using paired data from a single speaker and still adapt it to novel voices with untranscribed speech alone. In contrast, prior efforts are either limited to single-speaker TTS (Du et al., 2022) or require paired text-audio data from multiple speakers during training (Siuzdak et al., 2022).
• We show that ParrotTTS can be extended to a new language with as little as five hours of paired data from a single speaker. The model generalizes to languages unseen during the learning of the self-supervised representation.
• Moreover, without training on any bilingual or parallel examples, ParrotTTS can transfer voices across languages while preserving the speaker-specific characteristics. We present extensive results on six languages in terms of speech naturalness and speaker similarity in parallel and cross-lingual synthesis.

Additionally, it is worth mentioning that certain methods (Wang et al., 2023) depend partially or entirely on Automatic Speech Recognition (ASR) to obtain paired data. These ASR models are trained using substantial amounts of supervised data, inaccessible in low-resource settings.

While architecturally similar to other SSL-based TTS (Wang et al., 2023; Siuzdak et al., 2022), our primary contribution lies in achieving promising outcomes in the low-resource scenario, where minimal paired data from a single speaker per language is accessible for TTS training.

2 Related work

2.1 Foundational Neural TTS models

Traditional neural TTS models encode text or phonetic inputs into hidden states, followed by a decoder that generates Mels from the hidden states. Predicted Mel frames contain all the necessary information to reconstruct speech (Griffin and Lim, 1984), and an independently trained vocoder (Oord et al., 2016; Kong et al., 2020) transforms them into time-domain waves. Mel-predicting decoders can be autoregressive/sequential (Wang et al., 2017; Valle et al., 2020; Shen et al., 2018) or non-autoregressive/parallel (Ren et al., 2019, 2020; Łańcucki, 2021). Non-autoregressive models additionally predict intermediate features like duration, pitch, and energy for each phoneme. They are faster at inference and robust to word skipping or repetition errors (Ren et al., 2020). Multi-speaker capabilities are often achieved by conditioning the decoder on speaker embeddings (one-hot embeddings or ones obtained from speaker verification networks (Jia et al., 2018; Sivaprasad et al., 2021)). Training multi-speaker TTS models requires paired text-audio data from multiple speakers. Methods relying on speaker embeddings can, in theory, perform zero-shot speaker adaptation; however, the rendered speech is known to be of poorer quality, especially for speakers not sufficiently represented in the train set (Tan et al., 2021).
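To make the above pipeline concrete, the following is a minimal PyTorch sketch of a mel-based TTS with speaker-embedding conditioning. It is illustrative only (not the paper's model, nor any specific baseline): it omits the attention or duration mechanism that aligns text to Mel frames, as well as the separately trained vocoder, and all layer sizes are assumptions.

```python
# Minimal sketch (assumptions, not the paper's model) of a mel-based TTS:
# text encoder -> mel decoder conditioned on a speaker embedding.
# A separately trained vocoder would map the predicted Mels to a waveform.
import torch
import torch.nn as nn

class MelTTS(nn.Module):
    def __init__(self, n_phonemes=100, n_speakers=108, d_model=256, n_mels=80):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)   # speaker conditioning
        self.decoder = nn.GRU(2 * d_model, d_model, batch_first=True)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, phonemes, speaker_id):
        h, _ = self.encoder(self.phoneme_emb(phonemes))          # (B, N, d)
        spk = self.speaker_emb(speaker_id)[:, None, :].expand_as(h)
        mel, _ = self.decoder(torch.cat([h, spk], dim=-1))        # (B, N, d)
        return self.mel_head(mel)                                 # (B, N, n_mels)

model = MelTTS()
mels = model(torch.randint(0, 100, (2, 17)), torch.tensor([3, 7]))
print(mels.shape)  # torch.Size([2, 17, 80]); a vocoder would turn Mels into audio
```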
2.2 Raw-audio for TTS

Unsupervised speech synthesis (Ni et al., 2022) does not require transcribed text-audio pairs for training. Such methods typically employ unsupervised ASR (Baevski et al., 2021; Liu et al., 2022a) to transcribe raw speech and generate pseudo labels. However, their performance tends to be bounded by the performance of the unsupervised ASR model, which still has a significant gap to close compared to supervised counterparts (Baevski et al., 2021). Switching to a multi-speaker setup further widens this quality gap (Liu et al., 2022b).
Some prior works have looked at adapting TTS to novel speakers using untranscribed audio (Yan et al., 2021; Luong and Yamagishi, 2019; Taigman et al., 2017). Unlike ours, their methods require a large amount of paired data from multiple speakers during initial training. Some of these (Luong and Yamagishi, 2019; Taigman et al., 2017) jointly train the TTS pipeline and the modules for speaker adaptation, but the convergence of such training is trickier. In contrast, ParrotTTS benefits from the disentanglement of linguistic content from speaker information, making adaptation easier and training stabler, as we observe in our experiments.

2.3 Self-supervised learning

Self-supervised learning (SSL) methods are becoming increasingly popular in speech processing due to their ability to utilize abundant unlabeled data. Techniques like masked prediction, temporally contrastive learning, and next-step prediction are commonly used to train SSL models. Popular models like Wav2vec2 (Baevski et al., 2020), VQ-VAE (Van Den Oord et al., 2017), AudioLM (Borsos et al., 2022) and HuBERT (Hsu et al., 2021) have been successfully deployed in tasks like ASR (Baevski et al., 2020), phoneme segmentation (Kreuk et al., 2020), spoken language modeling (Lakhotia et al., 2021), and speech resynthesis (Polyak et al., 2021).

Our work is related to recent efforts (Du et al., 2022; Wang et al., 2023; Siuzdak et al., 2022) that utilize self-supervised audio embeddings in text-to-speech synthesis. However, those of Du et al. (2022) and Siuzdak et al. (2022) require speaker-specific SSL embeddings, while we use generic HuBERT embeddings (Hsu et al., 2021; Lee et al., 2022) trained over multiple speakers.

2.4 Multi-lingual TTS

It is challenging to build a unified TTS model supporting multiple languages and speakers, even more so for cross-lingual synthesis, i.e., allowing multiple languages to be spoken in each of the speakers' voices. The primary challenge lies in acquiring paired data to train language-dependent components, which often include their own embeddings. The trick ParrotTTS employs to break this data dependence is to decouple acoustics from content handling; only the latter is language dependent and requires paired data, while the former is deferred to self-supervised models.

Initial attempts (Liu and Mak, 2019; Zhang et al., 2019) address these challenges by conditioning the decoder on language and speaker embeddings, but the results were subpar due to entanglement of the text representation with language/speaker information. Recent approaches (Zhang et al., 2019; Cho et al., 2022; Nekvinda and Dušek, 2020) addressed this issue by incorporating an explicit disentanglement loss term, using reversed gradients through a language or speaker classification branch.

Nekvinda and Dušek (2020) propose MetaTTS, which uses contextual parameter generation through language-specific convolutional text encoders. Cho et al. (2022) extend MetaTTS with a speaker regularization loss and investigate different input formats for text. Knowledge sharing (Prakash et al., 2019) and distillation (Xu et al., 2020) have also been explored for multi-lingual TTS. Recently, Wu et al. (2022) employed a data augmentation technique based on a cross-lingual voice conversion model trained with speaker-invariant features extracted from a speech representation.

Certain limitations still persist in existing approaches (Nekvinda and Dušek, 2020; Chen et al., 2019; Zhang et al., 2019; Zhang and Lin, 2020). For example, many of them rely on Tacotron (Wang et al., 2017) as their backbone, which is prone to word alignment and repetition errors. Prior multi-lingual TTS models typically support only 2-3 languages simultaneously or require extensive training data, as noted by Nekvinda and Dušek (2020). Additionally, they have not yet capitalized on self-supervised embeddings, and our efforts aim to address this gap.

3 ParrotTTS architecture

ParrotTTS has three modules: two encoders that map speech or text inputs to a common embedding space (referred to as STE and TTE, respectively) and a decoder (ETS) that renders the speech signal from these embeddings. Our speech encoder-decoder choices are borrowed from (Polyak et al., 2021). Our speech decoder ETS is a modified version of HiFiGAN (Kong et al., 2020). The text encoder TTE is an encoder-decoder architecture, and we experiment with both autoregressive (AR) and non-autoregressive (NAR) choices for it.
Figure 2: (a) ParrotTTS performs a two-stage training. In Stage 1, ETS is trained to synthesize speech from discrete units obtained through an independently trained STE module. In Stage 2, TTE learns to map a text sequence to the corresponding speech units obtained from STE. (b) and (c) illustrate the explored TTE architectures (autoregressive and non-autoregressive, respectively).

3.1 Speech encoder STE

The self-supervised HuBERT model we use for our STE is pre-trained on large raw audio data from English, on a BERT-like masked prediction task (Devlin et al., 2018), to learn a “combined acoustic and language model over the continuous inputs” of speech. It borrows the base architecture from Wav2vec 2.0 (Baevski et al., 2020), with convolutions on raw inputs followed by a few transformer layers; however, it replaces the contrastive loss with a BERT-like classification. The “noisy” classes for this classification are derived by clustering MFCC features of short speech signals. The encoder input is an audio signal X = (x1, ..., xT) sampled at a rate of 16 kHz. Let Er denote the raw-audio encoder, and let its output be

hr = (h1, ..., hT̂) := Er(X),

where T̂ = T/320 indicates downsampling, and each hi ∈ {1, ..., K}, with K being the number of clusters in HuBERT's clustering step, set to 100 in our experiments. For multi-lingual experiments, instead of HuBERT we utilize mHuBERT (Lee et al., 2022), which is trained on a multi-lingual corpus. We use K = 1000 for mHuBERT embeddings.
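As an illustration of the STE mapping described above, the sketch below quantizes HuBERT features into discrete units with a k-means codebook. It assumes torchaudio's pre-trained HUBERT_BASE bundle and a user-supplied codebook; the paper instead trains its own (m)HuBERT and clustering, so treat this as an approximation of the interface, not the released code.

```python
# Hedged sketch of the STE step: continuous HuBERT features are quantized
# against a k-means codebook, yielding one discrete unit per 20 ms frame
# (320 samples at 16 kHz, i.e. T/320 units for T samples).
import torch
import torchaudio

bundle = torchaudio.pipelines.HUBERT_BASE
hubert = bundle.get_model().eval()

def speech_to_units(wav_path: str, codebook: torch.Tensor) -> torch.Tensor:
    """codebook: (K, 768) k-means centroids; returns unit ids of shape (T/320,)."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.inference_mode():
        feats, _ = hubert.extract_features(wav)   # per-layer feature list
        layer = feats[-1].squeeze(0)              # (T/320, 768); a middle layer is also common
    dists = torch.cdist(layer, codebook)          # distance of each frame to each centroid
    return dists.argmin(dim=-1)                   # nearest-centroid id = discrete HuBERT unit

# units = speech_to_units("sample.wav", torch.load("kmeans_centroids.pt"))
```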
3.2 Speech decoder ETS

We adapt the HiFiGAN-v2 decoder for our ETS to decode from h = (hr, hs) to speech, where hs is the one-hot speaker embedding. It has a generator G and a discriminator D. G runs h through transposed convolutions for upsampling to recover the original sampling rate, followed by residual blocks with dilations that increase the receptive field, to synthesize the signal X̂ := G(h).

The discriminator distinguishes the synthesized X̂ from the original signal X and is evaluated by two sets of discriminator networks: multi-period discriminators operate on equally spaced samples, and multi-scale discriminators operate at different scales of the input signal. Overall, the model attempts to minimize D(X, X̂) over all its parameters to train ETS.
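The following is a heavily simplified sketch of such a unit-to-waveform generator with speaker conditioning. The module and its sizes are assumptions for illustration; the residual blocks with dilations and the multi-period/multi-scale discriminators of HiFiGAN are omitted.

```python
# Minimal sketch (assumptions, not the paper's exact ETS): unit and speaker
# embeddings are summed, then transposed convolutions upsample by a total
# factor of 320 so that one unit per 20 ms becomes 16 kHz audio.
import torch
import torch.nn as nn

class UnitVocoder(nn.Module):
    def __init__(self, n_units=100, n_speakers=108, d=256):
        super().__init__()
        self.unit_emb = nn.Embedding(n_units, d)
        self.spk_emb = nn.Embedding(n_speakers, d)
        ups, c = [], d
        for r, k in ((8, 16), (8, 16), (5, 15)):     # 8 * 8 * 5 = 320x upsampling
            ups += [nn.ConvTranspose1d(c, c // 2, k, stride=r, padding=(k - r) // 2),
                    nn.LeakyReLU(0.1)]
            c //= 2
        self.upsample = nn.Sequential(*ups)
        self.to_wav = nn.Conv1d(c, 1, kernel_size=7, padding=3)

    def forward(self, units, speaker_id):
        h = self.unit_emb(units) + self.spk_emb(speaker_id)[:, None, :]
        h = self.upsample(h.transpose(1, 2))          # (B, C, 320 * T_units)
        return torch.tanh(self.to_wav(h))             # (B, 1, samples)

g = UnitVocoder()
wav = g(torch.randint(0, 100, (1, 50)), torch.tensor([0]))
print(wav.shape)   # 50 units (~1 s of speech) -> torch.Size([1, 1, 16000])
```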
3.3 Text encoder TTE

The third module we train, TTE, is a text encoder that maps a phoneme/character sequence P = (p1, ..., pN) to an embedding sequence hp = (h1, ..., hN̂). We train a sequence-to-sequence architecture to achieve this: hp := Ep(P). Ep initially encodes P into a sequence of fixed-dimensional vectors (phoneme embeddings), conditioned upon which its sequence generator produces the variable-length hp. The embedding hp is intended to mimic hr := Er(X) extracted from the audio X corresponding to the text P; hence the requirement of transcribed data (X, P) to derive the target hr for training TTE by optimizing over the parameters of Ep.

One could model Ep to generate hp autoregressively, one step at a time, which we refer to as the AR-TTE model (Figure 2(b)). The input phoneme sequence is encoded through a feed-forward transformer block that stacks self-attention layers (Vaswani et al., 2017) and 1D convolutions, similar to FastSpeech2 (Ren et al., 2019). Decoding for hp uses a transformer module with self-attention and cross-attention. The future-masked self-attention attends to the ground truth during training and to previous decoder predictions at inference. Cross-attention attends to the phoneme encoding in both cases.

Alternatively, for a non-autoregressive choice of Ep, the model NAR-TTE determines the output length N̂, followed by a step that simultaneously predicts all N̂ entries of hp. Figure 2(c) depicts NAR-TTE, where the input phoneme sequence encoding is similar to that of AR-TTE. The duration predictor and length regulator modules are responsible for determining N̂, followed by the decoding step to generate hp.
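A condensed sketch of this NAR-TTE path is given below: a phoneme encoder, a duration predictor whose rounded outputs drive a length regulator, and a decoder head that classifies each expanded frame into one of the K HuBERT units (trained with cross-entropy). Dimensions, layer counts, and the batch-size-1 simplification are assumptions for illustration rather than the paper's configuration.

```python
# Illustrative NAR-TTE sketch (not the exact ParrotTTS implementation).
import torch
import torch.nn as nn

class NARTTE(nn.Module):
    def __init__(self, n_phonemes=100, n_units=100, d=256):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d, nhead=4, batch_first=True)
        self.phoneme_emb = nn.Embedding(n_phonemes, d)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.duration = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))
        self.decoder = nn.TransformerEncoder(layer, num_layers=4)
        self.unit_head = nn.Linear(d, n_units)        # logits over K discrete units

    def forward(self, phonemes, durations=None):
        h = self.encoder(self.phoneme_emb(phonemes))              # (B=1, N, d)
        log_dur = self.duration(h).squeeze(-1)                     # predicted log-durations
        if durations is None:                                      # inference path
            durations = torch.clamp(torch.round(torch.exp(log_dur)), min=1).long()
        # length regulator: repeat each phoneme encoding by its duration
        expanded = torch.repeat_interleave(h[0], durations[0], dim=0)[None]
        logits = self.unit_head(self.decoder(expanded))            # (1, N_hat, K)
        return logits, log_dur   # train with cross-entropy on units + a duration loss

tte = NARTTE()
logits, _ = tte(torch.randint(0, 100, (1, 12)))
units = logits.argmax(-1)        # predicted HuBERT unit sequence, input to the ETS
```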
In the multi-lingual scenario, we investigate both character and phoneme sequences for representing the input text. For the character representation, we extract the tokens using a dictionary created by iterating over the entire text corpus. In contrast, for the phoneme representation, we utilize an off-the-shelf phonemizer (version 3.2.1) (Bernard and Titeux, 2021) to extract phonemes belonging to the IPA vocabulary, which are common across languages.
languages. samples from the single speaker in LJSpeech data.

4 Experiments Text converted to phoneme sequence as de-


scribed by Sun et al. (2019) are inputs with hr
We perform experiments in monolingual and targets extracted from STE for training. Addition-
multi-lingual scenarios. Details of various Par- ally, NAR-TTE requires phonetic alignment to train
rotTTS models trained and of those each of them the duration predictor. We use Montreal forced-
is compared to is covered below. aligner (McAuliffe et al., 2017) to generate them
for its training. We use cross-entropy loss with the
4.1 ParrotTTS training 100 clusters derived from discretization codebook
Datasets (monolingual) For single language exper- of HuBERT units as classes.
iments, we use two public datasets. LJSpeech (Ito TTE training (multi-lingual). Focusing on low-
and Johnson, 2017) provides 24 hours high qual- resource setting, we use only 5 hours of paired data
ity transcribed data from a single speaker. Data for a single speaker in each language to train the
are split into two, with 512 samples set aside for TTE that totals to merely 30 hours of paired data
validation and the remaining available for model across six languages. We report the evaluation met-
training. VCTK (Veaux et al., 2017) with about rics for seen speakers where the model has seen
44 hours of transcribed speech from 108 different the speaker paired data and unseen speakers whose
speakers is used for the multi-speaker setup. It has paired data is not used to train the TTE. To evaluate
a minimum, average, and maximum of 7, 22.8, and the performance on various text representations,
31 minutes per speaker speech length, respectively. we train two variants of the TTE , the character
Datasets (multi-lingual) We collate our multi- TTE (CTE) and the phoneme TTE (PTE). CTE
lingual dataset using publicly available corpora uses character tokens across the languages to learn
containing samples from multiple speakers in six sound units while PTE uses phoneme tokens. Addi-
languages: (1) 80.76 hours of Hindi and Marathi tionally, we employ Deep Forced Aligner (in Indian
from (SYSPIN-IISC, 2022) from 2 speakers, re- Languages , SYSPIN) to align ground-truth speech
spectively; (2) 71.69 hours of German (GmbH., and input text representations to train the duration
2017) from 3 speakers; (3) 83.01 hours of Spanish predictor. Cross-entropy loss with 1000 clusters of
(GmbH., 2017) from 3 speakers; (4) 10.70 hours mHuBERT are used as classes to predict hp .
of French (Honnet et al., 2017) from 1 speaker; ETS training. We train a single-speaker ETS,
(5) 23.92 hours of English (Ito and Johnson, 2017) SS-ETS using only speech clips from LJSpeech
from 1 speaker. Overall the dataset comprises of since its training does not require transcriptions.
354.12 hours of paired TTS data from 12 speakers Similarly, our multi-speaker ETS, MS-ETS de-
across all six languages. We resample all speech coder model uses only raw audio of all speakers
samples to 16 kHz. from VCTK data (Veaux et al., 2017). So only em-
STE training. We use a 12 layer transformer beddings hr extracted from VCTK audio clips are
model for HuBERT training. It is trained using 960 used along with one-hot speaker vector hs . We em-
hour-long LibriSpeech corpus (Panayotov et al., phasize that VCTK data were used only in training
2015). The multi-lingual variant mHuBERT is the multi-speaker-ETS module, and the TTE has
trained using 13.5k hours of English, Spanish and not seen any from this set. For multi-lingual sce-
83
For the multi-lingual scenario, we train a multi-speaker ETS using speech-only data with 12 speakers from all six languages.
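To summarize how the trained modules compose at inference time, the following sketch chains the hypothetical NARTTE and UnitVocoder modules from the earlier sketches (assumed interfaces, not the released ParrotTTS code): text is phonemized, mapped to discrete units, and rendered in a chosen speaker's voice.

```python
# Hypothetical glue code for the two-stage inference:
# text -> phonemes -> discrete units (TTE) -> waveform for a speaker id (ETS).
import torch
import torchaudio
from phonemizer import phonemize

def synthesize(text: str, speaker_id: int, tte, ets, phoneme_to_id, out_path="out.wav"):
    phones = phonemize(text, language="en-us", backend="espeak", strip=True).split()
    ids = torch.tensor([[phoneme_to_id[p] for p in phones]])      # phoneme_to_id: assumed vocab
    with torch.inference_mode():
        logits, _ = tte(ids)                          # (1, N_hat, K) unit logits
        units = logits.argmax(-1)                     # discrete HuBERT units
        wav = ets(units, torch.tensor([speaker_id]))  # (1, 1, samples) at 16 kHz
    torchaudio.save(out_path, wav.squeeze(0), 16000)
    return wav

# wav = synthesize("learning to read", speaker_id=3, tte=tte, ets=g,
#                  phoneme_to_id=my_vocab)
```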
4.2 Comparison to prior art

Single-speaker TTS: We train Tacotron2 (Wang et al., 2017) and FastSpeech2 (Ren et al., 2020) using the ground-truth transcripts of LJSpeech, referred to as SS-Tacotron2 and SS-FastSpeech2. We additionally train an unsupervised version of FastSpeech2 by replacing the ground-truth transcripts with transcriptions obtained from an ASR model: FastSpeech2-SupASR uses a supervised ASR model (Radford et al., 2022) to generate the transcripts, while Tacotron2-UnsupASR (Ni et al., 2022) alternatively uses the unsupervised ASR Wav2vec-U 2.0 (Liu et al., 2022a). We further adapt WavThruVec (Siuzdak et al., 2022) to our setup and train a model (SS-WavThruVec) using intermediate embeddings extracted from Wav2Vec 2.0 (Baevski et al., 2020). Additionally, we apply a similar approach to the embeddings obtained from VQ-VAE (Van Den Oord et al., 2017) and term it SS-VQ-VAES. We compare against three variants of ParrotTTS:

1. AR-TTE_LJS+SS-ETS, an autoregressive TTE trained on full LJSpeech with the single-speaker ETS;

2. NAR-TTE_LJS+SS-ETS, which pairs the TTE with non-autoregressive decoding, trained on full LJSpeech with the single-speaker ETS; and

3. NAR-TTE_½LJS+SS-ETS, which uses the TTE with non-autoregressive decoding trained on half of LJSpeech with the single-speaker ETS.

Multi-speaker TTS: We compare against a fully supervised FastSpeech2 baseline trained on VCTK using paired data from all speakers, which we refer to as MS-FastSpeech2. For ParrotTTS we borrow the TTE module trained on LJSpeech and use the raw audio of VCTK to train the multi-speaker ETS module. We refer to this multi-speaker variant of our ParrotTTS model as NAR-TTE_LJS+MS-ETS; it uses non-autoregressive decoding.

For a fair comparison, we also curate a multi-speaker TTS baseline using a combination of single-speaker TTS and a voice cloning model: we use FastSpeech2 trained on LJSpeech with a state-of-the-art voice cloning model (Polyak et al., 2021) and refer to this model as VC-FastSpeech2. We also compare against a multi-speaker TTS trained on pseudo labels obtained from a supervised ASR, called MS-FastSpeech2-SupASR. Additionally, we report numbers for GT-Mel+Vocoder, which converts ground-truth Mels from the actual audio clip back to speech using a vocoder (Kong et al., 2020), for a perspective on the best achievable with ideal Mel frames.

Multi-lingual TTS: We compare against (a) FastSpeech2-MLS, a fully supervised FastSpeech2 model, and (b) the state-of-the-art meta-learning-based multi-lingual TTS model MetaTTS (Nekvinda and Dušek, 2020). Both these models are trained on the entirety of the train data (354 hours of transcribed speech). In contrast, the TTE training in the ParrotTTS model (our sole module that needs paired data) uses only 1/12th of this, i.e., a total of 30 hours of paired text-speech (5 hours per language). The remaining data is used for evaluation purposes, serving as the test set. We refer to this model as NAR-TTE_(1/12)MLS+ML-ETS. We also compare character (CTE) and phoneme (PTE) tokenization for encoding text in this setting.

4.3 Evaluation metrics

We evaluate the intelligibility of the various models using Word Error Rate (WER) with the pre-trained Whisper small model (Radford et al., 2022). We validate speaker adaptability using the Equal Error Rate (EER) from a pre-trained speaker verification network proposed in (Desplanques et al., 2020) and trained on VoxCeleb2 (Chung et al., 2018). The WER and EER metrics are computed on the entire validation set. We perform subjective evaluations using the Mean Opinion Score (MOS), with five native speakers per language rating samples synthesized by the different models, where five sentences from the test set are randomly selected for evaluation.
5 Results

5.1 Single-speaker TTS

Naturalness and intelligibility. As shown in Table 1, ParrotTTS is competitive with the state of the art in the single-speaker setting. In the autoregressive case, our AR-TTE_LJS+SS-ETS has a statistically insignificant drop (of about 0.05 units) on the MOS scale relative to the Tacotron2 baseline. The non-autoregressive case shows a similar result (a 0.01 drop in MOS) for our NAR-TTE_LJS+SS-ETS model relative to FastSpeech2. This empirically establishes that the naturalness of the speech rendered by ParrotTTS is on par with the currently established methods.
The WER scores show a similar trend, with a statistically insignificant drop (of under 0.2 pp¹) between the autoregressive and non-autoregressive model classes. The performance of SS-WavThruVec and SS-VQ-VAES is lower in both naturalness and intelligibility, indicating that using Wav2Vec 2.0 and VQ-VAE embeddings results in a decrease in performance.

¹ Percentage points abbreviated as pp.

Group            Model                  MOS ↑   WER ↓
Traditional TTS  SS-FastSpeech2         3.87    4.52
Traditional TTS  SS-Tacotron2           3.90    4.59
Traditional TTS  FastSpeech2-SupASR     3.78    4.72
Traditional TTS  Tacotron2-UnsupASR     3.50    11.3
WavThruVec       SS-WavThruVec          3.57    6.27
VQ-VAE           SS-VQ-VAES             3.12    21.78
ParrotTTS        AR-TTE_LJS+SS-ETS      3.85    4.80
ParrotTTS        NAR-TTE_LJS+SS-ETS     3.86    4.58
ParrotTTS        NAR-TTE_½LJS+SS-ETS    3.81    6.14

Table 1: Subjective and objective comparison of TTS models in the single-speaker setting.

Supervision and data efficiency. In studying how the degree of supervision affects TTS speech quality, we see a clear drop of 0.28 MOS units in moving from the FastSpeech2-SupASR model, which employs supervised ASR for transcriptions, to the Tacotron2-UnsupASR model using unsupervised ASR. Despite some modeling variations, this is generally indicative of the importance of clean transcriptions for TTS output quality, given that all other models are within 0.05 MOS units of each other.

The data requirement for TTS supervision needs to be understood in light of this impact on output quality, and we show how ParrotTTS helps cut down on this dependence. TTE is the only module that needs transcriptions to train, and we show that, by reducing the size of the train set by half in NAR-TTE_½LJS+SS-ETS, the MOS is still comparable to that of the model trained on all the data, NAR-TTE_LJS+SS-ETS (with only about a 0.04-unit MOS drop). Finally, the MOS numbers of FastSpeech2-SupASR need to be read with some caution, since the supervised ASR model used, Whisper, is itself trained with plenty of transcriptions (spanning over 600k hours) from the web, including human- and machine-transcribed data, achieving very low WERs on various public test sets. So the machine transcriptions used in FastSpeech2-SupASR are indeed close to ground truth.

5.2 Multi-speaker TTS

Naturalness and intelligibility. Table 2 summarizes results from our multi-speaker experiments. NAR-TTE_LJS+MS-ETS clearly outperforms all other models, ranking only below GT-Mel+Vocoder, which re-synthesizes from ground-truth Mels. Interestingly, ParrotTTS fares even better than MS-FastSpeech2, which is, in turn, better than the other models that ignore transcripts at training time, namely MS-FastSpeech2-SupASR and VC-FastSpeech2. On the WER metric for intelligibility, ParrotTTS is about 1 pp behind the supervised MS-FastSpeech2 but fares better than the other two models that discard VCTK transcripts for training. The WavThruVec-MS model leveraging Wav2Vec 2.0 embeddings has a noticeable quality drop in the multi-speaker setting, with the lowest MOS.

Model                   VCTK transcripts   MOS ↑   WER ↓   EER ↓
GT-Mel+Vocoder          Yes                4.12    2.25    2.12
MS-FastSpeech2          Yes                3.62    5.32    3.21
MS-FastSpeech2-SupASR   No                 3.58    6.65    3.85
VC-FastSpeech2          No                 3.41    7.44    8.18
WavThruVec-MS           No                 3.17    6.79    5.08
NAR-TTE_LJS+MS-ETS      No                 3.78    6.53    4.38

Table 2: Comparison of the multi-speaker TTS models on the VCTK dataset. Column 2 indicates whether the corresponding method uses VCTK transcripts during training.
Speaker adaptability. VC-FastSpeech2 is the closest in terms of experimental setup, since it is limited to transcriptions from LJSpeech for training, similar to ours, with VCTK used only for adaptation. In this case, the EER of NAR-TTE_LJS+MS-ETS is about twice as good as that of VC-FastSpeech2. Improvements are visible when VCTK transcripts are part of the training data, but they remain within 1 pp relative to ParrotTTS, while GT-Mel+Vocoder continues to dominate the scoreboard, leaving room for further improvement.

5.3 Multi-lingual TTS

The results from our multi-lingual experiments are in Tables 3, 4, 5, and 6. It is notable that speech rendered by ParrotTTS has superior naturalness compared to baselines that are trained with twelve times more paired samples, stressing its viability for low-resource languages. Further, the naturalness also changes with the text tokenization method: choosing character tokens for the Indic languages outperformed phoneme tokens, while the opposite held for the European languages. ParrotTTS with the best-performing tokenizer in each language was superior to FastSpeech2-MLS and MetaTTS for both seen speakers (Table 3) and unseen speakers (Table 4). It is interesting to note that some scores for ParrotTTS were better than the ground truth; this is possibly due to noise in the original samples that was suppressed by the HuBERT embeddings, which are known to discard ambient information.

Language   GT            CTE (Ours)    PTE (Ours)    FS2-MLS       MetaTTS
Hindi      3.78 ± 0.14   3.33 ± 0.19   3.22 ± 0.15   3.33 ± 0.12   2.12 ± 0.12
Marathi    4.81 ± 0.07   3.78 ± 0.12   3.04 ± 0.19   3.59 ± 0.15   2.13 ± 0.15
German     3.54 ± 0.20   3.33 ± 0.19   3.58 ± 0.12   3.21 ± 0.16   1.8 ± 0.15
French     3.83 ± 0.19   2.23 ± 0.14   4.17 ± 0.19   3.50 ± 0.16   1.7 ± 0.16
English    4.20 ± 0.12   3.11 ± 0.11   3.50 ± 0.10   2.50 ± 0.18   1.6 ± 0.17
Spanish    3.67 ± 0.12   3.5 ± 0.21    3.67 ± 0.20   2.50 ± 0.21   2.1 ± 0.15

Table 3: Comparison of naturalness MOS on seen speakers with FastSpeech2-MLS (FS2-MLS) and the MetaTTS model.

Language   GT            CTE (Ours)    PTE (Ours)    FS2-MLS       MetaTTS
Hindi      4.22 ± 0.18   3.28 ± 0.19   3.05 ± 0.20   3.22 ± 0.21   2.02 ± 0.18
Marathi    4.48 ± 0.13   3.63 ± 0.18   3.11 ± 0.18   3.15 ± 0.19   1.91 ± 0.19
German     3.17 ± 0.22   2.72 ± 0.23   3.55 ± 0.20   2.05 ± 0.22   1.8 ± 0.17
Spanish    3.67 ± 0.19   3.17 ± 0.17   3.33 ± 0.18   3.17 ± 0.19   1.3 ± 0.16

Table 4: Comparison of naturalness MOS on unseen speakers with FastSpeech2-MLS (FS2-MLS) and the MetaTTS model.

Speaker similarity. Results in Table 5 consistently demonstrate the superiority of ParrotTTS over FastSpeech2-MLS and MetaTTS, indicating its effectiveness in separating speaker and content information. This is attributed to the decoder being conditioned solely on the speaker ID while sharing the acoustic space across all languages.

Cross-lingual synthesis. We also assess the model's performance in synthesizing samples of a speaker in a language different from their native language. Table 6 presents these results, comparing naturalness MOS in a cross-lingual setting. The first column lists a pair of languages, of which the first is the speaker's native language while the second is the language of the text that is rendered. ParrotTTS achieved higher MOS, demonstrating strong decoupling of the content from the speaker characteristics that are controlled in the decoder. Further, more than 90% of the participants were able to discern the nativity of the synthesized speech.
Language   Our model     FS2-MLS       MetaTTS
Hindi      4.29 ± 0.18   3.92 ± 0.21   2.23 ± 0.19
Marathi    4.21 ± 0.16   3.83 ± 0.08   2.12 ± 0.16
German     4.09 ± 0.11   3.25 ± 0.14   2.05 ± 0.14
French     3.87 ± 0.20   3.50 ± 0.19   2.24 ± 0.17
English    3.94 ± 0.18   3.00 ± 0.19   2.32 ± 0.19
Spanish    4.33 ± 0.17   3.50 ± 0.19   2.0 ± 0.18

Table 5: Comparison of speaker similarity MOS with FastSpeech2-MLS (FS2-MLS) and the MetaTTS model.

Speaker-Text      Our model     FS2-MLS       MetaTTS
Hindi-Spanish     3.87 ± 0.22   3.25 ± 0.19   1.26 ± 0.15
Marathi-English   3.63 ± 0.21   3.5 ± 0.22    1.23 ± 0.19
French-Hindi      4.07 ± 0.12   2.71 ± 0.21   1.23 ± 0.16
Spanish-German    4.14 ± 0.20   2.29 ± 0.21   1.45 ± 0.19
English-German    3.57 ± 0.15   2.43 ± 0.18   1.56 ± 0.16
English-Hindi     3.57 ± 0.19   2.57 ± 0.18   1.23 ± 0.19
French-German     3.93 ± 0.17   2.71 ± 0.18   1.18 ± 0.17
Spanish-French    3.71 ± 0.18   2.57 ± 0.17   1.4 ± 0.16
Hindi-Marathi     4.13 ± 0.21   3.25 ± 0.19   1.3 ± 0.18
Marathi-French    2.87 ± 0.19   2.75 ± 0.18   1.25 ± 0.19

Table 6: Comparison of naturalness MOS for cross-lingual speech synthesis with FastSpeech2-MLS (FS2-MLS) and the MetaTTS model.

6 Conclusion

We investigate a data-efficient ParrotTTS model that leverages audio pre-training from self-supervised models and ties it to separately trained speech decoding and text encoding modules. We evaluate this architecture in various settings. The quality of rendered speech with as little as five hours of paired data per language is on par with or superior to competitive baselines. This is the key result from our experiments, and we believe it will help scale TTS training easily to new languages by bringing low-resource ones into the same quality range as the resource-rich ones. Moreover, we have released an open-source, multi-lingual TTS model to enable the wider application of our findings to resource-scarce and less privileged languages.

7 Limitations and Future Work

The mHuBERT self-supervised representation utilized in this study may not accurately reproduce the pronunciation of certain words native to Indian languages, given its pre-training exclusively on Spanish, French, and English. To address this limitation, our future work will focus on fine-tuning the mHuBERT model to encompass a more comprehensive set of sound units native to South Asian languages and potentially develop a universal representation of sound units.

An unexplored aspect of our research is the examination of emotive speech and controllable generation. HuBERT embeddings, as is known, lack prosody information, creating a challenge in incorporating emotional nuances into speech. In forthcoming research, we intend to address this by concatenating emotive embeddings, enabling the synthesis of speech with diverse emotions and prosody. Additionally, the NAR model's duration predictor may exhibit a bias toward the style of a single seen speaker. Our subsequent research will explore methods to achieve speaker-adaptive duration prediction and introduce controls to influence duration prediction in the synthesis process.

8 Ethical Considerations

Our research is grounded in ethical considerations. We recognize the potential of text-to-speech synthesis in various domains, such as accessibility, human-computer interaction, telecommunications, and education. However, we acknowledge the risk of misuse, particularly with regard to unethical cloning and the creation of false audio recordings. Our experiments strictly use publicly available datasets, and our method does not aim to synthesize someone's voice without their consent. We are mindful of the negative consequences associated with these actions. While the benefits currently outweigh the concerns, we strongly advocate for the research community to actively explore methods for detecting and preventing misuse.

It is important to note that our approach is trained on a limited set of languages and has not been validated on other languages or on individuals with speech impediments. Therefore, the dataset and results may not be representative of the entire population. A comprehensive understanding of this issue necessitates further studies in conjunction with linguistic and socio-cultural insights.
9 Acknowledgments

We express our gratitude to the reviewers for their dedicated time and thoughtful assessment of our manuscript. We would like to specifically acknowledge Mr. Niranjan Pedanekar from Sony Research, India, for his constructive comments and insightful suggestions, which played a key role in refining the overall quality of our work.

References

Alexei Baevski, Wei-Ning Hsu, Alexis Conneau, and Michael Auli. 2021. Unsupervised speech recognition. Advances in Neural Information Processing Systems, 34:27826–27839.

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in Neural Information Processing Systems, 33:12449–12460.

Mathieu Bernard and Hadrien Titeux. 2021. Phonemizer: Text to phones transcription for multiple languages in Python. Journal of Open Source Software, 6(68):3958.

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. 2022. AudioLM: A language modeling approach to audio generation. arXiv preprint arXiv:2209.03143.

Mengnan Chen, Minchuan Chen, Shuang Liang, Jun Ma, Lei Chen, Shaojun Wang, and Jing Xiao. 2019. Cross-lingual, multi-speaker text-to-speech synthesis using neural speaker embedding. In Interspeech, pages 2105–2109.

Hyunjae Cho, Wonbin Jung, Junhyeok Lee, and Sang Hoon Woo. 2022. SANE-TTS: Stable and natural end-to-end multilingual text-to-speech. In Proc. Interspeech 2022, pages 1–5.

Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. 2018. VoxCeleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622.

Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. 2020. ECAPA-TDNN: Emphasized channel attention, propagation and aggregation in TDNN based speaker verification. arXiv preprint arXiv:2005.07143.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Chenpeng Du, Yiwei Guo, Xie Chen, and Kai Yu. 2022. VQTTS: High-fidelity text-to-speech synthesis with self-supervised VQ acoustic feature. In Proc. Interspeech 2022, pages 1596–1600.

Munich Artificial Intelligence Laboratories GmbH. 2017. The M-AILABS speech dataset. https://github.com/imdatsolak/m-ailabs-dataset.

Daniel Griffin and Jae Lim. 1984. Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(2):236–243.

Pierre-Edouard Honnet, Alexandros Lazaridis, Philip N. Garner, and Junichi Yamagishi. 2017. The SIWIS French speech synthesis database: Design and recording of a high quality French database for speech synthesis. Technical report, Idiap.

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.

Synthesizing Speech in Indian Languages (SYSPIN). 2017. Deep Forced Aligner. https://github.com/bloodraven66/DeepForcedAligner.

Keith Ito and Linda Johnson. 2017. The LJ Speech dataset. https://keithito.com/LJ-Speech-Dataset/.

Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. 2018. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in Neural Information Processing Systems, 31.

Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pages 5530–5540. PMLR.

Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022–17033.

Felix Kreuk, Joseph Keshet, and Yossi Adi. 2020. Self-supervised contrastive learning for unsupervised phoneme segmentation. In Proc. Interspeech 2020, pages 3700–3704.

Patricia K. Kuhl and Andrew N. Meltzoff. 1996. Infant vocalizations in response to speech: Vocal imitation and developmental change. The Journal of the Acoustical Society of America, 100(4):2425–2438.

Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, et al. 2021. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354.

Adrian Łańcucki. 2021. FastPitch: Parallel text-to-speech with pitch prediction. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6588–6592. IEEE.

Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, and Wei-Ning Hsu. 2022. Textless speech-to-speech translation on real data. In NAACL-HLT.

Alexander H. Liu, Wei-Ning Hsu, Michael Auli, and Alexei Baevski. 2022a. Towards end-to-end unsupervised speech recognition. arXiv preprint arXiv:2204.02492.

Alexander H. Liu, Cheng-I Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, and James Glass. 2022b. Simple and effective unsupervised speech synthesis. In Proc. Interspeech 2022, pages 843–847.

Zhaoyu Liu and Brian Mak. 2019. Cross-lingual multi-speaker text-to-speech synthesis for voice cloning without using parallel corpus for unseen speakers. arXiv preprint arXiv:1911.11601.

John L. Locke. 1994. Phases in the child's development of language. American Scientist, 82(5):436–445.

John L. Locke. 1996. Why do infants begin to talk? Language as an unintended consequence. Journal of Child Language, 23(2):251–268.

Hieu-Thi Luong and Junichi Yamagishi. 2019. A unified speaker adaptation method for speech synthesis using transcribed and untranscribed speech with backpropagation. arXiv preprint arXiv:1906.07414.

Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. 2017. Montreal Forced Aligner: Trainable text-speech alignment using Kaldi. In Interspeech, volume 2017, pages 498–502.

Tomáš Nekvinda and Ondřej Dušek. 2020. One model, many languages: Meta-learning for multilingual text-to-speech. arXiv preprint arXiv:2008.00768.

Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, and Mark Hasegawa-Johnson. 2022. Unsupervised text-to-speech synthesis by unsupervised automatic speech recognition. arXiv preprint arXiv:2203.15796.

Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. WaveNet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.

Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.

Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. Speech resynthesis from discrete disentangled self-supervised representations. In Proc. Interspeech 2021, pages 3615–3619.

Anusha Prakash, A. Leela Thomas, S. Umesh, and Hema A. Murthy. 2019. Building multilingual end-to-end speech synthesisers for Indian languages. In Proc. of the 10th ISCA Speech Synthesis Workshop (SSW'10), pages 194–199.

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.

Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. FastSpeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558.

Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. FastSpeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems, 32.

Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE.

Hubert Siuzdak, Piotr Dura, Pol van Rijn, and Nori Jacoby. 2022. WavThruVec: Latent speech representation as intermediate features for neural speech synthesis. In Proc. Interspeech 2022, pages 833–837.

Sarath Sivaprasad, Saiteja Kosgi, and Vineet Gandhi. 2021. Emotional prosody control for speech generation. In Proc. Interspeech 2021, pages 4653–4657.

Hao Sun, Xu Tan, Jun-Wei Gan, Hongzhi Liu, Sheng Zhao, Tao Qin, and Tie-Yan Liu. 2019. Token-level ensemble distillation for grapheme-to-phoneme conversion. In Proc. Interspeech 2019, pages 2115–2119.

SYSPIN-IISc. 2022. Text-to-speech synthesizer in nine Indian languages. https://syspin.iisc.ac.in/datasets.

Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara. 2018. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4784–4788. IEEE.

Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani. 2017. VoiceLoop: Voice fitting and synthesis via a phonological loop. arXiv preprint arXiv:1707.06588.

Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu. 2021. A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561.

Rafael Valle, Kevin Shih, Ryan Prenger, and Bryan Catanzaro. 2020. Flowtron: An autoregressive flow-based generative network for text-to-speech synthesis. arXiv preprint arXiv:2005.05957.

Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems, 30.

Christophe Veaux, Junichi Yamagishi, and Kirsten MacDonald. 2017. CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit.

Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. 2021. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In ACL.

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. 2023. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111.

Yuxuan Wang, RJ Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, et al. 2017. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135.

Jilong Wu, Adam Polyak, Yaniv Taigman, Jason Fong, Prabhav Agrawal, and Qing He. 2022. Multilingual text-to-speech training using cross language voice conversion and self-supervised learning of speech representations. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 8017–8021. IEEE.

Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, and Tie-Yan Liu. 2020. LRSpeech: Extremely low-resource speech synthesis and recognition. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2802–2812.

Yuzi Yan, Xu Tan, Bohan Li, Tao Qin, Sheng Zhao, Yuan Shen, and Tie-Yan Liu. 2021. AdaSpeech 2: Adaptive text to speech with untranscribed data. In ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6613–6617. IEEE.

Haitong Zhang and Yue Lin. 2020. Unsupervised learning for sequence-to-sequence text-to-speech for low-resource languages. In Proc. Interspeech 2020, pages 3161–3165.

Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, R.J. Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran. 2019. Learning to speak fluently in a foreign language: Multilingual speech synthesis and cross-language voice cloning. In Proc. Interspeech 2019, pages 2080–2084.

A Appendix

Figure 3: Evolution of the attention matrix with training steps for Tacotron2 and AR-TTE.

Figure 4: Attention loss plotted against training steps for Tacotron2 and AR-TTE.

A.1 Stabler training and faster inference

In Figure 3 and Figure 4, we compare the training profiles of Tacotron2 and AR-TTE, keeping the batch size the same. As visualized in Figure 3, the attention matrix in Tacotron2 takes about 20k iterations to stabilize with an anti-diagonal structure and predict a phoneme-aligned Mel sequence. AR-TTE, in contrast, is about ten times faster at predicting a discrete HuBERT unit sequence that aligns with the input phonemes, taking only about 2k iterations to arrive at a similar-looking attention plot. While the snapshots are illustrative, we use the guided-attention loss described by Tachibana et al. (2018) as a metric to quantify the evolution of the attention matrix through training steps. As shown in Figure 4, the loss dives down a lot sooner for ParrotTTS relative
90
to its Tacotron2 counterpart. In a similar comparison, we observe that NAR-TTE converges (in 20k steps) about eight times faster than FastSpeech2 (160k steps).
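For reference, a small sketch of the guided-attention loss used as the metric above follows the formulation of Tachibana et al. (2018); the attention-matrix shape convention is an assumption.

```python
# Guided-attention loss (Tachibana et al., 2018), used here only as a metric:
# it penalizes attention mass far from the diagonal of the (text x output)
# attention matrix A, so lower values indicate a cleaner monotonic alignment.
import torch

def guided_attention_loss(A: torch.Tensor, g: float = 0.2) -> torch.Tensor:
    """A: (N_text, T_out) attention weights from the AR decoder."""
    n_text, t_out = A.shape
    n = torch.arange(n_text).float()[:, None] / n_text    # normalized text position
    t = torch.arange(t_out).float()[None, :] / t_out      # normalized output position
    w = 1.0 - torch.exp(-((n - t) ** 2) / (2 * g ** 2))   # off-diagonal penalty
    return (A * w).mean()

# example: a perfectly diagonal attention matrix gives a low loss
A = torch.eye(50)
print(guided_attention_loss(A).item())
```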
We suppose that the faster convergence derives from the lower variance of the discrete embeddings in ParrotTTS, as opposed to the richness of Mels, which are complete with all acoustic variations, including speaker identity, prosody, etc. The output speech is independent of the inputs given the Mel spectrogram, unlike ParrotTTS embeddings, which further need cues like speaker identity in the later ETS module. We hypothesize that segregating content mapping away from learning acoustics like speaker identity helps improve training stability, convergence, and data efficiency for the TTE encoder.

The proposed NAR-TTE system also improves inference latency and memory footprint, which are crucial factors for real-world deployment. On an NVIDIA RTX 2080 Ti GPU, we observe that ParrotTTS serves 15% faster than FastSpeech2, reducing the average per-utterance inference time to 11 ms from 13 ms. Furthermore, the TTE module uses 17M parameters, in contrast to the 35M parameters of the Mel synthesizer module in FastSpeech2.
