ParrotTTS: Text-to-speech synthesis exploiting disentangled self-supervised representations
Neil Shah¹,²*, Saiteja Kosgi¹*, Vishal Tambrahalli¹, Neha Sahipjohn¹, Anil Kumar Nelakanti³, Vineet Gandhi¹
¹ Kohli Centre on Intelligent Systems, CVIT, IIIT Hyderabad
² TCS Research, Pune
³ Amazon, Bengaluru, India
{neilkumar.shah, saiteja.k, vishal.tambrahalli, neha.s}@research.iiit.ac.in
[email protected], [email protected], [email protected]
Figure 2: (a) ParrotTTS performs a two-stage training. In Stage 1, ETS is trained to synthesize speech from discrete units obtained through an independently trained STE module. In Stage 2, TTE learns to map a text sequence to the corresponding speech units obtained from STE. (b) and (c) illustrate the explored TTE architectures.
of speech. It borrows the base architecture from Wav2vec 2.0 (Baevski et al., 2020), with convolutions on raw inputs followed by a few transformer layers; however, it replaces the contrastive loss with a BERT-like classification. The "noisy" classes for this classification are derived by clustering MFCC features of short speech signals. The encoder input is an audio signal X = (x_1, ..., x_T) sampled at a rate of 16 kHz. Let E_r denote the raw-audio encoder, and let its output be

h_r = (h_1, ..., h_T̂) := E_r(X),

where T̂ = T/320 indicates downsampling and each h_i ∈ {1, ..., K}, with K being the number of clusters in HuBERT's clustering step, set to 100 in our experiments. For multi-lingual experiments, instead of using HuBERT, we utilize mHuBERT (Lee et al., 2022), which is trained on a multi-lingual corpus. We use K = 1000 for mHuBERT embeddings.
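For concreteness, the unit extraction step can be sketched as follows. This is an illustration rather than the authors' released pipeline: it assumes torchaudio's HuBERT-Base checkpoint, an arbitrarily chosen intermediate layer, and a k-means codebook with K = 100 centroids fitted offline and saved to a placeholder path.

```python
import torch
import torchaudio
import joblib  # assumption: a k-means codebook saved with joblib (e.g. an sklearn model)

# Load a mono 16 kHz waveform; HuBERT expects this sampling rate.
wav, sr = torchaudio.load("sample.wav")
if sr != 16_000:
    wav = torchaudio.functional.resample(wav, sr, 16_000)

# HuBERT-Base encoder (roughly E_r in the paper's notation).
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

with torch.inference_mode():
    # extract_features returns per-layer features of shape (batch, frames, 768);
    # one frame per 320 input samples (20 ms), i.e. T_hat = T / 320.
    features, _ = model.extract_features(wav)
    layer_feats = features[6]  # intermediate layer; the layer index is an assumption

# Assign each 768-d frame to one of K clusters (K = 100 here, 1000 for mHuBERT).
kmeans = joblib.load("hubert_km100.bin")  # placeholder path for a pre-fitted codebook
units = kmeans.predict(layer_feats.squeeze(0).numpy())
print(units[:20])  # discrete unit sequence h_r, e.g. [2, 2, 29, 29, ...]
```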
3.2 Speech decoder ETS
We adapt the HiFiGAN-v2 decoder for our ETS to decode from h = (h_r, h_s) to speech, where h_s is the one-hot speaker embedding. It has a generator G and a discriminator D. G runs h through transposed convolutions for upsampling to recover the original sampling rate, followed by residual blocks with dilations that increase the receptive field, to synthesize the signal X̂ := G(h).

The discriminator distinguishes the synthesized X̂ from the original signal X and is evaluated by two sets of discriminator networks: multi-period discriminators operate on equally spaced samples, and multi-scale discriminators operate at different scales of the input signal. Overall, the model attempts to minimize D(X, X̂) over all its parameters to train ETS.
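The shape of the decoder input h = (h_r, h_s) and the 320x upsampling can be made concrete with a toy generator. This is a simplified stand-in, not the HiFiGAN-v2 used in the paper; the channel widths, upsampling factors, and speaker-conditioning scheme below are illustrative assumptions (only their product of 320 is tied to the unit hop size).

```python
import torch
import torch.nn as nn

class UnitHiFiGANSketch(nn.Module):
    """Toy ETS generator: discrete units h_r plus one-hot speaker h_s -> waveform."""

    def __init__(self, num_units=100, num_speakers=10, dim=128):
        super().__init__()
        self.unit_emb = nn.Embedding(num_units, dim)   # lookup table for HuBERT units
        self.spk_proj = nn.Linear(num_speakers, dim)   # projects the one-hot speaker id
        # Transposed convolutions upsample 20 ms unit frames back to the audio rate;
        # 8 * 8 * 5 = 320x matches the unit hop size. Real HiFiGAN also interleaves
        # multi-receptive-field residual blocks with dilations after each upsample.
        ups, in_ch = [], dim
        for r in (8, 8, 5):
            k = 2 * r if r % 2 == 0 else 2 * r - 1
            ups += [nn.ConvTranspose1d(in_ch, in_ch // 2, k, stride=r, padding=(k - r) // 2),
                    nn.LeakyReLU(0.1)]
            in_ch //= 2
        self.ups = nn.Sequential(*ups)
        self.out = nn.Conv1d(in_ch, 1, kernel_size=7, padding=3)

    def forward(self, units, speaker_onehot):
        x = self.unit_emb(units)                            # (B, T_hat, dim)
        x = x + self.spk_proj(speaker_onehot).unsqueeze(1)  # add speaker conditioning to every frame
        x = x.transpose(1, 2)                               # (B, dim, T_hat) for the conv stack
        return torch.tanh(self.out(self.ups(x)))            # (B, 1, T_hat * 320)

units = torch.randint(0, 100, (1, 50))                          # 50 units ~ 1 second of speech
spk = nn.functional.one_hot(torch.tensor([3]), num_classes=10).float()
print(UnitHiFiGANSketch()(units, spk).shape)                    # torch.Size([1, 1, 16000])
```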
3.3 Text encoder TTE
The third module we train, TTE, is a text encoder that maps a phoneme/character sequence P = (p_1, ..., p_N) to an embedding sequence h_p = (h_1, ..., h_N̂). We train a sequence-to-sequence architecture to achieve this, h_p := E_p(P). E_p initially encodes P into a sequence of fixed-dimensional vectors (phoneme embeddings), conditioned upon which its sequence generator produces the variable-length h_p. The embedding h_p is intended to mimic h_r := E_r(X) extracted from the audio X corresponding to the text P; hence the requirement of transcribed data (X, P) to derive the target h_r for training TTE by optimizing over the parameters of E_p.

One could model E_p to generate h_p autoregressively, one step at a time, which we refer to as the AR-TTE model (Figure 2(b)). The input phoneme sequence is encoded through a feed-forward transformer block that stacks self-attention layers (Vaswani et al., 2017) and 1D convolutions, similar to FastSpeech2 (Ren et al., 2019). Decoding for h_p uses a transformer module with self-attention and cross-attention. Future-masked self-attention attends to the ground truth during training and to previous decoder predictions at inference. Cross-attention attends to the phoneme encoding in both cases.
Alternatively, for a non-autoregressive choice of E_p, the model NAR-TTE determines the output length N̂, followed by a step that simultaneously predicts all N̂ entries of h_p. Figure 2(c) depicts NAR-TTE, where the input phoneme sequence encoding is similar to that of AR-TTE. The duration predictor and length regulator modules are responsible for determining N̂, followed by the decoding step to generate h_p.
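The length-regulation step at the heart of NAR-TTE reduces to repeating each phoneme encoding by its predicted duration; a minimal sketch, assuming integer durations have already been produced by the duration predictor:

```python
import torch

def length_regulate(phoneme_enc: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme encoding by its predicted duration.

    phoneme_enc: (N, d) encoder outputs for N phonemes.
    durations:   (N,) integer number of unit frames per phoneme.
    Returns:     (N_hat, d) with N_hat = durations.sum().
    """
    return torch.repeat_interleave(phoneme_enc, durations, dim=0)

enc = torch.randn(4, 8)             # 4 phonemes, 8-dim encodings
dur = torch.tensor([2, 3, 1, 4])    # predicted frames per phoneme
print(length_regulate(enc, dur).shape)  # torch.Size([10, 8]) -> N_hat = 10 frames
```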
In the multi-lingual scenario, we investigate both character and phoneme sequences for representing the input text. For the character representation, we extract the tokens using a dictionary created by iterating over the entire text corpus. In contrast, for the phoneme representation, we utilize an off-the-shelf phonemizer (version 3.2.1) (Bernard and Titeux, 2021) to extract phonemes belonging to the IPA vocabulary, which are common across languages.
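A minimal example of the phoneme route using the same phonemizer package cited above (the espeak-ng backend must be installed separately; the separator choices are ours, not necessarily the paper's):

```python
from phonemizer import phonemize
from phonemizer.separator import Separator

text = ["Printing, in the only sense with which we are at present concerned."]

# IPA phonemes via the espeak-ng backend; one space between phones, '|' between words.
ipa = phonemize(
    text,
    language="en-us",
    backend="espeak",
    separator=Separator(phone=" ", word=" | "),
    strip=True,
)
print(ipa[0])
# The same call with language="hi", "mr", "de", "fr", or "es" covers the other
# languages, since the IPA symbol set is shared across them.
```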
… French data from the VoxPopuli unlabelled speech corpus (Lee et al., 2022; Wang et al., 2021). In both cases, the model splits each T-second-long audio into units of T/320 seconds and maps each of the obtained units to a 768-dimensional vector.

TTE training (monolingual). We use LJSpeech to train two different TTE encoder modules: TTE_LJS, which uses all the data from our LJSpeech train set, and a second, TTE_½LJS, with only half the data. This is used to understand the effect of training data size on TTS performance. All variants of TTE we experiment with are trained only on samples from the single speaker in LJSpeech data.
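To make the supervision requirement concrete, the following sketch assembles TTE training pairs of phonemized transcripts and STE unit sequences, with a simple halving of the file list for the TTE_½LJS variant. The helper functions and paths are placeholders standing in for the earlier sketches, not the authors' code.

```python
import random

def phonemize_text(text: str) -> list[str]:
    # Placeholder: in practice the phonemizer call from the previous sketch.
    return text.split()

def extract_units(wav_path: str) -> list[int]:
    # Placeholder: in practice the HuBERT + k-means step from the STE sketch.
    return []

def build_tte_pairs(metadata_path: str, fraction: float = 1.0, seed: int = 0):
    """Pair phonemized transcripts with discrete STE units for TTE training.

    metadata_path follows LJSpeech's metadata.csv format: id|raw text|normalized text.
    fraction=0.5 reproduces the half-data TTE_1/2LJS setting.
    """
    with open(metadata_path, encoding="utf-8") as f:
        rows = [line.rstrip("\n").split("|") for line in f if line.strip()]

    random.Random(seed).shuffle(rows)          # fixed seed so the half split is reproducible
    rows = rows[: int(len(rows) * fraction)]

    pairs = []
    for utt_id, _raw, normalized in rows:
        pairs.append((phonemize_text(normalized), extract_units(f"wavs/{utt_id}.wav")))
    return pairs

# pairs_full = build_tte_pairs("LJSpeech-1.1/metadata.csv", fraction=1.0)  # TTE_LJS
# pairs_half = build_tte_pairs("LJSpeech-1.1/metadata.csv", fraction=0.5)  # TTE_1/2LJS
```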
Table 1: Subjective and objective comparison of TTS models in the single speaker setting.
Table 2: Comparison of the multi-speaker TTS models on the VCTK dataset. Column 2 indicates if the corresponding method uses VCTK transcripts while training.
…similar trend with a statistically insignificant drop (of under 0.2 pp¹) among the autoregressive and non-autoregressive model classes. The performance of SS-WavThruVec and SS-VQ-VAES is lower in both naturalness and intelligibility, indicating that the utilization of Wav2Vec 2.0 and VQ-VAE embeddings results in a decrease in performance.

(¹ Percentage points abbreviated as pp.)
Supervision and data efficiency. In the study to understand how the degree of supervision affects TTS speech quality, we see a clear drop of 0.28 MOS units in moving from the FastSpeech2-SupASR model that employs supervised ASR for transcriptions to the Tacotron2-UnsupASR model using unsupervised ASR. Despite some modeling variations, this is generally indicative of the importance of clean transcriptions for TTS output quality, given that all other models are within 0.05 MOS units of each other.

The data requirement for TTS supervision needs to be understood in light of this impact on output quality, and we show how ParrotTTS helps cut down on this dependence. TTE is the only module that needs transcriptions to train, and we show that by reducing the size of the train set by half in NAR-TTE_½LJS+SS-ETS, the MOS is still comparable to that of the model trained on all data, NAR-TTE_LJS+SS-ETS (with only about a 0.04-unit MOS drop). Finally, the MOS numbers of FastSpeech2-SupASR need to be read with some caution, since the supervised ASR model used, Whisper, is itself trained with plenty of transcriptions (spanning over 600k hours) from the web, including human- and machine-transcribed data, achieving very low WERs on various public test sets. So, the machine transcriptions used in FastSpeech2-SupASR are indeed close to ground truth.

5.2 Multi-speaker TTS
Naturalness and intelligibility. Table 2 summarizes results from our multi-speaker experiments. NAR-TTE_LJS+MS-ETS clearly outperforms all other models, ranking only below GT-Mel+Vocoder, which re-synthesizes from ground-truth Mels. Interestingly, ParrotTTS fares even better than MS-FastSpeech2, which is, in turn, better than the other models that ignore transcripts at train time, namely MS-FastSpeech2-SupASR and VC-FastSpeech2. On the WER metric for intelligibility, ParrotTTS is about 1 pp behind supervised MS-FastSpeech2 but fares better than the other two models that discard VCTK transcripts for training. The WavThruVec-MS model leveraging Wav2Vec 2.0 embeddings has a noticeable quality drop in the multi-speaker setting, with the lowest MOS.
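For reference, intelligibility WER of the kind discussed above is typically computed by transcribing the synthesized audio with an off-the-shelf ASR model and scoring it against the input text. The sketch below illustrates that recipe with openai-whisper and jiwer; it is not the paper's exact evaluation script, and the file path and prompt are placeholders.

```python
import string
import jiwer
import whisper

asr = whisper.load_model("base.en")  # off-the-shelf ASR used only for scoring; size is arbitrary

def normalize(text: str) -> str:
    """Lower-case and strip punctuation so WER counts word errors, not formatting."""
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())

def intelligibility_wer(pairs):
    """pairs: list of (path_to_synthesized_wav, input_text) tuples."""
    refs, hyps = [], []
    for wav_path, text in pairs:
        hyps.append(normalize(asr.transcribe(wav_path)["text"]))
        refs.append(normalize(text))
    return jiwer.wer(refs, hyps)

# wer = intelligibility_wer([("synth/sample_001.wav", "Please call Stella.")])
# print(f"WER: {100 * wer:.1f}%")
```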
Language | GT | CTE (Ours) | PTE (Ours) | FS2-MLS | MetaTTS
Hindi | 3.78 ± 0.14 | 3.33 ± 0.19 | 3.22 ± 0.15 | 3.33 ± 0.12 | 2.12 ± 0.12
Marathi | 4.81 ± 0.07 | 3.78 ± 0.12 | 3.04 ± 0.19 | 3.59 ± 0.15 | 2.13 ± 0.15
German | 3.54 ± 0.20 | 3.33 ± 0.19 | 3.58 ± 0.12 | 3.21 ± 0.16 | 1.8 ± 0.15
French | 3.83 ± 0.19 | 2.23 ± 0.14 | 4.17 ± 0.19 | 3.50 ± 0.16 | 1.7 ± 0.16
English | 4.20 ± 0.12 | 3.11 ± 0.11 | 3.50 ± 0.10 | 2.50 ± 0.18 | 1.6 ± 0.17
Spanish | 3.67 ± 0.12 | 3.5 ± 0.21 | 3.67 ± 0.20 | 2.50 ± 0.21 | 2.1 ± 0.15

Table 3: Comparison of naturalness MOS on seen speakers with FastSpeech2-MLS (FS2-MLS) and the MetaTTS model.
Table 4: Comparison of naturalness MOS on unseen speakers with FastSpeech2-MLS (FS2-MLS) and the MetaTTS model.
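The MOS values in these tables are reported as a mean with a ± term. Assuming that term is a 95% confidence interval over listener ratings (our reading; it is not stated explicitly here), it can be reproduced from raw scores with a normal approximation:

```python
import math

def mos_with_ci(ratings, z: float = 1.96):
    """Mean opinion score with a normal-approximation 95% confidence interval."""
    n = len(ratings)
    mean = sum(ratings) / n
    var = sum((r - mean) ** 2 for r in ratings) / (n - 1)  # sample variance
    ci = z * math.sqrt(var / n)                            # z * standard error
    return mean, ci

scores = [4, 4, 5, 3, 4, 5, 4, 3, 4, 5]  # hypothetical 1-5 ratings from listeners
mean, ci = mos_with_ci(scores)
print(f"{mean:.2f} ± {ci:.2f}")
```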
Speaker adaptability. VC-FastSpeech2 is the closest in terms of experimental setup, since it is limited to transcriptions from LJSpeech for training, similar to ours, with VCTK used only for adaptation. In this case, the EER of NAR-TTE_LJS+MS-ETS is about twice as good as that of VC-FastSpeech2. However, improvements are visible when VCTK transcripts are part of the training data, but they remain within 1 pp relative to ParrotTTS, while GT-Mel+Vocoder continues to dominate the scoreboard, leaving room for further improvement.
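The speaker-similarity EER reported here is, in general, computed from verification trials: similarity scores between speaker embeddings of synthesized and reference utterances, labelled as same- or different-speaker pairs. The sketch below shows that generic computation; it is not the paper's exact pipeline, which relies on a pretrained speaker-verification model to produce the embeddings, and the trial data below is made up.

```python
import numpy as np
from sklearn.metrics import roc_curve

def equal_error_rate(labels, scores):
    """EER: the operating point where false-accept rate equals false-reject rate.

    labels: 1 for same-speaker trials, 0 for different-speaker trials.
    scores: similarity scores (e.g. cosine similarity of speaker embeddings).
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    fnr = 1 - tpr
    idx = np.nanargmin(np.abs(fnr - fpr))  # threshold where the two error rates cross
    return (fpr[idx] + fnr[idx]) / 2

# Hypothetical trials: higher score should mean "same speaker".
labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
scores = np.array([0.82, 0.77, 0.64, 0.41, 0.55, 0.30, 0.71, 0.62])
print(f"EER: {100 * equal_error_rate(labels, scores):.1f}%")
```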
5.3 Multi-lingual TTS
The results from our multi-lingual experiments are in Tables 3, 4, 5, and 6. It is notable that speech rendered by ParrotTTS has superior naturalness compared to baselines that are trained with twelve times more paired samples, stressing its viability for low-resource languages. Further, the naturalness also changes with the text tokenization method. Choosing character tokens for Indic languages outperformed phoneme tokens, while it was the opposite for the European languages. ParrotTTS with the best-performing tokeniser in each language was superior to FastSpeech2-MLS and MetaTTS for both seen speakers (Table 3) as well as unseen speakers (Table 4). It is interesting to note that scores for ParrotTTS were better than ground truth, and this is possibly due to noise in the original samples that was suppressed by HuBERT embeddings, which are known to discard ambient information.

Speaker similarity. Results in Table 5 consistently demonstrate the superiority of ParrotTTS over FastSpeech2-MLS and MetaTTS, indicating its effectiveness in separating speaker and content information. This is attributed to the decoder being conditioned solely on the speaker ID while sharing the acoustic space across all languages.

Cross-lingual synthesis. We also assess the model's performance in synthesizing samples of a speaker in a language different from their native language. Table 6 presents these results, comparing naturalness MOS in a cross-lingual setting. The first column lists a pair of languages, of which the first is the speaker's native language while the second is the language of the text that is rendered. ParrotTTS achieved higher MOS, demonstrating strong decoupling of content from speaker characteristics, which is controlled in the decoder. Further, more than 90% of the participants were able to discern the nativity of the synthesized speech.
Language | Our model | FS2-MLS | MetaTTS
Hindi | 4.29 ± 0.18 | 3.92 ± 0.21 | 2.23 ± 0.19
Marathi | 4.21 ± 0.16 | 3.83 ± 0.08 | 2.12 ± 0.16
German | 4.09 ± 0.11 | 3.25 ± 0.14 | 2.05 ± 0.14
French | 3.87 ± 0.20 | 3.50 ± 0.19 | 2.24 ± 0.17
English | 3.94 ± 0.18 | 3.00 ± 0.19 | 2.32 ± 0.19
Spanish | 4.33 ± 0.17 | 3.50 ± 0.19 | 2.0 ± 0.18

Table 5: Comparison of speaker similarity MOS with FastSpeech2-MLS (FS2-MLS) and the MetaTTS model.
Table 6: Comparison of naturalness MOS for cross-lingual speech synthesis with FastSpeech2-MLS (FS2-MLS) and the MetaTTS model.
6 Conclusion
We investigate a data-efficient ParrotTTS model that leverages audio pre-training from self-supervised models and ties it to separately trained speech decoding and text encoding modules. We evaluate this architecture in various settings. The quality of rendered speech with as little as five hours of paired data per language is on par with or superior to competitive baselines. This is the key result from our experiments that we believe will help scale TTS training easily to new languages by bringing low-resource ones into the same quality range as the resource-rich ones. Moreover, we have released an open-source, multi-lingual TTS model to enable the wider application of our findings to resource-scarce and less privileged languages.

… to influence duration prediction in the synthesis process.
References

Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour. 2022. Audiolm: a language modeling approach to audio generation. arXiv preprint arXiv:2209.03143.
Mengnan Chen, Minchuan Chen, Shuang Liang, Jun Ma, Lei Chen, Shaojun Wang, and Jing Xiao. 2019. Cross-lingual, multi-speaker text-to-speech synthesis using neural speaker embedding. In Interspeech, pages 2105–2109.
Hyunjae Cho, Wonbin Jung, Junhyeok Lee, and Sang Hoon Woo. 2022. SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech. In Proc. Interspeech 2022, pages 1–5.
Joon Son Chung, Arsha Nagrani, and Andrew Zisserman. 2018. Voxceleb2: Deep speaker recognition. arXiv preprint arXiv:1806.05622.
Brecht Desplanques, Jenthe Thienpondt, and Kris Demuynck. 2020. Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Chenpeng Du, Yiwei Guo, Xie Chen, and Kai Yu. 2022. VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature. In Proc. Interspeech 2022, pages 1596–1600.
Ye Jia, Yu Zhang, Ron Weiss, Quan Wang, Jonathan Shen, Fei Ren, Patrick Nguyen, Ruoming Pang, Ignacio Lopez Moreno, Yonghui Wu, et al. 2018. Transfer learning from speaker verification to multispeaker text-to-speech synthesis. Advances in Neural Information Processing Systems, 31.
Jaehyeon Kim, Jungil Kong, and Juhee Son. 2021. Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech. In International Conference on Machine Learning, pages 5530–5540. PMLR.
Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. 2020. Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis. Advances in Neural Information Processing Systems, 33:17022–17033.
Felix Kreuk, Joseph Keshet, and Yossi Adi. 2020. Self-Supervised Contrastive Learning for Unsupervised Phoneme Segmentation. In Proc. Interspeech 2020, pages 3700–3704.
Patricia K Kuhl and Andrew N Meltzoff. 1996. Infant vocalizations in response to speech: Vocal imitation and developmental change. The Journal of the Acoustical Society of America, 100(4):2425–2438.
Kushal Lakhotia, Eugene Kharitonov, Wei-Ning Hsu, Yossi Adi, Adam Polyak, Benjamin Bolte, Tu-Anh Nguyen, Jade Copet, Alexei Baevski, Abdelrahman Mohamed, et al. 2021. On generative spoken language modeling from raw audio. Transactions of the Association for Computational Linguistics, 9:1336–1354.
Adrian Łańcucki. 2021. Fastpitch: Parallel text-to-speech with pitch prediction. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6588–6592. IEEE.
Ann Lee, Hongyu Gong, Paul-Ambroise Duquenne, Holger Schwenk, Peng-Jen Chen, Changhan Wang, Sravya Popuri, Yossi Adi, Juan Pino, Jiatao Gu, and Wei-Ning Hsu. 2022. Textless speech-to-speech translation on real data. In NAACL-HLT.
Alexander H Liu, Wei-Ning Hsu, Michael Auli, and Alexei Baevski. 2022a. Towards end-to-end unsupervised speech recognition. arXiv preprint arXiv:2204.02492.
Alexander H. Liu, Cheng-I Lai, Wei-Ning Hsu, Michael Auli, Alexei Baevski, and James Glass. 2022b. Simple and Effective Unsupervised Speech Synthesis. In Proc. Interspeech 2022, pages 843–847.
Zhaoyu Liu and Brian Mak. 2019. Cross-lingual multi-speaker text-to-speech synthesis for voice cloning without using parallel corpus for unseen speakers. arXiv preprint arXiv:1911.11601.
John L Locke. 1994. Phases in the child's development of language. American Scientist, 82(5):436–445.
John L Locke. 1996. Why do infants begin to talk? language as an unintended consequence. Journal of Child Language, 23(2):251–268.
Hieu-Thi Luong and Junichi Yamagishi. 2019. A unified speaker adaptation method for speech synthesis using transcribed and untranscribed speech with backpropagation. arXiv preprint arXiv:1906.07414.
Michael McAuliffe, Michaela Socolof, Sarah Mihuc, Michael Wagner, and Morgan Sonderegger. 2017. Montreal forced aligner: Trainable text-speech alignment using kaldi. In Interspeech, volume 2017, pages 498–502.
Tomáš Nekvinda and Ondřej Dušek. 2020. One model, many languages: Meta-learning for multilingual text-to-speech. arXiv preprint arXiv:2008.00768.
Junrui Ni, Liming Wang, Heting Gao, Kaizhi Qian, Yang Zhang, Shiyu Chang, and Mark Hasegawa-Johnson. 2022. Unsupervised text-to-speech synthesis by unsupervised automatic speech recognition. arXiv preprint arXiv:2203.15796.
Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. 2015. Librispeech: an asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210. IEEE.
Adam Polyak, Yossi Adi, Jade Copet, Eugene Kharitonov, Kushal Lakhotia, Wei-Ning Hsu, Abdelrahman Mohamed, and Emmanuel Dupoux. 2021. Speech Resynthesis from Discrete Disentangled Self-Supervised Representations. In Proc. Interspeech 2021, pages 3615–3619.
Anusha Prakash, A Leela Thomas, S Umesh, and Hema A Murthy. 2019. Building multilingual end-to-end speech synthesisers for indian languages. In Proc. of 10th ISCA Speech Synthesis Workshop (SSW'10), pages 194–199.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust speech recognition via large-scale weak supervision. arXiv preprint arXiv:2212.04356.
Yi Ren, Chenxu Hu, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. Fastspeech 2: Fast and high-quality end-to-end text to speech. arXiv preprint arXiv:2006.04558.
Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2019. Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems, 32.
Jonathan Shen, Ruoming Pang, Ron J Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al. 2018. Natural tts synthesis by conditioning wavenet on mel spectrogram predictions. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4779–4783. IEEE.
Hubert Siuzdak, Piotr Dura, Pol van Rijn, and Nori Jacoby. 2022. WavThruVec: Latent speech representation as intermediate features for neural speech synthesis. In Proc. Interspeech 2022, pages 833–837.
Sarath Sivaprasad, Saiteja Kosgi, and Vineet Gandhi. 2021. Emotional Prosody Control for Speech Generation. In Proc. Interspeech 2021, pages 4653–4657.
Hao Sun, Xu Tan, Jun-Wei Gan, Hongzhi Liu, Sheng Zhao, Tao Qin, and Tie-Yan Liu. 2019. Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion. In Proc. Interspeech 2019, pages 2115–2119.
SYSPIN-IISC. 2022. Text-to-speech synthesizer in nine indian languages. https://syspin.iisc.ac.in/datasets.
Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara. 2018. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 4784–4788. IEEE.
Yaniv Taigman, Lior Wolf, Adam Polyak, and Eliya Nachmani. 2017. Voiceloop: Voice fitting and synthesis via a phonological loop. arXiv preprint arXiv:1707.06588.
Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu. 2021. A survey on neural speech synthesis. arXiv preprint arXiv:2106.15561.
Rafael Valle, Kevin Shih, Ryan Prenger, and Bryan Catanzaro. 2020. Flowtron: an autoregressive flow-based generative network for text-to-speech synthesis. arXiv preprint arXiv:2005.05957.
Aaron Van Den Oord, Oriol Vinyals, et al. 2017. Neural discrete representation learning. Advances in Neural Information Processing Systems, 30.
Haitong Zhang and Yue Lin. 2020. Unsupervised Learning for Sequence-to-Sequence Text-to-Speech for Low-Resource Languages. In Proc. Interspeech 2020, pages 3161–3165.
Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Zhifeng Chen, R.J. Skerry-Ryan, Ye Jia, Andrew Rosenberg, and Bhuvana Ramabhadran. 2019. Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning. In Proc. Interspeech 2019, pages 2080–2084.