Voice Filter: Few-Shot Text-to-Speech Speaker Adaptation Using Voice Conversion as a Post-Processing Module
Adam Gabryś*, Goeric Huybrechts*, Manuel Sam Ribeiro*, Chung-Ming Chien†, Julian Roth*,
Giulia Comini*, Roberto Barra-Chicote*, Bartek Perz*, Jaime Lorenzo-Trueba*

*Alexa AI    †National Taiwan University (NTU)
Fig. 1. Synthetic parallel data preparation and training flowchart for the proposed Voice Filter.
Fig. 2. Proposed Voice Filter architecture: the TTS-predicted Mel-spectrogram (80 x T) and an up-sampled speaker embedding are processed by Conv1D + BatchNorm blocks (512 channels, kernel size 5), a concatenation, an LSTM (1024 channels) and a Dense layer (80 channels), producing the voice-converted Mel-spectrogram (80 x T); the model is trained with an L1 loss.

Fig. 3. Proposed Voice Filter inference flowchart: a phoneme sequence is synthesised by the single-speaker TTS, converted to the target speaker by the fine-tuned Voice Filter, and rendered to a waveform by the universal vocoder.
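For readers who prefer code to block diagrams, below is a minimal PyTorch sketch of a network consistent with the Fig. 2 description (Conv1D/BatchNorm blocks with 512 channels and kernel size 5, concatenation with an up-sampled speaker embedding, an LSTM with 1024 units and a Dense projection back to 80 Mel bins). The number of convolutional blocks, the concatenation point, the speaker-embedding dimensionality and the activations are assumptions, and any additional conditioning inputs (e.g. phonemes or f0) are omitted; this is not the authors' implementation.

```python
import torch
import torch.nn as nn


def _conv_block(c_in: int) -> nn.Sequential:
    """Conv1D (512 channels, kernel size 5) + BatchNorm block, as in Fig. 2."""
    return nn.Sequential(
        nn.Conv1d(c_in, 512, kernel_size=5, padding=2),
        nn.BatchNorm1d(512),
        nn.ReLU(),
    )


class VoiceFilterSketch(nn.Module):
    """Hypothetical layer stack inferred from Fig. 2 (not the authors' code)."""

    def __init__(self, n_mels: int = 80, spk_dim: int = 256):
        super().__init__()
        self.pre = nn.Sequential(_conv_block(n_mels), _conv_block(512))
        self.post = nn.Sequential(_conv_block(512 + spk_dim), _conv_block(512))
        self.lstm = nn.LSTM(512, 1024, batch_first=True)
        self.dense = nn.Linear(1024, n_mels)

    def forward(self, source_mel: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # source_mel: (B, 80, T) TTS-predicted Mel-spectrogram
        # spk_emb:    (B, spk_dim) utterance-level (or centroid) speaker embedding
        h = self.pre(source_mel)
        spk = spk_emb.unsqueeze(-1).expand(-1, -1, h.size(-1))  # up-sample over time
        h = self.post(torch.cat([h, spk], dim=1))
        h, _ = self.lstm(h.transpose(1, 2))                     # (B, T, 1024)
        return self.dense(h).transpose(1, 2)                    # (B, 80, T) converted Mel
```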
1. The background Voice Filter is trained in a one-to-many fashion for 1 million steps on the entire synthetic parallel multi-speaker corpus generated in the previous step. This model is capable of converting to any of the speakers seen during training, but it is not robust enough to generalize to unseen speakers without further processing.

2. We adapt the background model into the target Voice Filter by fine-tuning all of its parameters for 1000 steps on the target speaker's single minute of speech in a one-to-one fashion. We use the centroid of the target utterance-level speaker embeddings, as we observed that, in our few-shot scenario, fine-tuning on a constant speaker embedding rather than on variable utterance-level embeddings resulted in more stable models. We have not tested the impact on the quality of non-target speakers after fine-tuning, so we consider the resulting target Voice Filter to be speaker-dependent.

Both the background and target Voice Filter models are trained using the L1 spectral loss and the ADAM optimiser with default settings.
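To make the adaptation step concrete, here is a rough sketch of the fine-tuning loop described above: the background Voice Filter is adapted for a fixed number of steps on the target speaker's data, conditioned on the centroid of its utterance-level speaker embeddings, with an L1 spectral loss and Adam with default settings. The `loader` and `spk_encoder` objects are hypothetical placeholders, not part of the paper.

```python
import torch
import torch.nn.functional as F


def finetune_target_voice_filter(background_vf, loader, spk_encoder,
                                 steps: int = 1000, device: str = "cpu"):
    """Adapt the background (one-to-many) Voice Filter to one target speaker.

    `loader` is assumed to yield (source_mel, target_mel, waveform) tuples
    built from the ~1 minute of target speech; `spk_encoder` maps a waveform
    to an utterance-level speaker embedding."""
    vf = background_vf.to(device).train()
    optimizer = torch.optim.Adam(vf.parameters())  # default settings

    # Constant centroid embedding: per the paper, more stable in the
    # few-shot setting than variable utterance-level embeddings.
    with torch.no_grad():
        utt_embs = [spk_encoder(wav.to(device)) for _, _, wav in loader]
        centroid = torch.stack(utt_embs).mean(dim=0).squeeze()

    step = 0
    while step < steps:
        for source_mel, target_mel, _ in loader:
            source_mel, target_mel = source_mel.to(device), target_mel.to(device)
            pred_mel = vf(source_mel, centroid.expand(source_mel.size(0), -1))
            loss = F.l1_loss(pred_mel, target_mel)  # L1 spectral loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= steps:
                break
    return vf
```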
2.4. Model Inference

Inference with the complete model (Figure 3) requires us to run several models in succession:

1. Inference of the source Mel-spectrogram for the desired text and predicted durations with the single-speaker TTS model.
2. Estimation of f0 from the source Mel-spectrogram and re-normalization to the mean and variance of the target speaker.
3. Conversion of the source Mel-spectrogram to the target speaker by the fine-tuned Voice Filter.
4. Synthesis of the voice-converted Mel-spectrogram into a time-domain waveform via the vocoder.

At this moment, we do not adapt the speaking rate or phone durations to those of the target speaker, as they are not simple to estimate in extremely low-resource scenarios and lead to significant artefacts.
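The four inference stages can be strung together as a short pipeline. The sketch below assumes hypothetical `tts`, `extract_f0`, `voice_filter` and `vocoder` callables and pre-computed target-speaker f0 statistics; how exactly the re-normalised f0 contour is consumed by the Voice Filter is an assumption rather than something specified above.

```python
import numpy as np


def synthesize(text, tts, extract_f0, voice_filter, vocoder,
               target_f0_mean, target_f0_std, target_spk_emb):
    """Run the full Voice Filter inference chain sketched in Fig. 3 (illustrative only)."""
    # 1. Source Mel-spectrogram (with predicted durations) from the single-speaker TTS.
    source_mel = tts(text)

    # 2. Estimate f0 from the source Mel and re-normalise it to the target
    #    speaker's mean and variance (voiced frames only).
    f0 = extract_f0(source_mel)
    voiced = f0 > 0
    f0_norm = np.copy(f0)
    f0_norm[voiced] = ((f0[voiced] - f0[voiced].mean()) / (f0[voiced].std() + 1e-8)
                       * target_f0_std + target_f0_mean)

    # 3. Convert the source Mel to the target speaker with the fine-tuned
    #    Voice Filter (assumed here to take the f0 contour and embedding).
    converted_mel = voice_filter(source_mel, f0_norm, target_spk_emb)

    # 4. Vocode the voice-converted Mel-spectrogram to a time-domain waveform.
    return vocoder(converted_mel)
```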
3. EXPERIMENTAL SETUP

Models were assessed using both objective and perceptual metrics. For our evaluations we used 4 male and 4 female speakers with 50 test utterances per speaker, resulting in 400 prompts overall.

Signal quality is objectively measured using the conditional Fréchet Speech Distance (cFSD) [30]. Specifically, a pre-trained XLSR-53 [31] wav2vec 2.0 [32] model is used to generate the activation distributions for recordings and synthesised samples. The distributions are then compared with a Fréchet distance, which provides a measure of how close the generated speech is to an actual recording. To objectively estimate a speaker similarity metric, we used the mean cosine distance between speaker embeddings (CSED) of recordings and predicted samples.
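As an illustration of the two objective metrics, the sketch below fits Gaussians to two activation sets (assumed to have been extracted offline with the XLSR-53 wav2vec 2.0 model) and computes their Fréchet distance, and then computes a mean cosine distance between paired speaker embeddings. The conditioning and layer choices of the actual cFSD [30] are not reproduced here.

```python
import numpy as np
from scipy import linalg


def frechet_distance(acts_real: np.ndarray, acts_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two activation sets
    (rows = frames/utterances, columns = feature dimensions)."""
    mu_r, mu_g = acts_real.mean(axis=0), acts_gen.mean(axis=0)
    cov_r = np.cov(acts_real, rowvar=False)
    cov_g = np.cov(acts_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))


def mean_cosine_speaker_distance(emb_real: np.ndarray, emb_gen: np.ndarray) -> float:
    """CSED-style metric: mean cosine distance between paired speaker
    embeddings of recordings and the corresponding predicted samples."""
    a = emb_real / np.linalg.norm(emb_real, axis=1, keepdims=True)
    b = emb_gen / np.linalg.norm(emb_gen, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))
```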
MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) tests were used to perceptually assess naturalness, signal quality, speaker similarity and speaking style similarity. Samples from the systems being evaluated were presented to participants side by side, and participants were asked to score them on a scale from 0 (worst) to 100 (best) in terms of the metric being evaluated. We used the crowd-sourcing platform ClickWorker to have each test utterance assessed by a panel of 25 listeners. Target speaker recordings were always included as a hidden upper-anchor system, and we did not enforce a requirement for at least one system to be rated 100. We provided the listeners with a reference sample for both the speaker and style similarity evaluations. The lower-anchor for the speaker similarity evaluation was the voice-converted samples of the furthest same-gender speaker in the speaker embedding space. The lower-anchor for the style similarity evaluation was the un-filtered TTS system. Paired two-sided Student t-tests with Holm-Bonferroni correction were used to validate the statistical significance of the differences between two systems at a p-value threshold of 0.05.
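For reference, the following sketch illustrates this statistical analysis: paired two-sided t-tests of one system's MUSHRA scores against several reference systems, followed by a Holm-Bonferroni step-down correction at alpha = 0.05. How scores are aggregated across listeners before testing is an assumption on our part.

```python
import numpy as np
from scipy import stats


def holm_bonferroni_paired_tests(scores_a, scores_b_list, alpha: float = 0.05):
    """Paired two-sided t-tests of system A against several reference systems,
    with Holm-Bonferroni correction. `scores_a` and each entry of
    `scores_b_list` are per-utterance MUSHRA score arrays of equal length."""
    pvals = np.array([stats.ttest_rel(scores_a, b).pvalue for b in scores_b_list])
    order = np.argsort(pvals)
    significant = np.zeros(len(pvals), dtype=bool)
    for rank, idx in enumerate(order):
        # Holm step-down: compare the k-th smallest p-value to alpha / (m - k + 1).
        if pvals[idx] <= alpha / (len(pvals) - rank):
            significant[idx] = True
        else:
            break  # once one test fails, all larger p-values are non-significant
    return pvals, significant
```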
4. RESULTS

4.1. Extremely low-resource speech synthesis performance

We compared our system with two state-of-the-art multi-speaker technologies that have shown high performance for low-resource speech synthesis: the CopyCat (CC) [33] model with additional f0 conditioning³ and a duration-controllable multi-speaker TTS (MS-TTS) [24] model without the data augmentation component. Both models were trained on the same dataset and conditions described in Section 2.3, including fine-tuning for 1000 steps on the 1 minute of target speaker data.

Objective metrics and perceptual MUSHRA evaluation scores are reported in Tables 1 (rows 1-3) and 2, respectively. We observe a statistical preference in the MUSHRA evaluation for the proposed system in terms of all evaluated metrics. The objective metrics are aligned with these observations, indicating that our proposed method outperforms the other speaker adaptation techniques when using the same amount of data.

³ Preliminary results indicated that f0-conditioned CopyCat showed better signal quality stability and constituted a fairer comparison.
Table 1. Average objective metrics for all evaluated systems. Best numbers are highlighted in bold. TTS-DA was trained on 30 minutes of speech instead of 1.

System    Speaker sim. (CSED)    Signal quality (cFSD)
VF        0.192                  0.197
CC        0.198                  0.249
MS-TTS    0.207                  0.263
TTS-DA    0.205                  0.224

Table 2. Average MUSHRA results with 95% confidence intervals. Best scores with a statistically significant difference between Voice Filter (VF) and the reference systems are highlighted in bold (p < 0.05).

System        Sp. sim.        Style sim.      Nat.            Sig. Q.
Rec           79.43 ± 0.66    75.53 ± 0.73    78.72 ± 0.64    76.67 ± 0.75
VF            67.96 ± 0.86    64.40 ± 0.82    53.09 ± 0.81    56.28 ± 0.84
CC            66.57 ± 0.88    63.56 ± 0.84    52.08 ± 0.81    55.28 ± 0.85
MS-TTS        65.83 ± 0.91    62.10 ± 0.86    50.71 ± 0.81    54.24 ± 0.85
Lower-anchor  37.90 ± 1.00    37.96 ± 1.12    −               −
4.2. Ablation study on data quantity

In order to understand the impact of the extremely low-resource scenario, we trained the Voice Filter using 1, 5 and 25 minutes of target speaker data during fine-tuning. Objective metrics and perceptual MUSHRA evaluation scores are reported in Tables 3 and 4, respectively.

In the subjective evaluation, listeners did not perceive a statistically significant difference between the different data scenarios, although the objective metrics still show slight improvements with larger amounts of data. This hints that, while there may be room to improve the performance of the system, the nature of our Voice Filter does not benefit perceptually from richer data scenarios. With only 1 minute of target data, we are able to create high-quality samples.
Table 3. Average objective metrics for Voice Filter trained using varying amounts of data. Best results are highlighted in bold.

# min    Speaker sim. (CSED)    Signal quality (cFSD)
1 min    0.192                  0.197
5 min    0.183                  0.189
25 min   0.185                  0.176

Table 4. Average MUSHRA results with 95% confidence intervals for Voice Filter trained with varying amounts of data. No statistically significant differences (p < 0.05) between systems.

# min    Sp. sim.        Style sim.      Nat.            Sig. Q.
Rec      73.77 ± 0.99    72.06 ± 0.95    78.69 ± 0.70    73.94 ± 0.91
1 min    51.90 ± 0.99    54.67 ± 0.98    55.10 ± 0.84    53.94 ± 0.95
5 min    51.93 ± 1.01    55.36 ± 0.97    55.06 ± 0.83    53.83 ± 0.97
25 min   52.00 ± 1.01    54.75 ± 0.99    55.21 ± 0.82    53.63 ± 0.95

4.3. Comparison against a competitive TTS

Finally, we investigate the quality of the generated speech when compared to a TTS system trained on a larger amount of target recordings. For that, we compared Voice Filter against a validated low-resource TTS technology that was shown to be competitive when trained on 30 minutes of target speech (TTS-DA) [6]. It is worth noting that such a technology implicitly results in the TTS system estimating the phone durations for the target speaker, which is not the case for the proposed Voice Filter. Effectively, this means we are comparing the proposed Voice Filter against a system trained on 30 times more target data that has also been validated to be competitive against TTS voices trained on 5+ hours of target recordings.

Objective metrics and perceptual MUSHRA evaluation scores are reported in Tables 1 (rows 1 & 4) and 5, respectively. While we observe a 4% relative degradation in speaker similarity, the MUSHRA evaluations indicate no statistically significant difference between the systems in terms of signal quality, naturalness, and style similarity. On the other hand, in terms of objective metrics, we observe that speaker similarity and signal quality are better for our proposed method. Overall, the results show that our model is on par with TTS-DA, with only a slight human-perceived degradation in speaker similarity, despite the much smaller target training dataset being used.

Table 5. Average MUSHRA results. Best scores with a statistically significant difference between Voice Filter (VF) and the reference systems are highlighted in bold (p < 0.05).

System        Sp. sim.        Style sim.      Nat.            Sig. Q.
Rec           83.23 ± 0.66    79.82 ± 0.80    77.05 ± 0.56    76.75 ± 0.52
VF            70.38 ± 0.93    69.53 ± 0.98    55.24 ± 0.80    55.60 ± 0.72
TTS-DA        73.67 ± 0.85    69.97 ± 1.01    55.21 ± 0.81    55.83 ± 0.71
Lower-anchor  37.74 ± 1.10    39.51 ± 1.46    −               −
5. CONCLUSIONS

In this work, we proposed a novel extremely low-resource TTS method called Voice Filter that can produce high-quality speech using only 1 minute of target speech.

Voice Filter splits the TTS process into a speech content generation task and a speaker identity generation task. The speaker identity is generated via a fine-tuned one-to-many VC module, which makes it easily scalable to new speakers even in extremely low-resource settings. The speech content generation module is a duration-controllable single-speaker TTS system, which has the added benefit of enabling us to generate a synthetic parallel corpus. This allows Voice Filter to operate in a parallel frame-level condition, which has a higher quality ceiling and lower modelling complexity. Evaluations show that our Voice Filter outperforms other few-shot speech synthesis techniques in terms of objective and subjective metrics in the 1-minute data scenario, with quality comparable to a state-of-the-art system trained on 30 times more data.

In conclusion, we consider the Voice Filter paradigm to be a first step towards building extremely low-resource TTS as a post-processing VC plug-in. Moreover, we believe that the generation of synthetic parallel duration-controllable data will enable further scenarios in speech technologies that were previously limited by data availability.
6. REFERENCES

[1] J. Latorre, J. Lachowicz, J. Lorenzo-Trueba, et al., "Effect of data reduction on sequence-to-sequence neural tts," in Proc. ICASSP. IEEE, 2019, pp. 7075–7079.
[2] Y.-A. Chung, Y. Wang, W.-N. Hsu, Y. Zhang, and R. Skerry-Ryan, "Semi-supervised training for improving data efficiency in end-to-end speech synthesis," in Proc. ICASSP. IEEE, 2019, pp. 6940–6944.
[3] Y.-J. Chen, T. Tu, C.-C. Yeh, and H.-Y. Lee, "End-to-End Text-to-Speech for Low-Resource Languages by Cross-Lingual Transfer Learning," in Proc. Interspeech 2019, 2019, pp. 2075–2079.
[4] H. Zhang and Y. Lin, "Unsupervised Learning for Sequence-to-Sequence Text-to-Speech for Low-Resource Languages," in Proc. Interspeech 2020, 2020, pp. 3161–3165.
[5] J. Xu, X. Tan, Y. Ren, et al., "Lrspeech: Extremely low-resource speech synthesis and recognition," in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2802–2812.
[6] G. Huybrechts, T. Merritt, G. Comini, et al., "Low-resource expressive text-to-speech using data augmentation," in Proc. ICASSP. IEEE, 2021, pp. 6593–6597.
[7] Q. Xie, X. Tian, G. Liu, et al., "The multi-speaker multi-style voice cloning challenge 2021," in Proc. ICASSP, 2021, pp. 8613–8617.
[8] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, "Voiceloop: Voice fitting and synthesis via a phonological loop," in International Conference on Learning Representations, 2018.
[9] Y. Chen, Y. Assael, B. Shillingford, et al., "Sample efficient adaptive text-to-speech," in International Conference on Learning Representations, 2019.
[10] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, "Neural discrete representation learning," in Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, et al., Eds., 2017, vol. 30, Curran Associates, Inc.
[11] A. Razavi, A. van den Oord, and O. Vinyals, "Generating diverse high-fidelity images with vq-vae-2," in Advances in Neural Information Processing Systems, 2019, pp. 14866–14876.
[12] D.-Y. Wu, Y.-H. Chen, and H.-Y. Lee, "VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net Architecture," in Proc. Interspeech 2020, 2020, pp. 4691–4695.
[13] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[14] S. Choi, S. Han, D. Kim, and S. Ha, "Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding," in Proc. Interspeech 2020, 2020, pp. 2007–2011.
[15] T. Wang, J. Tao, R. Fu, et al., "Bi-level speaker supervision for one-shot speech synthesis," in Proc. Interspeech, 2020, pp. 3989–3993.
[16] Z. Cai, C. Zhang, and M. Li, "From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint," in Proc. Interspeech 2020, 2020, pp. 3974–3978.
[17] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, "Neural voice cloning with a few samples," in Advances in Neural Information Processing Systems, 2018, vol. 31.
[18] Y. Jia, Y. Zhang, R. J. Weiss, et al., "Transfer learning from speaker verification to multispeaker text-to-speech synthesis," in Advances in Neural Information Processing Systems 31, 2018, pp. 4485–4495.
[19] Z. Kons, S. Shechtman, A. Sorin, C. Rabinovitz, and R. Hoory, "High Quality, Lightweight and Adaptable TTS Using LPCNet," in Proc. Interspeech 2019, 2019, pp. 176–180.
[20] H. B. Moss, V. Aggarwal, N. Prateek, J. González, and R. Barra-Chicote, "Boffin tts: Few-shot speaker adaptation by bayesian optimization," in Proc. ICASSP. IEEE, 2020, pp. 7639–7643.
[21] Z. Zhang, Q. Tian, H. Lu, L.-H. Chen, and S. Liu, "Adadurian: Few-shot adaptation for neural text-to-speech with durian," arXiv preprint arXiv:2005.05642, 2020.
[22] M. Chen, X. Tan, B. Li, et al., "Adaspeech: Adaptive text to speech for custom voice," in International Conference on Learning Representations (ICLR), 2021.
[23] B. Lorincz, A. Stan, and M. Giurgiu, "Speaker verification-derived loss and data augmentation for dnn-based multispeaker speech synthesis," in 29th European Signal Processing Conference (EUSIPCO 2021), Dublin, Ireland, August 23-27, 2021. IEEE, 2021, pp. 26–30.
[24] R. Shah, K. Pokora, A. Ezzerg, et al., "Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech," in Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 2021, pp. 96–101.
[25] T. Wang, J. Tao, R. Fu, et al., "Spoken content and voice factorization for few-shot speaker adaptation," in Proc. Interspeech, 2020, pp. 796–800.
[26] A. Oord, Y. Li, I. Babuschkin, et al., "Parallel wavenet: Fast high-fidelity speech synthesis," in International Conference on Machine Learning. PMLR, 2018, pp. 3918–3926.
[27] V. Peddinti, G. Chen, V. Manohar, et al., "Jhu aspire system: Robust lvcsr with tdnns, ivector adaptation and rnn-lms," in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 539–546.
[28] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, "Generalized end-to-end loss for speaker verification," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4879–4883.
[29] D. Talkin and W. B. Kleijn, "A robust algorithm for pitch tracking (RAPT)," Speech Coding and Synthesis, vol. 495, pp. 518, 1995.
[30] M. Bińkowski, J. Donahue, S. Dieleman, et al., "High fidelity speech synthesis with adversarial networks," in International Conference on Learning Representations, 2020.
[31] M. Ott, S. Edunov, A. Baevski, et al., "fairseq: A fast, extensible toolkit for sequence modeling," in Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
[32] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 12449–12460.
[33] S. Karlapati, A. Moinet, A. Joly, et al., "Copycat: Many-to-many fine-grained prosody transfer for neural text-to-speech," in Proc. Interspeech, 2020.