
VOICE FILTER: FEW-SHOT TEXT-TO-SPEECH SPEAKER ADAPTATION USING VOICE CONVERSION AS A POST-PROCESSING MODULE

Adam Gabryś⋆, Goeric Huybrechts⋆, Manuel Sam Ribeiro⋆, Chung-Ming Chien†, Julian Roth⋆, Giulia Comini⋆, Roberto Barra-Chicote⋆, Bartek Perz⋆, Jaime Lorenzo-Trueba⋆

⋆Alexa AI    †National Taiwan University (NTU); work carried out as an intern.
ABSTRACT

State-of-the-art text-to-speech (TTS) systems require several hours of recorded speech data to generate high-quality synthetic speech. When using reduced amounts of training data, standard TTS models suffer from speech quality and intelligibility degradations, making training low-resource TTS systems problematic. In this paper, we propose a novel extremely low-resource TTS method called Voice Filter that uses as little as one minute of speech from a target speaker. It uses voice conversion (VC) as a post-processing module appended to a pre-existing high-quality TTS system and marks a conceptual shift in the existing TTS paradigm, framing the few-shot TTS problem as a VC task. Furthermore, we propose to use a duration-controllable TTS system to create a parallel speech corpus to facilitate the VC task. Results show that the Voice Filter outperforms state-of-the-art few-shot speech synthesis techniques in terms of objective and subjective metrics on one minute of speech on a diverse set of voices, while being competitive against a TTS model built on 30 times more data. (Audio samples will be made publicly available with a related blog post on www.amazon.science.)

Index Terms— Text-To-Speech, Speaker Adaptation, Voice Conversion, Few-Shot Learning

1. INTRODUCTION

State-of-the-art text-to-speech (TTS) technologies are capable of generating high-quality synthetic speech in a variety of situations. To achieve very high quality, TTS typically requires several hours of studio-quality data, drawn from either a single or multiple speakers [1]. As such, reducing the amount of speech data to a few hours imposes limitations on the quality and intelligibility of those systems [2]. Because it is not always feasible to collect several hours of speech data, particularly when scaling TTS voices to a large number of new speakers, the problem of building low-resource TTS voices has been thoroughly explored [3, 4, 5, 6]. When the aim is to build high-quality TTS systems in extremely low-resource scenarios, such as when only one minute of speech from the target speaker is available, we primarily aim to capture speaker identity, and we must defer the modelling of phonetic and prosodic variability to supplementary speech data.

Common approaches to this problem therefore rely on speaker adaptation, whereby the parameters of a multi-speaker model are optimized on a few samples from the target speaker [7, 8, 9]. The adaptation process aims to modify only the speaker identity, i.e. the speech attributes defining the target speaker as an individual. To control speaker identity in few-shot speaker adaptation, there are techniques such as Vector-Quantized models [10, 11, 12], U-Net structures [13, 12], attention mechanisms [14] or a combination of loss functions [15, 16]. Alternatively, speaker identity can be controlled through representations that are inherited from external speaker verification systems or jointly trained with the primary model [17, 18, 16]. During adaptation, studies have proposed to optimize all model parameters [9, 19], selected components [20, 21, 22], or to focus instead on external speaker representations [9, 17]. An alternative approach is to address the adaptation problem via data augmentation. This can be done through conventional signal processing techniques [23], but more sophisticated methods propose producing high-quality synthetic data for the target speaker through a voice conversion (VC) model [6, 24]. The TTS system is then optimized from scratch on a mixture of natural and synthetic data.

There are, however, shortcomings associated with these methods. Wang et al. [25] suggest that, with the speaker adaptation strategy, a single architecture becomes responsible for modelling both linguistic content and speaker identity. As it is not entirely clear in this framework which model parameters are responsible for speaker identity, the impact of parameter adaptation can be diluted. Additionally, fine-tuning a complex architecture on a few samples can easily lead to model over-fitting, reducing overall quality and intelligibility. On the other hand, data augmentation-based approaches still require at least 15 minutes of training data from the target speaker in order to optimise the TTS models. As such, they are not directly applicable to very low-resource scenarios.

In this paper, we address the problem of extremely low-resource TTS by using VC as a post-processing module, which we refer to as "Voice Filter", on top of a high-quality single-speaker TTS model. This single-speaker TTS model is also used to generate a synthetic parallel corpus for the Voice Filter training. Our proposal presents the following novelties and advantages: (1) The overall process becomes modular, splitting it into a speech content generation task followed by a speaker identity generation one. This improves efficiency, robustness and interpretability, as well as enabling task-dependent adaptation. (2) We leverage the strengths of parallel VC without assuming that a parallel corpus is available, by synthetically generating frame-level matching speech pairs via a duration-controllable TTS model.

In summary, we split the problem of traditional few-shot TTS voice creation into two tasks: speech content generation and speaker identity generation. This split means we can lower the amount of speech required to train a synthetic voice for a particular speaker identity down to as little as one minute by limiting the complexity of the problem. The resulting synthetic speech quality is comparable to that of TTS models trained on 30 times more data.
[Figure 1 shows three stages: data preparation (extracting the target Mel-spectrogram, fundamental frequency and utterance-level speaker embedding from each waveform and generating a frame-matching TTS-predicted parallel Mel-spectrogram from the phoneme sequence), pre-training the background Voice Filter on multiple speakers, and fine-tuning the Voice Filter on a target speaker using the centroid of its speaker embeddings.]

Fig. 1. Synthetic parallel data preparation and training flowchart for the proposed Voice Filter.

2. METHOD

Voice Filter approaches the problem of extremely low-resource TTS voice building by decoupling the speech content and the speaker identity generation into two separate tasks, with the Voice Filter itself focusing on the latter. This results in more modularization and more robust speaker identity generation compared to adapting a multi-speaker TTS model. Since the Voice Filter is in charge of speaker identity generation only, it operates at a lower abstraction level (Mel-spectrograms) than the overall TTS system (phonemes). We believe that this speech-to-speech task is easier than the text-to-speech one, especially in very low-resource settings.

The proposed VC module (Voice Filter) is located between a single-speaker duration-controllable TTS model [24] and a universal neural vocoder [26]. This enables us to generate Mel-spectrograms for any desired text with the TTS model, which are subsequently given the appropriate speaker identity by the Voice Filter and ultimately converted into a time-domain waveform by the vocoder.

2.1. Creation of synthetic parallel corpus

The proposed method begins by synthetically generating a parallel dataset to enable training of our Voice Filter (Figure 1, Data preparation). Working with a synthetic parallel corpus allows us to overcome two of the biggest limitations of parallel VC, while still leveraging its strengths: (1) it is no longer required to have a large parallel speech corpus between source and target speakers, which is hard and expensive to collect; and (2) it is no longer necessary to apply duration warping methods to align the parallel corpora across the multiple speakers. This is possible because the input to the Voice Filter is duration-controlled synthetic speech rather than recordings. However, this also means that, in order to generate the parallel corpus, we require two different datasets: a single-speaker corpus for the TTS system and a multi-speaker corpus for the Voice Filter training.

Creating the synthetic parallel corpus consists of three steps:
1. Force-align all available data at the phone level, including both the single-speaker and multi-speaker corpora, which we did using the pre-trained Kaldi ASpIRE TDNN system [27].
2. Train a duration-controllable TTS system [24] on the single-speaker corpus.
3. Generate synthetic data matching the transcripts and phone-level durations of the multi-speaker corpus aligned in the first step, using the trained duration-controllable TTS system.

This three-step process results in a parallel corpus that is aligned at the frame level between synthetic single-speaker and natural multi-speaker speech samples. It forms the training data for the proposed Voice Filter, which, to the best of our knowledge, is a novel inclusion to the VC problem.

For the purpose of this paper and experimentation, the single-speaker corpus contains 10 hours of high-quality speech data read in a neutral speaking style by a male US English speaker. The multi-speaker corpus consists of 120 gender-balanced male and female US English speakers, with approximately 40 minutes of data per speaker and complete phonetic coverage.
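For illustration, the data-preparation flow of Figure 1 could be organised as in the sketch below. All helper objects (the aligner, the duration-controllable TTS and the Mel, F0 and speaker-embedding extractors) are hypothetical stand-ins for the tools named above (Kaldi ASpIRE, the TTS of [24], RAPT and the speaker verification model); only the overall structure follows the paper.

```python
# Sketch of the synthetic parallel corpus creation (Section 2.1).
# All helper functions are hypothetical stand-ins for the tools named in the
# paper; only the overall flow is illustrated.
from dataclasses import dataclass
import numpy as np

@dataclass
class ParallelPair:
    source_mel: np.ndarray        # synthetic single-speaker Mel (80 x T)
    target_mel: np.ndarray        # natural multi-speaker Mel (80 x T)
    target_log_f0: np.ndarray     # log-F0 contour of the recording (1 x T)
    speaker_embedding: np.ndarray # utterance-level embedding (256,)

def build_parallel_corpus(multi_speaker_utterances, tts, aligner,
                          mel_extractor, f0_extractor, spk_encoder):
    pairs = []
    for utt in multi_speaker_utterances:
        # Step 1: force-align the recording to obtain phone-level durations.
        phones, durations = aligner.align(utt.waveform, utt.transcript)
        # Steps 2-3: re-synthesize the same phone sequence with the
        # single-speaker, duration-controllable TTS so that both sides of the
        # pair match frame by frame.
        source_mel = tts.synthesize(phones, durations)
        pairs.append(ParallelPair(
            source_mel=source_mel,
            target_mel=mel_extractor(utt.waveform),
            target_log_f0=f0_extractor(utt.waveform),
            speaker_embedding=spk_encoder(utt.waveform),
        ))
    return pairs
```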
2.2. Model architecture

The Voice Filter model (Figure 2) takes as input and output 80-bin Mel-spectrograms of equal length. It consists of a 6-stack of size-preserving 1D convolutions with 512 channels, a kernel size of 5 and batch normalization, followed by a uni-directional LSTM and a Dense layer with 1024 nodes.

We concatenate the target speaker embedding and log-f0 contour to the hidden representation of the third convolution layer. The speaker embedding is a 256-dimensional vector defined at the utterance level and broadcast to the frame level. The speaker verification system used to extract the embeddings was trained on the multi-speaker corpus and optimized with a Generalized End-to-End Loss [28]. We observed that the log-f0 contour helps the model better absorb the prosodic differences between the input and target speaker. Effectively, this means that the Voice Filter does not need to learn how to adjust prosody between the source and the target speakers, but can instead focus on speaker-defining information. To extract the log-f0 contour from the target speech recordings, we used the RAPT algorithm [29] of the Speech Processing Toolkit (SPTK, http://sp-tk.sourceforge.net) with a threshold of 0 for voiced/unvoiced regions.
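A minimal PyTorch sketch of this architecture is shown below. Values not stated in the text, such as the convolution padding, the LSTM hidden size and the final 80-channel projection (read off Figure 2), are assumptions; the sketch illustrates the layer layout rather than the authors' exact implementation.

```python
# Minimal PyTorch sketch of the Voice Filter (Section 2.2 / Figure 2).
# Sizes not stated in the paper (padding, LSTM hidden size, final projection)
# are assumptions; this is an illustration, not the authors' implementation.
import torch
import torch.nn as nn

class VoiceFilter(nn.Module):
    def __init__(self, n_mels=80, channels=512, kernel=5,
                 spk_dim=256, lstm_dim=512, dense_dim=1024):
        super().__init__()
        def conv(in_ch, out_ch):
            return nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),
                nn.BatchNorm1d(out_ch), nn.ReLU())
        # 6-stack of size-preserving 1D convolutions with batch-norm.
        self.pre = nn.ModuleList([conv(n_mels, channels),
                                  conv(channels, channels),
                                  conv(channels, channels)])
        # Speaker embedding (256) and log-F0 (1) are concatenated to the
        # hidden representation after the third convolution.
        self.post = nn.ModuleList([conv(channels + spk_dim + 1, channels),
                                   conv(channels, channels),
                                   conv(channels, channels)])
        self.lstm = nn.LSTM(channels, lstm_dim, batch_first=True)
        self.dense = nn.Linear(lstm_dim, dense_dim)
        self.out = nn.Linear(dense_dim, n_mels)

    def forward(self, mel, log_f0, spk_emb):
        # mel: (B, 80, T), log_f0: (B, 1, T), spk_emb: (B, 256)
        x = mel
        for layer in self.pre:
            x = layer(x)
        # Broadcast the utterance-level embedding to the frame level.
        spk = spk_emb.unsqueeze(-1).expand(-1, -1, x.size(-1))
        x = torch.cat([x, spk, log_f0], dim=1)
        for layer in self.post:
            x = layer(x)
        x, _ = self.lstm(x.transpose(1, 2))                          # (B, T, lstm_dim)
        return self.out(torch.relu(self.dense(x))).transpose(1, 2)  # (B, 80, T)
```

At training time its output is compared with the target speaker's Mel-spectrogram via the L1 loss described in Section 2.3.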
2.3. Model training and fine-tuning

Training a Voice Filter model that can generate speech from 1 minute of unseen speaker data is a two-step process: (1) background model training (Figure 1, Pre-train VF) and (2) fine-tuning on the minute of target unseen speaker data (Figure 1, Fine-tune VF).
[Figure 2 shows the Voice Filter architecture: a stack of six 1D convolutions (512 channels, kernel size 5) with batch normalization, where the F0 contour (1 x T) and the up-sampled centroid of the target speaker embedding (256 x 1) are concatenated into the hidden representation, followed by an LSTM (512 channels) and Dense layers (1024 and 80 channels) producing the voice-converted Mel-spectrogram (80 x T), trained with an L1 loss against the target Mel-spectrogram.]

Fig. 2. Proposed Voice Filter architecture.

[Figure 3 shows the inference flowchart: a phoneme sequence is synthesised by the single-speaker TTS, F0 is extracted and adapted to the target speaker's mean and variance, the Voice Filter converts the Mel-spectrogram conditioned on the target speaker embedding, and the universal vocoder produces the voice-converted waveform.]

Fig. 3. Proposed Voice Filter inference flowchart.

1. The background Voice Filter is trained in a one-to-many fashion for 1 million steps on the entire synthetic parallel multi-speaker corpus generated in the previous step. This model is capable of converting to any of the speakers seen during training, but is not robust enough to generalize to unseen speakers without further processing.
2. We adapt the background model into the target Voice Filter by fine-tuning all of its parameters for 1000 steps on the target speaker's single minute of speech in a one-to-one fashion. We use the centroid of the target utterance-level speaker embeddings, as we observed that, in our few-shot scenario, fine-tuning on a constant speaker embedding rather than on variable utterance-level embeddings resulted in more stable models. We have not tested the impact on quality for non-target speakers after fine-tuning, but we consider the resulting target Voice Filter to be speaker-dependent.

Both the background and target Voice Filter models are trained using the L1 spectral loss and the ADAM optimiser with default settings.
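The two-stage recipe above could be organised roughly as follows, reusing the VoiceFilter module and parallel-data records sketched earlier. Only the L1 loss, the Adam optimiser with default settings and the step counts come from the paper; batching and data loading are assumed.

```python
# Sketch of the two-step training recipe (Section 2.3). Only the L1 loss,
# Adam with default settings, 1M pre-training steps and 1000 fine-tuning
# steps come from the paper; the rest is illustrative.
import torch
import torch.nn.functional as F

def train_steps(model, batches, n_steps, constant_spk_emb=None):
    opt = torch.optim.Adam(model.parameters())  # default settings, as in the paper
    # `batches` is assumed to be an (effectively endless) iterator yielding
    # (source_mel, target_mel, log_f0, speaker_embedding) tensors.
    for _, (src_mel, tgt_mel, log_f0, spk_emb) in zip(range(n_steps), batches):
        # During fine-tuning a single centroid embedding replaces the
        # per-utterance embeddings, which the authors found more stable.
        if constant_spk_emb is not None:
            spk_emb = constant_spk_emb.expand(src_mel.size(0), -1)
        pred = model(src_mel, log_f0, spk_emb)
        loss = F.l1_loss(pred, tgt_mel)  # L1 spectral loss
        opt.zero_grad()
        loss.backward()
        opt.step()

# (1) Background model: one-to-many training on the full synthetic parallel corpus.
# train_steps(voice_filter, multi_speaker_batches, n_steps=1_000_000)
# (2) Target model: fine-tune all parameters on 1 minute of target speech,
#     conditioning on the centroid of its utterance-level embeddings.
# centroid = target_speaker_embeddings.mean(dim=0)
# train_steps(voice_filter, target_speaker_batches, n_steps=1000, constant_spk_emb=centroid)
```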
2.4. Model Inference

Inference on the complete model (Figure 3) requires us to run several models in succession:
1. Inference of the source Mel-spectrogram for the desired text and predicted durations with the single-speaker TTS model.
2. Estimation of f0 from the source Mel-spectrogram and re-normalization to the mean and variance of the target speaker.
3. Conversion of the source Mel-spectrogram to the target speaker by the fine-tuned Voice Filter.
4. Synthesis of the voice-converted Mel-spectrogram into a time-domain waveform via the vocoder.

At this moment, we do not adapt the speaking rate or phone durations to those of the target speaker, as they are not simple to estimate in extremely low-resource scenarios and lead to significant artefacts.
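Put together, inference could look like the sketch below; `tts`, `estimate_f0` and `vocoder` are placeholders for the single-speaker TTS, the f0 estimator and the universal vocoder, and only the ordering of the four steps and the mean/variance renormalisation are taken from the text.

```python
# Sketch of the inference pipeline (Section 2.4 / Figure 3). `tts`,
# `estimate_f0`, `voice_filter` and `vocoder` are placeholders for the
# components described in the paper.
import torch

def synthesize(phonemes, tts, estimate_f0, voice_filter, vocoder,
               target_spk_centroid, target_f0_mean, target_f0_std):
    # 1. Source Mel-spectrogram (with predicted durations) from the single-speaker TTS.
    src_mel = tts(phonemes)                       # (1, 80, T)
    # 2. Estimate log-F0 from the source Mel and renormalise it to the
    #    target speaker's mean and variance.
    log_f0 = estimate_f0(src_mel)                 # (1, 1, T)
    log_f0 = (log_f0 - log_f0.mean()) / (log_f0.std() + 1e-8)
    log_f0 = log_f0 * target_f0_std + target_f0_mean
    # 3. Convert the Mel-spectrogram to the target speaker's identity;
    #    target_spk_centroid: (1, 256) centroid speaker embedding.
    with torch.no_grad():
        converted = voice_filter(src_mel, log_f0, target_spk_centroid)
    # 4. Render the voice-converted Mel-spectrogram to a waveform.
    return vocoder(converted)
```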
3. EXPERIMENTAL SETUP

Models were assessed using both objective and perceptual metrics. For our evaluations we used 4 male and 4 female speakers with 50 test utterances per speaker, resulting in 400 prompts overall.

Signal quality is objectively measured using the conditional Fréchet Speech Distance (cFSD) [30]. Specifically, a pre-trained XLSR-53 [31] wav2vec-2.0 [32] model is used to generate the activation distributions for recordings and synthesised samples. The distributions are then compared with a Fréchet Distance, which provides a measure of how close the generated speech is to an actual recording. To objectively estimate speaker similarity, we used the mean cosine distance between speaker embeddings (CSED) of recordings and predicted samples.
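For reference, the two objective metrics can be computed along the following lines; the feature extractors (the speaker verification model and the wav2vec-2.0 network) are outside the scope of the sketch, CSED is assumed to be averaged over paired utterances, and the usual Gaussian formulation of the Fréchet distance is assumed.

```python
# Sketch of the objective metrics (Section 3): mean cosine distance between
# speaker embeddings (CSED) and a Fréchet distance between Gaussian fits of
# wav2vec-2.0 activations (cFSD). Feature extraction is out of scope.
import numpy as np
from scipy.linalg import sqrtm

def csed(recording_embs, predicted_embs):
    # Mean cosine distance between paired speaker embeddings (N x D each).
    a = recording_embs / np.linalg.norm(recording_embs, axis=1, keepdims=True)
    b = predicted_embs / np.linalg.norm(predicted_embs, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(a * b, axis=1)))

def frechet_distance(acts_real, acts_fake):
    # Fréchet distance between Gaussians fitted to activation sets (N x D).
    mu1, mu2 = acts_real.mean(0), acts_fake.mean(0)
    c1 = np.cov(acts_real, rowvar=False)
    c2 = np.cov(acts_fake, rowvar=False)
    covmean = sqrtm(c1 @ c2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(c1 + c2 - 2.0 * covmean))
```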
MUltiple Stimuli with Hidden Reference and Anchor (MUSHRA) tests were used to perceptually assess naturalness, signal quality, speaker similarity and speaking style similarity. Samples from the systems being evaluated were presented to participants side by side. They were asked to score them on a scale from 0 (the worst) to 100 (the best) in terms of the metric being evaluated. We used the crowd-sourcing platform ClickWorker to assess each test utterance by a panel of 25 listeners. Target speaker recordings were always included as a hidden upper-anchor system and we did not enforce requirements for at least one system to be rated 100. We provided the listeners with a reference sample for both the speaker and style similarity evaluations. The lower-anchor for the speaker similarity evaluation was the voice-converted samples of the furthest same-gender speaker in the speaker embedding space. The lower-anchor for the style similarity evaluation was the un-filtered TTS system. Paired two-sided Student t-tests with Holm-Bonferroni correction were used to validate the statistical significance of the differences between two systems at a p-value threshold of 0.05.
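A minimal sketch of this significance test is given below, using SciPy's paired t-test and a step-down Holm-Bonferroni correction; how scores are paired across listeners and utterances is an assumption.

```python
# Sketch of the significance testing (Section 3): paired two-sided t-tests
# between per-utterance scores of two systems, with Holm-Bonferroni
# correction across the set of system pairs, at alpha = 0.05.
import numpy as np
from scipy.stats import ttest_rel

def holm_bonferroni(pvalues, alpha=0.05):
    # Returns a boolean array: True where the null hypothesis is rejected.
    order = np.argsort(pvalues)
    reject = np.zeros(len(pvalues), dtype=bool)
    for rank, idx in enumerate(order):
        if pvalues[idx] <= alpha / (len(pvalues) - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all remaining (larger) p-values fail too
    return reject

def compare_systems(score_pairs, alpha=0.05):
    # score_pairs: list of (scores_a, scores_b) arrays aligned per utterance/listener.
    pvals = np.array([ttest_rel(a, b).pvalue for a, b in score_pairs])
    return pvals, holm_bonferroni(pvals, alpha)
```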
4. RESULTS

4.1. Extremely low-resource speech synthesis performance

We compared our system with two multi-speaker state-of-the-art technologies that have shown high performance for low-resource speech synthesis: the CopyCat (CC) model [33] with an additional f0 conditioning (preliminary results indicated that f0-conditioned CopyCat showed better signal quality stability and constituted a fairer comparison) and a duration-controllable multi-speaker TTS (MS-TTS) model [24] without the data augmentation component. Both models were trained on the same dataset and conditions described in Section 2.3, including fine-tuning for 1000 steps on the 1 minute of target speaker data.

Objective metrics and perceptual MUSHRA evaluation scores are reported in Tables 1 (rows 1-3) and 2, respectively. We observe a statistical preference in the MUSHRA evaluation for the proposed system in terms of all evaluated metrics. The objective metrics are aligned with these observations, indicating that our proposed method outperforms other speaker adaptation techniques when using the same amount of data.
System    Speaker sim. (CSED)    Signal quality (cFSD)
VF        0.192                  0.197
CC        0.198                  0.249
MS-TTS    0.207                  0.263
TTS-DA    0.205                  0.224

Table 1. Average objective metrics for all evaluated systems. Best numbers are highlighted in bold. TTS-DA was trained on 30 minutes of speech instead of 1.

System        Sp. sim.       Style sim.     Nat.           Sig. Q.
Rec           79.43 ± 0.66   75.53 ± 0.73   78.72 ± 0.64   76.67 ± 0.75
VF            67.96 ± 0.86   64.4 ± 0.82    53.09 ± 0.81   56.28 ± 0.84
CC            66.57 ± 0.88   63.56 ± 0.84   52.08 ± 0.81   55.28 ± 0.85
MS-TTS        65.83 ± 0.91   62.10 ± 0.86   50.71 ± 0.81   54.24 ± 0.85
Lower-anchor  37.90 ± 1.00   37.96 ± 1.12   −              −

Table 2. Average MUSHRA results with a confidence interval of 95%. Best scores with a statistical difference between Voice Filter (VF) and reference systems are highlighted in bold (p < 0.05).

4.2. Ablation study on data quantity

In order to understand the impact of the extremely low-resource scenario, we trained the Voice Filter using 1, 5 and 25 minutes of target speaker data during fine-tuning. Objective metrics and perceptual MUSHRA evaluation scores are reported in Tables 3 and 4, respectively.

In the subjective evaluation, listeners did not perceive a statistically significant difference between the different data scenarios, although objective metrics still show slight improvements with bigger amounts of data. This hints that, while there may be room to improve the performance of the system, our Voice Filter does not benefit perceptually from richer data scenarios. With only 1 minute of target data, we are able to create high-quality samples.

# min   Speaker sim. (CSED)    Signal quality (cFSD)
1 min   0.192                  0.197
5 min   0.183                  0.189
25 min  0.185                  0.176

Table 3. Average objective metrics for Voice Filter trained using varying amounts of data. Best results are highlighted in bold.

# min   Sp. sim.       Style sim.     Nat.           Sig. Q.
Rec     73.77 ± 0.99   72.06 ± 0.95   78.69 ± 0.70   73.94 ± 0.91
1 min   51.90 ± 0.99   54.67 ± 0.98   55.10 ± 0.84   53.94 ± 0.95
5 min   51.93 ± 1.01   55.36 ± 0.97   55.06 ± 0.83   53.83 ± 0.97
25 min  52.00 ± 1.01   54.75 ± 0.99   55.21 ± 0.82   53.63 ± 0.95

Table 4. Average MUSHRA results with a confidence interval of 95% for Voice Filter trained with varying amounts of data. No statistically significant differences (p < 0.05) between systems.

4.3. Comparison against a competitive TTS

Finally, we investigate the quality of the generated speech when compared to a TTS system trained on a larger amount of target recordings. For that, we compared Voice Filter against a validated low-resource TTS technology that was shown to be competitive when trained on 30 minutes of target speech (TTS-DA) [6]. It is worth noting that such a technology implicitly results in the TTS system estimating the phone durations for the target speaker, which is not the case for the proposed Voice Filter. Effectively, this means we are comparing the proposed Voice Filter against a system trained on 30 times more target data that has also been validated to be competitive against TTS voices trained on 5+ hours of target recordings.

Objective metrics and perceptual MUSHRA evaluation scores are reported in Tables 1 (rows 1 & 4) and 5, respectively. While we observe a 4% relative degradation in speaker similarity, the MUSHRA evaluations indicate no statistical difference between the systems in terms of signal quality, naturalness, and style similarity. On the other hand, in terms of objective metrics, we observe that speaker similarity and signal quality are better for our proposed method. Overall, results show that our model is on par with TTS-DA, with a slight human-perceived degradation in terms of speaker similarity, despite the much smaller target training dataset being used.

System        Sp. sim.       Style sim.     Nat.           Sig. Q.
Rec           83.23 ± 0.66   79.82 ± 0.80   77.05 ± 0.56   76.75 ± 0.52
VF            70.38 ± 0.93   69.53 ± 0.98   55.24 ± 0.80   55.60 ± 0.72
TTS-DA        73.67 ± 0.85   69.97 ± 1.01   55.21 ± 0.81   55.83 ± 0.71
Lower-anchor  37.74 ± 1.10   39.51 ± 1.46   −              −

Table 5. Average MUSHRA results. Best scores with a statistical difference between Voice Filter (VF) and reference systems are highlighted in bold (p < 0.05).

5. CONCLUSIONS

In this work, we proposed a novel extremely low-resource TTS method called Voice Filter that can produce high-quality speech using only 1 minute of target speech.

Voice Filter splits the TTS process into a speech content generation task and a speaker identity generation task. The speaker identity is generated via a fine-tuned one-to-many VC module, which makes it easily scalable to new speakers even in extremely low-resource settings. The speech content generation module is a duration-controllable single-speaker TTS system, which has the added benefit of enabling us to generate a synthetic parallel corpus. This allows Voice Filter to work in a parallel frame-level condition, which has a higher quality ceiling and lower modelling complexity. Evaluations show that our Voice Filter outperforms other few-shot speech synthesis techniques in terms of objective and subjective metrics in the 1-minute data scenario, with quality comparable to a SOTA system trained on 30 times more data.

In conclusion, we consider the Voice Filter paradigm to be a first step towards building extremely low-resource TTS as a post-processing VC plug-in. Moreover, we believe that the generation of synthetic parallel duration-controllable data will enable further scenarios in speech technologies that were previously limited by data availability.
6. REFERENCES

[1] J. Latorre, J. Lachowicz, J. Lorenzo-Trueba, et al., “Effect of data reduction on sequence-to-sequence neural tts,” in Proc. ICASSP. IEEE, 2019, pp. 7075–7079.
[2] Y.-A. Chung, Y. Wang, W.-N. Hsu, Y. Zhang, and R. Skerry-Ryan, “Semi-supervised training for improving data efficiency in end-to-end speech synthesis,” in Proc. ICASSP. IEEE, 2019, pp. 6940–6944.
[3] Y.-J. Chen, T. Tu, C.-C. Yeh, and H.-Y. Lee, “End-to-End Text-to-Speech for Low-Resource Languages by Cross-Lingual Transfer Learning,” in Proc. Interspeech 2019, 2019, pp. 2075–2079.
[4] H. Zhang and Y. Lin, “Unsupervised Learning for Sequence-to-Sequence Text-to-Speech for Low-Resource Languages,” in Proc. Interspeech 2020, 2020, pp. 3161–3165.
[5] J. Xu, X. Tan, Y. Ren, et al., “Lrspeech: Extremely low-resource speech synthesis and recognition,” in Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020, pp. 2802–2812.
[6] G. Huybrechts, T. Merritt, G. Comini, et al., “Low-resource expressive text-to-speech using data augmentation,” in Proc. ICASSP. IEEE, 2021, pp. 6593–6597.
[7] Q. Xie, X. Tian, G. Liu, et al., “The multi-speaker multi-style voice cloning challenge 2021,” in Proc. ICASSP, 2021, pp. 8613–8617.
[8] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “Voiceloop: Voice fitting and synthesis via a phonological loop,” in International Conference on Learning Representations, 2018.
[9] Y. Chen, Y. Assael, B. Shillingford, et al., “Sample efficient adaptive text-to-speech,” in International Conference on Learning Representations, 2019.
[10] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Advances in Neural Information Processing Systems, 2017, vol. 30.
[11] A. Razavi, A. van den Oord, and O. Vinyals, “Generating diverse high-fidelity images with vq-vae-2,” in Advances in Neural Information Processing Systems, 2019, pp. 14866–14876.
[12] D.-Y. Wu, Y.-H. Chen, and H.-Y. Lee, “VQVC+: One-Shot Voice Conversion by Vector Quantization and U-Net Architecture,” in Proc. Interspeech 2020, 2020, pp. 4691–4695.
[13] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[14] S. Choi, S. Han, D. Kim, and S. Ha, “Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding,” in Proc. Interspeech 2020, 2020, pp. 2007–2011.
[15] T. Wang, J. Tao, R. Fu, et al., “Bi-level speaker supervision for one-shot speech synthesis,” in Proc. Interspeech, 2020, pp. 3989–3993.
[16] Z. Cai, C. Zhang, and M. Li, “From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint,” in Proc. Interspeech 2020, 2020, pp. 3974–3978.
[17] S. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice cloning with a few samples,” in Advances in Neural Information Processing Systems, 2018, vol. 31.
[18] Y. Jia, Y. Zhang, R. J. Weiss, et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” in Advances in Neural Information Processing Systems 31, 2018, pp. 4485–4495.
[19] Z. Kons, S. Shechtman, A. Sorin, C. Rabinovitz, and R. Hoory, “High Quality, Lightweight and Adaptable TTS Using LPCNet,” in Proc. Interspeech 2019, 2019, pp. 176–180.
[20] H. B. Moss, V. Aggarwal, N. Prateek, J. González, and R. Barra-Chicote, “Boffin tts: Few-shot speaker adaptation by bayesian optimization,” in Proc. ICASSP. IEEE, 2020, pp. 7639–7643.
[21] Z. Zhang, Q. Tian, H. Lu, L.-H. Chen, and S. Liu, “Adadurian: Few-shot adaptation for neural text-to-speech with durian,” arXiv preprint arXiv:2005.05642, 2020.
[22] M. Chen, X. Tan, B. Li, et al., “Adaspeech: Adaptive text to speech for custom voice,” in International Conference on Learning Representations (ICLR), 2021.
[23] B. Lorincz, A. Stan, and M. Giurgiu, “Speaker verification-derived loss and data augmentation for dnn-based multispeaker speech synthesis,” in 29th European Signal Processing Conference (EUSIPCO). IEEE, 2021, pp. 26–30.
[24] R. Shah, K. Pokora, A. Ezzerg, et al., “Non-Autoregressive TTS with Explicit Duration Modelling for Low-Resource Highly Expressive Speech,” in Proc. 11th ISCA Speech Synthesis Workshop (SSW 11), 2021, pp. 96–101.
[25] T. Wang, J. Tao, R. Fu, et al., “Spoken content and voice factorization for few-shot speaker adaptation,” in Proc. Interspeech, 2020, pp. 796–800.
[26] A. Oord, Y. Li, I. Babuschkin, et al., “Parallel wavenet: Fast high-fidelity speech synthesis,” in International Conference on Machine Learning. PMLR, 2018, pp. 3918–3926.
[27] V. Peddinti, G. Chen, V. Manohar, et al., “Jhu aspire system: Robust lvcsr with tdnns, ivector adaptation and rnn-lms,” in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 539–546.
[28] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in Proc. ICASSP. IEEE, 2018, pp. 4879–4883.
[29] D. Talkin and W. B. Kleijn, “A robust algorithm for pitch tracking (RAPT),” Speech Coding and Synthesis, vol. 495, pp. 518, 1995.
[30] M. Bińkowski, J. Donahue, S. Dieleman, et al., “High fidelity speech synthesis with adversarial networks,” in International Conference on Learning Representations, 2020.
[31] M. Ott, S. Edunov, A. Baevski, et al., “fairseq: A fast, extensible toolkit for sequence modeling,” in Proceedings of NAACL-HLT 2019: Demonstrations, 2019.
[32] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, 2020, vol. 33, pp. 12449–12460.
[33] S. Karlapati, A. Moinet, A. Joly, et al., “Copycat: Many-to-many fine-grained prosody transfer for neural text-to-speech,” in Proc. Interspeech, 2020.
