Figure 1: LE2E model architecture and training discriminators.
The proposed model is shown in Figure 1. LE2E follows the typical GAN-based setup of a generator (G), which includes an acoustic latent encoder and an acoustic decoder, and a set of discriminators (D) divided into two groups: multi-period discriminators (MPD) and multi-resolution discriminators (MRD).

3.1. Generator

The waveform generator consists of two components, an acoustic latent model and a neural vocoder, which are concatenated into a single, jointly trained architecture. The acoustic latent model is inspired by LightSpeech [7], but instead of predicting mel-spectrograms it is trained to produce unsupervised acoustic latents that serve as input to the vocoder. The model takes phoneme and positional embeddings and outputs up-sampled, frame-level latent acoustic embeddings. It is divided into three components: a text encoder (E), a variance adaptor (V) and an acoustic decoder (D). The text encoder generates position-aware phoneme embeddings through a stack of transformer layers, which are then fed into the variance adaptor. The variance adaptor comprises a duration predictor and a pitch predictor. The duration predictor takes the output of the text encoder and predicts phoneme-level durations, which are used to up-sample the phoneme embeddings. The up-sampled phoneme embeddings are then fed to the pitch predictor, which is trained to predict frame-level pitch latents that are added to the up-sampled phoneme embeddings. Finally, the output of the variance adaptor is passed to the acoustic decoder, which uses the same stack of transformer layers as the text encoder. The vocoder generates waveform signals from the latent embeddings produced by the acoustic latent model. It is based on Multi-Band MelGAN [8]: it takes the intermediate acoustic latents and up-samples them to the waveform. The vocoder is composed of a stack of up-sampling blocks, each consisting of a transposed convolution followed by residual blocks with dilated convolutions that increase the receptive field. Following the method in [20], it generates several waveform sub-bands that are then combined into the final output waveform through a Pseudo Quadrature Mirror Filter bank (PQMF).
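For concreteness, the following PyTorch-style sketch traces the generator data flow described above (text encoder, variance adaptor, acoustic decoder, vocoder). Module sizes, layer counts and the length-regulation helper are illustrative assumptions, not the exact LE2E implementation.

```python
# Minimal sketch of the LE2E generator data flow; all hyper-parameters and
# module choices are illustrative stand-ins, not the published configuration.
import torch
import torch.nn as nn


def length_regulate(x: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme embedding by its predicted duration (in frames)."""
    # x: (T_phon, dim), durations: (T_phon,) integer frame counts
    return torch.repeat_interleave(x, durations, dim=0)


class SketchGenerator(nn.Module):
    def __init__(self, n_phonemes=100, dim=128):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=2, batch_first=True)
        self.text_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.duration_predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.pitch_predictor = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        dec_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=2, batch_first=True)
        self.acoustic_decoder = nn.TransformerEncoder(dec_layer, num_layers=2)
        # Vocoder stand-in: transposed convolutions up-sample latents to a waveform.
        self.vocoder = nn.Sequential(
            nn.ConvTranspose1d(dim, dim // 2, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(dim // 2, 1, kernel_size=16, stride=8, padding=4),
            nn.Tanh(),
        )

    def forward(self, phonemes: torch.Tensor) -> torch.Tensor:
        h = self.text_encoder(self.embed(phonemes).unsqueeze(0)).squeeze(0)  # (T_phon, dim)
        log_dur = self.duration_predictor(h).squeeze(-1)                     # phoneme-level log-durations
        dur = torch.clamp(torch.round(torch.exp(log_dur)), min=1).long()
        frames = length_regulate(h, dur)                                     # frame-level embeddings
        frames = frames + self.pitch_predictor(frames)                       # add frame-level pitch latents
        latents = self.acoustic_decoder(frames.unsqueeze(0))                 # (1, T_frames, dim)
        return self.vocoder(latents.transpose(1, 2)).squeeze(1)              # (1, samples)


wav = SketchGenerator()(torch.randint(0, 100, (12,)))
print(wav.shape)
```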
3.2. Discriminators

State-of-the-art E2E-TTS systems use multiple discriminators that guide the generator towards synthesizing a coherent waveform while minimizing perceptual artifacts that are easily picked up by the human ear. We adopt the set of discriminators proposed in BigVGAN [21], which includes a multi-period discriminator (MPD) and a multi-resolution discriminator (MRD). Each of them comprises several sub-discriminators that operate on different resolution windows of the waveform. The MPD reshapes the predicted waveform into 2D representations of varying heights and widths in order to capture its multiple periodic structures separately. The MRD, in turn, consists of several sub-discriminators operating on linear spectrograms computed with different short-time Fourier transform (STFT) resolutions.
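As an illustration of the MPD input reshaping, the snippet below folds a 1D waveform into a 2D (frames x period) view for a set of assumed periods, in the spirit of HiFi-GAN/BigVGAN-style multi-period discriminators; the padding strategy and period values are assumptions.

```python
# Sketch: reshape a 1D waveform into a 2D view for one MPD period.
# Periods and padding mode are illustrative assumptions.
import torch
import torch.nn.functional as F


def fold_for_period(wav: torch.Tensor, period: int) -> torch.Tensor:
    """wav: (batch, 1, samples) -> (batch, 1, samples // period, period)."""
    b, c, t = wav.shape
    if t % period != 0:                       # right-pad so the length divides the period
        pad = period - (t % period)
        wav = F.pad(wav, (0, pad), mode="reflect")
        t = t + pad
    return wav.view(b, c, t // period, period)  # height = frames, width = period


wav = torch.randn(1, 1, 16000)
for p in (2, 3, 5, 7, 11):                    # typical MPD periods (assumption)
    x2d = fold_for_period(wav, p)
    # Each 2D view would then be scored by a stack of 2D convolutions (omitted here).
    print(p, tuple(x2d.shape))
```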
3.3. Training objectives

The training objective of LE2E combines losses applied to the duration predictor, the pitch predictor and the generated waveform; the waveform-level terms include GAN losses and regression losses in the form of a power loss and a multi-resolution STFT loss.

3.3.1. Duration loss

The acoustic latent model takes the hidden representation of the input phonemes and predicts the frame-level duration of each of them on a logarithmic scale. The duration predictor is optimized with a mean square error (MSE) loss, L_dur, between the predicted durations and oracle durations extracted by the external aligner used in our experiments, which is based on the Kaldi Speech Recognition Toolkit [22].
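A minimal sketch of this objective, assuming oracle aligner durations are given in frames and the predictor outputs log-durations; variable names and values are illustrative.

```python
# Sketch: MSE duration loss in the log domain (L_dur).
import torch
import torch.nn.functional as F

oracle_frames = torch.tensor([3.0, 7.0, 5.0, 12.0])        # aligner durations per phoneme (frames)
predicted_log_dur = torch.tensor([1.0, 2.0, 1.7, 2.4])     # duration predictor output (log frames)

l_dur = F.mse_loss(predicted_log_dur, torch.log(oracle_frames))
print(l_dur.item())
```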
3.3.2. Pitch loss

Pitch prediction is typically framed as a regression task that estimates the exact pitch value [12, 23]. However, due to the high variability of ground-truth pitch contours, we replace the regression task with cross-entropy density modelling. To accomplish this, we follow [24] and apply a 256-bin quantization to the standardized pitch signal, followed by a cross-entropy loss, L_f0, on top of the softmaxed predicted pitch logits.
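The quantization and cross-entropy objective can be realized as sketched below; the standardization range and bin edges are assumptions for illustration.

```python
# Sketch: quantize a standardized F0 contour into 256 bins and apply cross-entropy (L_f0).
import torch
import torch.nn.functional as F

num_bins = 256
f0 = torch.randn(100)                                   # standardized (zero-mean, unit-variance) pitch, 100 frames
edges = torch.linspace(-4.0, 4.0, num_bins - 1)         # assumed clipping range of +/- 4 standard deviations
targets = torch.bucketize(f0, edges)                    # integer bin index per frame, in [0, 255]

logits = torch.randn(100, num_bins)                     # pitch-predictor logits (one row per frame)
l_f0 = F.cross_entropy(logits, targets)                 # softmax + negative log-likelihood over bins
print(l_f0.item())
```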
3.3.3. Adversarial training

For notational simplicity, we group all K sub-discriminators of the MPD and MRD into a single set of discriminators D_k.

GAN loss. The main adversarial training objective follows the least-squares GAN formulation, which avoids vanishing gradients. Each discriminator is trained to classify target samples as 1 and predicted samples as 0, while the generator is trained to fool the discriminators by pushing its samples towards being classified as real.
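A generic least-squares GAN formulation of these objectives for a single discriminator D_k is sketched below (real to 1 and fake to 0 for the discriminator, fake to 1 for the generator); the per-discriminator terms would then be summed over all K sub-discriminators.

```python
# Sketch: least-squares GAN losses for a single discriminator D_k.
import torch


def d_loss(real_scores: torch.Tensor, fake_scores: torch.Tensor) -> torch.Tensor:
    # Discriminator: push real scores to 1 and generated scores to 0.
    return ((real_scores - 1.0) ** 2).mean() + (fake_scores ** 2).mean()


def g_loss(fake_scores: torch.Tensor) -> torch.Tensor:
    # Generator: make the discriminator score generated samples as real (1).
    return ((fake_scores - 1.0) ** 2).mean()


real = torch.rand(8, 1)       # D_k outputs for recordings
fake = torch.rand(8, 1)       # D_k outputs for generated waveforms
print(d_loss(real, fake).item(), g_loss(fake).item())
```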
3.3.5. Total loss

Summing all of the described loss terms, we obtain the final loss L for the generator in the joint E2E training of the proposed architecture:

$\mathcal{L} = \mathcal{L}_{dur} + \mathcal{L}_{f0} + \mathcal{L}_{G} + \lambda_{FM}\mathcal{L}_{FM} + \lambda_{mel}\mathcal{L}_{mel} + \lambda_{STFT}\mathcal{L}_{STFT}$    (8)
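As a sketch of how Eq. (8) combines the individual terms, the snippet below forms the weighted sum; the loss values and lambda weights shown are placeholders, not the settings used in the paper, and the feature-matching, mel/power and STFT terms are assumed to be computed elsewhere.

```python
# Sketch: weighted sum of the generator objectives in Eq. (8).
# Individual loss values and lambda weights are illustrative placeholders.
import torch

losses = {
    "dur": torch.tensor(0.31),    # L_dur  (Section 3.3.1)
    "f0": torch.tensor(2.10),     # L_f0   (Section 3.3.2)
    "gan": torch.tensor(0.85),    # L_G    (Section 3.3.3)
    "fm": torch.tensor(1.40),     # L_FM   feature-matching term (assumed precomputed)
    "mel": torch.tensor(0.62),    # L_mel  power/mel regression term (assumed precomputed)
    "stft": torch.tensor(0.44),   # L_STFT multi-resolution STFT term (assumed precomputed)
}
weights = {"dur": 1.0, "f0": 1.0, "gan": 1.0, "fm": 2.0, "mel": 45.0, "stft": 1.0}  # assumed values

total = sum(weights[name] * value for name, value in losses.items())
print(float(total))
```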
Table 1: Evaluation metric results on the LJSpeech dataset validation split. MOS is reported with a 95% confidence interval (CI).
Model cFSD (↓) F0 RMSE (↓) XWLM (↑) MOS (CI) (↑)
Recordings - - - 4.25 (±0.10)
VITS [9] 0.254 0.042 ± 0.024 0.985 ± 0.006 3.81 (±0.14)
FastSpeech2 + HiFi-GAN (JETS [10]) 0.212 0.041 ± 0.058 0.979 ± 0.011 4.01 (±0.13)
LightSpeech + MB-MelGAN+ (cascade) 0.248 0.029 ± 0.028 0.968 ± 0.016 3.73 (±0.14)
LightSpeech + MB-MelGAN+ (LE2E) 0.167 0.033 ± 0.027 0.972 ± 0.017 3.79 (±0.14)
Training process. The model was trained for 1M steps with an effective batch size of 128 samples. We used the AdamW [27] optimizer with β1 = 0.8, β2 = 0.99 and an initial learning rate of 2 × 10^-4, together with an exponential decay scheduler that reduced the learning rate by a factor γ = 0.99 at each training epoch and a weight decay penalty of 10^-2.
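These settings map directly onto standard PyTorch components; the sketch below wires up AdamW and a per-epoch exponential decay for a stand-in model, reflecting our reading of the schedule described above.

```python
# Sketch: AdamW + per-epoch exponential learning-rate decay as described above.
import torch

model = torch.nn.Linear(128, 128)                       # stand-in for the LE2E generator
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-4,                                            # initial learning rate
    betas=(0.8, 0.99),
    weight_decay=1e-2,                                  # weight decay penalty
)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

for epoch in range(3):                                  # illustrative loop; real training runs for 1M steps
    for _ in range(10):                                 # batches within the epoch
        optimizer.zero_grad()
        loss = model(torch.randn(128, 128)).pow(2).mean()
        loss.backward()
        optimizer.step()
    scheduler.step()                                    # decay once per epoch
    print(epoch, scheduler.get_last_lr())
```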
4.3. Evaluation metrics

To measure and compare the quality of the proposed model, we used a combination of three objective metrics and a subjective mean opinion score (MOS) evaluation. For the objective evaluation, we first assessed signal quality using the conditional Fréchet Speech Distance (cFSD): we generated activation distributions for both the recordings and the synthesized samples using a pre-trained XLSR-53 [28] wav2vec 2.0 [29] model and compared the two distributions with the Fréchet distance. Second, to assess intonation fidelity, we computed the root mean squared error (RMSE) of the fundamental frequency (F0) between the predicted waveforms and the recordings; a lower RMSE indicates higher F0 fidelity. Third, we evaluated speaker similarity between the recordings and the generated samples as the mean cosine distance between speaker embeddings extracted with the pre-trained XVector head of WavLM [30] (XWLM). For the MOS evaluation, we recruited 100 US English speakers on the crowd-sourcing platform ClickWorker to rate the audio quality of 20 random samples from the test set. Each listener evaluated 8 test cases and rated each sample on a scale from 1 (very low quality) to 5 (very high quality). We collected a total of 40 ratings per sample to assess the subjective quality of the synthesized speech.
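For reference, the Fréchet distance underlying cFSD can be computed from two sets of embeddings as sketched below; the specific wav2vec 2.0 layer, pooling and conditioning used for cFSD are not specified here, and the inputs are placeholders.

```python
# Sketch: Fréchet distance between two sets of embeddings (as used by FID-like
# metrics). Inputs would be wav2vec 2.0 activations for recordings vs. synthesis.
import numpy as np
from scipy.linalg import sqrtm


def frechet_distance(x: np.ndarray, y: np.ndarray) -> float:
    """x, y: (n_samples, dim) activation matrices."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):          # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))


real = np.random.randn(200, 64)           # placeholder activations for recordings
fake = np.random.randn(200, 64) + 0.1     # placeholder activations for synthesized audio
print(frechet_distance(real, fake))
```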
Vocoder ablation study. To show that the proposed loss objective has a positive impact on our standalone neural vocoder architecture, we trained the Multi-Band MelGAN model both with its original loss objective [8] and with the proposed loss described in Section 3.3; we refer to the latter as Multi-Band MelGAN+. Both models were trained on the same LJSpeech dataset and evaluated on the re-synthesis task of generating waveform signals from the test split. Following the original training paradigm, we pre-trained the generator for 200K steps and then trained the whole architecture for 1M steps for both models, using the Adam [31] optimizer with a learning rate of 10^-4 and a batch size of 128. The subjective MOS results in Table 2 clearly show that the proposed loss objective yields better waveform quality than the original training objective, without any change to the model architecture.

Table 2: MOS comparison between recordings and the same vocoder model (MB-MelGAN) trained with different training paradigms.

Model              MOS (CI) (↑)
Recordings         4.24 (±0.13)
MB-MelGAN+         4.02 (±0.14)
MB-MelGAN [8]      3.59 (±0.14)
Model comparison. We compared the proposed LE2E architecture against two state-of-the-art E2E models, VITS [9] and JETS [10]. Both are obtained from the ESPNet [32] open-source implementation, which provides a checkpoint of each model trained on the LJSpeech dataset [26]. In addition, we trained the LightSpeech model to predict mel-spectrograms following the original implementation [7], in order to demonstrate that the proposed training paradigm improves on the traditional cascade approach. To do so, we generated predicted mel-spectrograms from it and fine-tuned the proposed MB-MelGAN+ vocoder for an extra 200K steps to mitigate the domain mismatch at text-to-speech inference. Table 1 summarizes the comparison results, while Table 3 presents memory consumption and computational complexity. LE2E not only performs slightly better than the cascade method with the exact same architecture, but also simplifies the training process by eliminating the need for two independent trainings and an additional fine-tuning step. Compared to state-of-the-art E2E models, our model achieves slightly lower metrics, but it is much smaller and faster at inference time. Specifically, LE2E is 90% smaller and 10× faster than JETS while reporting marginally inferior metrics.

Table 3: Model comparison in terms of memory consumption and computational complexity on an Nvidia A100 GPU.

Model        Parameters    Inference time
VITS [9]     29.36M        0.0814 (±0.0304)
JETS [10]    40.94M        0.0765 (±0.0206)
LE2E         3.71M         0.0084 (±0.0480)

5. Conclusions and future work

We proposed a lightweight end-to-end text-to-speech (LE2E) architecture that achieves results comparable to VITS and slightly worse than JETS while being significantly smaller and faster, making it suitable for on-device applications in low-resource scenarios. The proposed training paradigm improves existing vocoder architectures and enables the training of a lightweight E2E-TTS system, replacing the traditional cascade approach and simplifying the training process to a single step. Future research could extend our findings to multi-speaker and/or multi-lingual use cases, as well as explore new discriminator architectures for lightweight TTS models.
6. References

[1] X. Tan, T. Qin, F. Soong, and T.-Y. Liu, "A survey on neural speech synthesis," arXiv preprint arXiv:2106.15561, 2021.
[2] Y.-C. Wu, P. L. Tobing, K. Yasuhara, N. Matsunaga, Y. Ohtani, and T. Toda, "A Cyclical Post-Filtering Approach to Mismatch Refinement of Neural Vocoder for Text-to-Speech Systems," in Proc. Interspeech 2020, 2020, pp. 3540–3544.
[3] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural tts synthesis by conditioning wavenet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[4] J. Kim, S. Kim, J. Kong, and S. Yoon, "Glow-tts: A generative flow for text-to-speech via monotonic alignment search," Advances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020.
[5] J. Kong, J. Kim, and J. Bae, "Hifi-gan: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
[6] X. Yuan, Y. Feng, M. Ye, C. Tuo, and M. Zhang, "Adavocoder: Adaptive vocoder for custom voice," in Proc. Interspeech 2022, 2022, pp. 21–25.
[7] R. Luo, X. Tan, R. Wang, T. Qin, J. Li, S. Zhao, E. Chen, and T.-Y. Liu, "Lightspeech: Lightweight and fast text to speech with neural architecture search," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5699–5703.
[8] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, "Multi-band melgan: Faster waveform generation for high-quality text-to-speech," in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 492–498.
[9] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in International Conference on Machine Learning. PMLR, 2021, pp. 5530–5540.
[10] D. Lim, S. Jung, and E. Kim, "JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech," in Proc. Interspeech 2022, 2022, pp. 21–25.
[11] R. J. Weiss, R. Skerry-Ryan, E. Battenberg, S. Mariooryad, and D. P. Kingma, "Wave-tacotron: Spectrogram-free end-to-end text-to-speech synthesis," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5679–5683.
[12] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "Fastspeech 2: Fast and high-quality end-to-end text to speech," in International Conference on Learning Representations, 2021.
[13] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel wavegan: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203.
[14] X. Tan, J. Chen, H. Liu, J. Cong, C. Zhang, Y. Liu, X. Wang, Y. Leng, Y. Yi, L. He et al., "Naturalspeech: End-to-end text to speech synthesis with human-level quality," arXiv preprint arXiv:2205.04421, 2022.
[15] J. Vainer and O. Dušek, "Speedyspeech: Efficient neural speech synthesis," arXiv preprint arXiv:2008.03802, 2020.
[16] X. Zhou, Z. Zhou, and X. Shi, "Fch-tts: Fast, controllable and high-quality non-autoregressive text-to-speech synthesis," in 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022, pp. 01–08.
[17] H.-K. Nguyen, K. Jeong, S.-Y. Um, M.-J. Hwang, E. Song, and H.-G. Kang, "Litetts: A lightweight mel-spectrogram-free text-to-wave synthesizer based on generative adversarial networks," in Interspeech, 2021, pp. 3595–3599.
[18] S. Achanta, A. Antony, L. Golipour, J. Li, T. Raitio, R. Rasipuram, F. Rossi, J. Shi, J. Upadhyay, D. Winarsky et al., "On-device neural speech synthesis," in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 1155–1161.
[19] M. Kawamura, Y. Shirahata, R. Yamamoto, and K. Tachibana, "Lightweight and high-fidelity end-to-end text-to-speech with multi-band generation and inverse short-time fourier transform," in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
[20] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei et al., "Durian: Duration informed attention network for multimodal synthesis," arXiv preprint arXiv:1909.01700, 2019.
[21] S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, "Bigvgan: A universal neural vocoder with large-scale training," arXiv preprint arXiv:2206.04658, 2022.
[22] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[23] G. Comini, G. Huybrechts, M. S. Ribeiro, A. Gabrys, and J. Lorenzo-Trueba, "Low-data? no problem: low-resource, language-agnostic conversational text-to-speech via f0-conditioned data augmentation," arXiv preprint arXiv:2207.14607, 2022.
[24] K. Qian, Z. Jin, M. Hasegawa-Johnson, and G. J. Mysore, "F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6284–6288.
[25] K. Kumar, R. Kumar, T. De Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "Melgan: Generative adversarial networks for conditional waveform synthesis," Advances in Neural Information Processing Systems, vol. 32, 2019.
[26] K. Ito and L. Johnson, "The lj speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[27] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7
[28] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, "fairseq: A fast, extensible toolkit for sequence modeling," arXiv preprint arXiv:1904.01038, 2019.
[29] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
[30] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., "Wavlm: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[31] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2014.
[32] T. Hayashi, R. Yamamoto, T. Yoshimura, P. Wu, J. Shi, T. Saeki, Y. Ju, Y. Yasuda, S. Takamichi, and S. Watanabe, "Espnet2-tts: Extending the edge of tts research," arXiv preprint arXiv:2110.07840, 2021.