


12th ISCA Speech Synthesis Workshop (SSW2023)
26-28 August 2023, Grenoble, France

Lightweight End-to-end Text-to-speech Synthesis for low resource on-device applications

Biel Tura Vecino, Adam Gabryś, Daniel Matwicki, Andrzej Pomirski, Tom Iddon,
Marius Cotescu, Jaime Lorenzo-Trueba

Alexa AI
[email protected]

Abstract

Recent works have shown that modelling raw waveform directly from text in an end-to-end (E2E) fashion produces more natural-sounding speech than traditional neural text-to-speech (TTS) systems based on a cascade or two-stage approach. However, current E2E state-of-the-art models are computationally complex and memory-consuming, making them unsuitable for real-time offline on-device applications in low-resource scenarios. To address this issue, we propose a Lightweight E2E-TTS (LE2E) model that generates high-quality speech while requiring minimal computational resources. We evaluate the proposed model on the LJSpeech dataset and show that it achieves state-of-the-art performance while being up to 90% smaller in terms of model parameters and 10× faster in real-time factor. Furthermore, we demonstrate that the proposed E2E training paradigm achieves better quality compared to an equivalent architecture trained in a two-stage approach. Our results suggest that LE2E is a promising approach for developing real-time, high-quality, low-resource TTS for on-device applications.

Index Terms: speech synthesis, text-to-speech, end-to-end, on-device.

1. Introduction

Text-to-speech (TTS) technology has come a long way in recent years, with state-of-the-art (SOTA) models generating highly realistic and natural-sounding speech [1]. Given its success, TTS technology is now widely used in many different applications, either with cloud-based online connections or offline synthesis. However, as the demand for on-device, real-time speech synthesis grows, offline TTS systems are becoming increasingly important. With the rise of smart devices and the Internet of Things (IoT), there is a need for TTS systems that can function without a remote server connection, providing users with instant and reliable access to speech synthesis. Offline on-device TTS systems are particularly useful in scenarios where internet access is limited or unreliable, or where privacy concerns are critical to the application. Hence, the development of offline TTS systems is essential to fulfill the requirements of diverse applications.

TTS models are typically composed of two components: an acoustic model that predicts acoustic units from the input text, and a vocoder model that generates the speech waveform from these acoustic features. However, the two-stage approach of training separate models leads to a mismatch between the acoustic features used during training and those used during inference [2], resulting in a degradation of synthesis quality. This occurs because the acoustic features predicted by the acoustic model, usually in the form of a mel-spectrogram [3, 4], may not precisely match the ones used in the training process of the vocoder. To overcome this issue, a fine-tuning process is applied to a pre-trained vocoder model on predicted acoustic features [5, 6]. This approach requires different sets of hyperparameters and training/fine-tuning procedures for each model, which can lead to suboptimal performance and a more complex pipeline.

In this paper, we propose a new end-to-end text-to-speech (E2E-TTS) model based on joint training of a lightweight acoustic model and vocoder in a single efficient architecture, enabling the benefits of end-to-end speech modelling to be applied to on-device TTS systems. Our proposed model is significantly smaller than other E2E solutions while maintaining comparable performance, making our approach more practical and accessible for low-resource scenarios. The main contributions of our work are as follows: 1) we present the Lightweight E2E-TTS (LE2E) model, based on a joint training paradigm of LightSpeech [7] and Multi-Band MelGAN [8], which outperforms its originally designed two-step cascade approach while only requiring a single joint training scheme; 2) we introduce an upgraded loss objective based on recent GAN speech discriminators, which we show to be effective not only on a known neural vocoder architecture but also in an E2E-TTS system; and 3) we show that the proposed model achieves a mean opinion score (MOS) of 3.79 on the LJSpeech dataset, on par with VITS [9] and just below JETS [10], while being much more memory-efficient and faster, and thus suitable for offline on-device applications.

2. Related work

Several recent studies have proposed removing the mel-spectrogram as an intermediate representation for E2E-TTS synthesis [11, 9]. FastSpeech2 [12] uses Parallel WaveGAN [13] to synthesize speech directly from text, while VITS [9] and NaturalSpeech [14] propose flow-based TTS systems trained jointly with the HiFi-GAN [5] vocoder to generate waveforms from text sequences. JETS [10] improves upon FastSpeech2 by jointly training it with the HiFi-GAN vocoder. Both JETS and VITS have shown impressive results, but they are not suitable for on-device, low-resource, real-time synthesis due to their high computational requirements. In contrast, low-memory-footprint TTS systems like LightSpeech [7], SpeedySpeech [15] and FCH-TTS [16] strive to minimize computational cost, but operate in a cascade approach and require specific vocoder architectures. LiteTTS [17] is a lightweight mel-spectrogram-free TTS system that achieved competitive results compared to state-of-the-art E2E models, but employs the memory-heavy HiFi-GAN as vocoder. Low-resource E2E speech synthesis has also been explored in works like [18], which proposes an autoregressive model, or [19], which presents a lightweight flow-based E2E architecture for on-device applications.

Figure 1: LE2E model architecture and training discriminators.

3. Methodology

The proposed model is shown in Figure 1. LE2E follows the typical GAN-based setup of a generator (G), which includes an acoustic latent encoder and an acoustic decoder, and a set of discriminators (D) divided into two groups: multi-period discriminators (MPD) and multi-resolution discriminators (MRD).

3.1. Generator

The waveform generator consists of two components, an acoustic latent model and a neural vocoder, which are concatenated into a jointly trained single architecture. The acoustic latent model is inspired by LightSpeech [7], but instead of predicting mel-spectrograms, it is trained to produce unsupervised acoustic latents that are used as input by the vocoder. The model takes phoneme and positional embeddings and outputs latent frame-level up-sampled acoustic embeddings. It is divided into three components: a text encoder (E), a variance adaptor (V) and an acoustic decoder (D). The text encoder generates position-aware phoneme embeddings through a stack of transformer layers. These are then fed into the variance adaptor, which comprises a duration predictor and a pitch predictor. The duration predictor takes the output of the text encoder and predicts phoneme-level durations, which are used to up-sample the phoneme embeddings. The up-sampled phoneme embeddings are then fed to the pitch predictor, which is trained to predict frame-level pitch latents that are added to the up-sampled phoneme embeddings. Finally, the output of the variance adaptor is fed to the acoustic decoder, which follows the exact same stack of transformer layers as the phoneme encoder. The vocoder model generates waveform signals from the latent embeddings produced by the acoustic model. It is based on Multi-Band MelGAN [8]: it takes the intermediate acoustic latents and upsamples them to generate the waveform. It is composed of a stack of up-sampling blocks, each consisting of a transposed convolution followed by a set of residual blocks with dilated convolutions that increase the receptive field. The proposed vocoder architecture follows the method in [20] and generates different waveform sub-bands that are then combined into the final output waveform through a Pseudo Quadrature Mirror Filter bank (PQMF).

3.2. Discriminators

State-of-the-art E2E-TTS systems use multiple discriminators that guide the generator in synthesizing a coherent waveform while minimizing perceptual artifacts that can be easily distinguished by the human ear. We adopt the set of discriminators proposed in BigVGAN [21], which includes a multi-period discriminator (MPD) and a multi-resolution discriminator (MRD). Each discriminator comprises several sub-discriminators that operate on different resolution windows of the waveform. The MPD reshapes the predicted waveform into 2D representations with varying heights and widths to separately capture multiple periodic structures. The MRD, on the other hand, is composed of several sub-discriminators that operate on multiple linear spectrograms with different short-time Fourier transform (STFT) resolutions.

3.3. Training objectives

The training objective of LE2E includes losses applied to the duration predictor, the pitch predictor and the waveform, combining GAN losses with regression losses in the form of a power loss and a multi-resolution STFT loss.

3.3.1. Duration loss

The acoustic latent model takes the hidden representation of the input phonemes and predicts the frame-level duration of each of them on a logarithmic scale. The duration predictor is optimized with a mean squared error (MSE) loss, L_dur, between predicted durations and oracle durations extracted by an external aligner based on the Kaldi Speech Recognition Toolkit [22].

3.3.2. Pitch loss

Pitch prediction is typically achieved through a regression task that estimates the exact pitch value [12, 23]. However, due to the high variability of ground-truth pitch contours, we replace the regression task with cross-entropy density modelling. To accomplish this, we follow [24] and apply a 256-bin quantization to the standardized pitch signal, followed by a cross-entropy loss, L_f0, on top of the softmaxed predicted pitch logits.
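The two predictor objectives can be sketched as follows: an MSE on log-durations and a cross-entropy over 256 quantized pitch bins. The bin range assumed for the standardized pitch and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the duration and pitch losses described above (shapes and bin range are assumptions).
import torch
import torch.nn.functional as F

def duration_loss(pred_log_dur, oracle_dur):
    """MSE between predicted log-durations and oracle durations, compared in the log domain."""
    # oracle_dur: integer frame counts produced by the external aligner
    return F.mse_loss(pred_log_dur, torch.log(oracle_dur.float() + 1e-8))

def quantize_pitch(std_pitch, n_bins=256, lo=-4.0, hi=4.0):
    """Map a standardized pitch contour to integer bin indices in [0, n_bins - 1]."""
    edges = torch.linspace(lo, hi, n_bins - 1)   # assumed range for a standardized signal
    return torch.bucketize(std_pitch, edges)

def pitch_loss(pitch_logits, std_pitch):
    """Cross-entropy between predicted pitch logits and quantized pitch targets."""
    targets = quantize_pitch(std_pitch)          # [T_frames]
    return F.cross_entropy(pitch_logits, targets)  # logits: [T_frames, 256]

# Example usage with random tensors
T = 120
l_dur = duration_loss(torch.randn(10), torch.randint(1, 12, (10,)))
l_f0 = pitch_loss(torch.randn(T, 256), torch.randn(T))
```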

3.3.3. Adversarial training

For simplicity of notation, we refer to all K discriminators from the MPD and MRD as the set of discriminators D_k.

GAN loss. The main adversarial training objective follows the least-squares loss formulation for non-vanishing gradient flows. The discriminator is trained to classify target samples as 1 and predicted samples as 0, while the generator is trained to fool the discriminator:

$$\mathcal{L}_D(x, \hat{x}) = \min_{D_k} \; \mathbb{E}_{x}\big[(D_k(x) - 1)^2\big] + \mathbb{E}_{\hat{x}}\big[D_k(\hat{x})^2\big] \qquad (1)$$

$$\mathcal{L}_G(\hat{x}) = \min_{G} \; \mathbb{E}_{\hat{x}}\big[(D_k(\hat{x}) - 1)^2\big] \qquad (2)$$

Feature matching loss. Used in [25], it is defined as a similarity metric measured by the difference in discriminator features between a ground-truth sample and a generated sample. It is computed as the L1 distance between target and predicted hidden intermediate discriminator features:

$$\mathcal{L}_{FM}(x, \hat{x}) = \mathbb{E}_{x,\hat{x}}\left[\sum_{i=1}^{T} \frac{1}{N_i} \left\lVert D_k^{(i)}(x) - D_k^{(i)}(\hat{x}) \right\rVert_1\right] \qquad (3)$$

3.3.4. Reconstruction losses

Applying a reconstruction loss to GAN models helps to generate realistic results [13]. We utilize two widely used reconstruction losses as auxiliary objectives to the GAN-based training.

Multi-resolution STFT loss. This loss is defined as the sum of the spectral convergence L_sc and the STFT magnitude loss L_mag between the predicted and target linear STFT spectrograms, ŝ and s:

$$\mathcal{L}_{sc}(s, \hat{s}) = \frac{\lVert s - \hat{s} \rVert_F}{\lVert s \rVert_F}, \qquad \mathcal{L}_{mag}(s, \hat{s}) = \frac{1}{S}\,\lVert \log s - \log \hat{s} \rVert_1 \qquad (4)$$

$$\mathcal{L}_{STFT}(x, \hat{x}) = \frac{1}{M} \sum_{m=1}^{M} \mathbb{E}_{x,\hat{x}}\big[\mathcal{L}_{sc}(s, \hat{s}) + \mathcal{L}_{mag}(s, \hat{s})\big] \qquad (5)$$

where ‖·‖_F denotes the Frobenius norm and ‖·‖_1 the L1 norm, S refers to the total number of values across the time and channel dimensions of the linear spectrogram, and M denotes the number of different STFT resolutions, which coincide with the different inputs of the MRD. Note that, following [8], we apply the multi-resolution STFT loss to both the full-band and the sub-band predictions. The final multi-resolution STFT loss is therefore:

$$\mathcal{L}_{STFT}(x, \hat{x}) = \frac{1}{2}\Big(\mathcal{L}_{STFT}^{full}(x, \hat{x}) + \mathcal{L}_{STFT}^{sub}(x, \hat{x})\Big) \qquad (6)$$

Mel-spectrogram loss. In addition to the multi-resolution STFT loss, we also incorporate a mel-spectrogram loss, also known as power loss, on the full-band prediction to improve training stability [21, 5]. It is defined as the L1 norm between the predicted and target mel-spectrograms, m̂ and m, extracted with the same parameters as in [5]:

$$\mathcal{L}_{mel}(x, \hat{x}) = \mathbb{E}_{x,\hat{x}}\big[\lVert m - \hat{m} \rVert_1\big] \qquad (7)$$

3.3.5. Total loss

Summing all the described loss functions, we obtain the final loss L for the generator in the joint E2E training of the proposed architecture:

$$\mathcal{L} = \mathcal{L}_{dur} + \mathcal{L}_{f0} + \mathcal{L}_G + \lambda_{FM}\mathcal{L}_{FM} + \lambda_{mel}\mathcal{L}_{mel} + \lambda_{STFT}\mathcal{L}_{STFT} \qquad (8)$$

where we set λ_FM = 2, λ_mel = 5 and λ_STFT = 2.5. The architecture is optimized to minimize the total loss L together with the discriminator loss L_D in an adversarial training approach.
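To make Eqs. (1)–(8) concrete, the sketch below expresses the generator-side objectives with PyTorch primitives. The discriminator interface (sub-discriminators returning a score plus their intermediate feature maps), the precomputed mel-spectrograms and the tensor shapes are assumptions for illustration; the STFT resolutions reuse the MRD settings listed in Section 4.2.

```python
# Hedged sketch of the LE2E training objectives (Eqs. 1-8); interfaces and shapes
# are illustrative assumptions, not the authors' implementation.
import torch

# --- Least-squares adversarial and feature-matching losses, Eqs. (1)-(3) ---
def lsgan_d_loss(real_score, fake_score):
    """Discriminator: push real scores to 1 and generated scores to 0."""
    return ((real_score - 1.0) ** 2).mean() + (fake_score ** 2).mean()

def lsgan_g_loss(fake_score):
    """Generator: push generated scores towards 1."""
    return ((fake_score - 1.0) ** 2).mean()

def feature_matching_loss(real_feats, fake_feats):
    """L1 distance between intermediate discriminator features; .mean() plays the 1/N_i role."""
    return sum(torch.abs(r.detach() - f).mean() for r, f in zip(real_feats, fake_feats))

# --- Multi-resolution STFT loss, Eqs. (4)-(6); resolutions follow the MRD (Sec. 4.2) ---
RESOLUTIONS = [(1024, 120, 600), (2048, 240, 1200), (512, 50, 240)]  # (n_fft, hop, win)

def _stft_mag(x, n_fft, hop, win):
    window = torch.hann_window(win, device=x.device)
    return torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True).abs().clamp(min=1e-7)

def multi_res_stft_loss(x, x_hat):
    """Average over resolutions of spectral convergence + log-magnitude L1."""
    total = 0.0
    for n_fft, hop, win in RESOLUTIONS:
        s, s_hat = _stft_mag(x, n_fft, hop, win), _stft_mag(x_hat, n_fft, hop, win)
        sc = torch.sqrt(torch.sum((s - s_hat) ** 2)) / torch.sqrt(torch.sum(s ** 2))
        mag = torch.mean(torch.abs(torch.log(s) - torch.log(s_hat)))
        total = total + sc + mag
    return total / len(RESOLUTIONS)

def full_and_subband_stft_loss(full, full_hat, sub, sub_hat):
    """Eq. (6): average of full-band and sub-band multi-resolution STFT losses.
    sub/sub_hat: [B, 4, T/4] sub-band signals, flattened so each band is treated as a waveform."""
    return 0.5 * (multi_res_stft_loss(full, full_hat)
                  + multi_res_stft_loss(sub.flatten(0, 1), sub_hat.flatten(0, 1)))

# --- Mel-spectrogram (power) loss, Eq. (7): plain L1 on precomputed mel-spectrograms ---
def mel_loss(mel, mel_hat):
    return torch.abs(mel - mel_hat).mean()

# --- Total generator loss, Eq. (8), with the weights quoted in the paper ---
LAMBDA_FM, LAMBDA_MEL, LAMBDA_STFT = 2.0, 5.0, 2.5

def total_generator_loss(l_dur, l_f0, l_gan_g, l_fm, l_mel, l_stft):
    return (l_dur + l_f0 + l_gan_g
            + LAMBDA_FM * l_fm + LAMBDA_MEL * l_mel + LAMBDA_STFT * l_stft)
```

The discriminators themselves are updated with L_D from Eq. (1) in the usual alternating GAN fashion.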
4. Experiments and results

4.1. Experimental setup

To report the results of the proposed model and to allow easy comparison of our architecture, we evaluate it on the widely used LJSpeech dataset [26]. LJSpeech consists of 13,100 pairs of text and speech data, amounting to approximately 24 hours of speech. We split the dataset into three parts: 12,900 samples for training and 100 samples each for validation and test.

4.2. Model details

Generator. The generator of LE2E is built upon two main components: the acoustic latent model and the vocoder model. The acoustic model uses a 4-block transformer phoneme encoder with self-attention. The dimension of the phoneme embeddings and the hidden size of the self-attention are 256. The kernel sizes of the separable convolutional layers within the transformer layers are [5, 25, 13, 9] respectively. The duration predictor is a 2-layer 1D separable convolutional neural network with kernel size 3. The pitch predictor is a 5-layer 1D separable convolutional neural network with kernel size 5, followed by a linear projection to a 256-dimensional space for the pitch logits. The decoder follows the same architecture as the encoder, with the kernel sizes of its separable convolutions set to [17, 21, 9, 13] respectively. As for the vocoder module, 300× upsampling is conducted through 3 upsampling layers with upsampling factors [3, 5, 5] respectively and a PQMF synthesis filter. The output channels of the upsampling layers are [192, 96, 48], and the transposed convolutions in the upsampling layers have kernel sizes [6, 10, 10] respectively. Each upsampling layer has 4 stacked residual blocks consisting of a 1D dilated convolution with kernel size 3, a LeakyReLU activation with slope 0.2, and a final 1D convolution with kernel size 1. The dilation factors of the dilated convolutions in the 4 residual blocks are [1, 3, 9, 27] respectively. The final 1D convolution has a kernel size of 7 and 4 output channels, which are combined through a carefully designed PQMF filter with 62 taps, β = 0.9 and a cutoff ratio of 0.1492.

Discriminator. The LE2E discriminators are divided into the MRD and MPD modules. Each module contains a set of sub-discriminators that use a stack of 2D convolutions followed by ReLU activations. In the MPD, the input waveform is first reshaped into a 2D signal by concatenating samples at each of the periods [2, 3, 5, 7, 11] with reflective padding. In the MRD, the input is a linear spectrogram with a variable number of fast Fourier transform (FFT) points, [1024, 2048, 512], with respective hop lengths of [120, 240, 50] and window lengths of [600, 1200, 240].
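The listed hyperparameters can be tied together in a compact structural sketch of the generator. Separable convolutions, the residual stacks inside each upsampling block and the PQMF synthesis filter are simplified away, and the padding values are chosen only to make the shapes work out, so this is an illustration of the layout rather than the authors' implementation.

```python
# Structural sketch of the LE2E generator following the sizes in Sec. 4.2 (simplified).
import torch
import torch.nn as nn

class LE2ESketch(nn.Module):
    def __init__(self, n_phonemes=100, dim=256, pitch_bins=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, dim)
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=2, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)        # text encoder E
        self.dur_predictor = nn.Conv1d(dim, 1, kernel_size=3, padding=1)
        self.pitch_predictor = nn.Conv1d(dim, pitch_bins, kernel_size=5, padding=2)
        self.pitch_proj = nn.Linear(pitch_bins, dim)
        dec = nn.TransformerEncoderLayer(d_model=dim, nhead=2, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec, num_layers=4)        # acoustic decoder D
        # Vocoder: upsampling factors [3, 5, 5] (x75) with channels [192, 96, 48],
        # then 4 output sub-bands; a PQMF bank (not shown) would give the final x300 hop.
        self.vocoder = nn.Sequential(
            nn.ConvTranspose1d(dim, 192, 6, stride=3, padding=2, output_padding=1),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(192, 96, 10, stride=5, padding=3, output_padding=1),
            nn.LeakyReLU(0.2),
            nn.ConvTranspose1d(96, 48, 10, stride=5, padding=3, output_padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv1d(48, 4, 7, padding=3),                            # 4 PQMF sub-bands
        )

    def forward(self, phonemes, durations):
        # phonemes: [T_phon] int64 ids, durations: [T_phon] integer frame counts
        x = self.encoder(self.embed(phonemes).unsqueeze(0))            # [1, T_phon, 256]
        log_dur = self.dur_predictor(x.transpose(1, 2))                # [1, 1, T_phon]
        x = torch.repeat_interleave(x, durations, dim=1)               # frame-level latents
        pitch_logits = self.pitch_predictor(x.transpose(1, 2))         # [1, 256, T_frames]
        x = x + self.pitch_proj(pitch_logits.transpose(1, 2).softmax(-1))
        latents = self.decoder(x)                                      # [1, T_frames, 256]
        return self.vocoder(latents.transpose(1, 2)), log_dur, pitch_logits

bands, log_dur, pitch_logits = LE2ESketch()(torch.randint(0, 100, (8,)),
                                            torch.randint(2, 8, (8,)))
print(bands.shape)   # [1, 4, 75 * sum(durations)]
```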

Table 1: Evaluation metric results on the LJSpeech dataset validation split. MOS is reported with a 95% confidence interval (CI).

Model | cFSD (↓) | F0 RMSE (↓) | XWLM (↑) | MOS (CI) (↑)
Recordings | - | - | - | 4.25 (±0.10)
VITS [9] | 0.254 | 0.042 ± 0.024 | 0.985 ± 0.006 | 3.81 (±0.14)
FastSpeech2 + HiFi-GAN (JETS [10]) | 0.212 | 0.041 ± 0.058 | 0.979 ± 0.011 | 4.01 (±0.13)
LightSpeech + MB-MelGAN+ (cascade) | 0.248 | 0.029 ± 0.028 | 0.968 ± 0.016 | 3.73 (±0.14)
LightSpeech + MB-MelGAN+ (LE2E) | 0.167 | 0.033 ± 0.027 | 0.972 ± 0.017 | 3.79 (±0.14)

Training process. The model was trained for 1M steps with an effective batch size of 128 samples. We used the AdamW [27] optimizer with β1 = 0.8, β2 = 0.99 and an initial learning rate of 2 × 10⁻⁴. We used an exponential decay scheduler that reduced the learning rate by a factor γ = 0.99 at each training epoch, and a weight decay penalty factor of 10⁻².
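A minimal sketch of this optimization setup with standard PyTorch classes follows; the model stand-in and the loop structure are assumptions, and stepping the scheduler once per epoch mirrors the description above.

```python
# Optimizer and learning-rate schedule as described in "Training process" (sketch).
import torch

model = torch.nn.Linear(10, 10)          # stand-in for the LE2E generator parameters
optimizer = torch.optim.AdamW(model.parameters(),
                              lr=2e-4,
                              betas=(0.8, 0.99),
                              weight_decay=1e-2)
# Exponential decay by gamma = 0.99, stepped once per training epoch.
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.99)

for epoch in range(3):                   # illustrative loop; the paper trains for 1M steps
    # ... run the training steps for this epoch with an effective batch size of 128 ...
    scheduler.step()
```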
4.3. Evaluation metrics

To measure and compare the quality of the proposed model, we used a combination of three objective metrics and a subjective mean opinion score (MOS) evaluation. For the objective evaluation, we first assessed signal quality using the conditional Fréchet Speech Distance (cFSD). To compute cFSD, we generated activation distributions for both the recordings and the synthesized samples using a pre-trained XLSR-53 [28] wav2vec 2.0 [29] model and then compared the distributions using the Fréchet distance. Second, to assess intonation fidelity, we computed the root mean squared error (RMSE) of the fundamental frequency (F0) between the predicted waveforms and the recordings; a lower RMSE indicates higher F0 fidelity. Third, we evaluated speaker similarity between the recordings and the generated samples using the mean cosine distance between speaker embeddings extracted with the pre-trained x-vector head of WavLM [30] (XWLM). For the MOS evaluation, we gathered 100 US English speakers from the crowd-sourcing platform ClickWorker to evaluate the audio quality of 20 random samples from the test set. Each listener evaluated 8 test cases and rated each sample on a scale of 1 (very low quality) to 5 (very high quality). We collected a total of 40 ratings per sample to assess the subjective quality of the synthesized speech.
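As an illustration of the objective metrics, the sketch below computes a Fréchet distance between Gaussian fits of two activation sets (as used for cFSD), the F0 RMSE, and a mean cosine similarity between speaker embeddings. Feature extraction with XLSR-53/WavLM is assumed to have happened upstream, and this is not the authors' evaluation code.

```python
# Sketch of the three objective metrics on precomputed features (illustrative only).
import numpy as np
from scipy import linalg

def frechet_distance(acts_real, acts_fake):
    """Fréchet distance between Gaussian fits of two activation sets of shape [N, D]."""
    mu1, mu2 = acts_real.mean(0), acts_fake.mean(0)
    s1 = np.cov(acts_real, rowvar=False)
    s2 = np.cov(acts_fake, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(s1 + s2 - 2.0 * covmean))

def f0_rmse(f0_pred, f0_ref):
    """Root mean squared error between two aligned F0 contours."""
    return float(np.sqrt(np.mean((f0_pred - f0_ref) ** 2)))

def mean_cosine_similarity(emb_pred, emb_ref):
    """Mean cosine similarity between paired speaker embeddings of shape [N, D]."""
    num = np.sum(emb_pred * emb_ref, axis=1)
    den = np.linalg.norm(emb_pred, axis=1) * np.linalg.norm(emb_ref, axis=1)
    return float(np.mean(num / den))
```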

4.4. Results

Vocoder ablation study. To show that the proposed loss objective has a positive impact on our standalone neural vocoder architecture, we trained the Multi-Band MelGAN model both with the original loss objective [8] and with the proposed loss described in Section 3.3. We refer to the latter as Multi-Band MelGAN+. Both architectures were trained on the same LJSpeech dataset and evaluated on the re-synthesis task of generating waveform signals from the test split. Following the original training paradigm, we pre-trained the generator for 200K steps and then trained the whole architecture for 1M steps for both models. We used the Adam [31] optimizer with a learning rate of 10⁻⁴ and a batch size of 128. The subjective MOS in Table 2 clearly shows that the proposed loss objective generates better-quality waveforms than the original training objective, without any change to the model architecture.

Table 2: MOS comparison between recordings and the same vocoder model (MB-MelGAN) under different training paradigms.

Model | MOS (CI) (↑)
Recordings | 4.24 (±0.13)
MB-MelGAN+ | 4.02 (±0.14)
MB-MelGAN [8] | 3.59 (±0.14)

Model comparison. We compared the proposed LE2E architecture against two state-of-the-art E2E models: VITS [9] and JETS [10]. Both models were obtained from the ESPNet [32] open-source implementation, in which a checkpoint of each model trained on the LJSpeech dataset [26] is available. In addition, we trained the LightSpeech model to predict mel-spectrograms following the original implementation [7] to demonstrate that the proposed training paradigm improves over the traditional cascade approach. To do so, we generated predicted mel-spectrograms from it and fine-tuned the proposed MB-MelGAN+ vocoder for an extra 200K steps to mitigate the domain mismatch in text-to-speech inference. Table 1 summarizes the comparison results, while Table 3 presents the memory consumption and computational complexity. LE2E not only performs slightly better than the cascade method with the exact same architecture, but it also simplifies the training process by eliminating the need for two independent trainings and an additional fine-tuning step. Compared to state-of-the-art E2E models, our model achieves slightly lower metrics, but it has a much smaller size and faster inference time. Specifically, LE2E is 90% smaller and 10× faster than JETS while reporting marginally inferior metrics.

Table 3: Model comparison in terms of memory consumption and computational complexity on an Nvidia A100 GPU.

Model | Params. | RTF (↓)
VITS [9] | 29.36M | 0.0814 (±0.0304)
JETS [10] | 40.94M | 0.0765 (±0.0206)
LE2E | 3.71M | 0.0084 (±0.0480)

5. Conclusions and future work

We proposed a lightweight end-to-end text-to-speech (LE2E) architecture that achieves comparable results to VITS and slightly worse performance than JETS while being significantly smaller and faster, making it suitable for on-device applications in low-resource scenarios. Our proposed training paradigm improves existing vocoder architectures and enables the training of a lightweight E2E-TTS system, which replaces the traditional cascade approach and simplifies the training process to a single step. Future research could expand our findings to multi-speaker and/or multi-lingual use cases, as well as further explore new discriminator architectures for lightweight TTS models.

6. References

[1] X. Tan, T. Qin, F. Soong, and T.-Y. Liu, "A survey on neural speech synthesis," arXiv preprint arXiv:2106.15561, 2021.
[2] Y.-C. Wu, P. L. Tobing, K. Yasuhara, N. Matsunaga, Y. Ohtani, and T. Toda, "A cyclical post-filtering approach to mismatch refinement of neural vocoder for text-to-speech systems," in Proc. Interspeech 2020, 2020, pp. 3540–3544.
[3] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[4] J. Kim, S. Kim, J. Kong, and S. Yoon, "Glow-TTS: A generative flow for text-to-speech via monotonic alignment search," Advances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020.
[5] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis," Advances in Neural Information Processing Systems, vol. 33, pp. 17022–17033, 2020.
[6] X. Yuan, Y. Feng, M. Ye, C. Tuo, and M. Zhang, "AdaVocoder: Adaptive vocoder for custom voice," in Proc. Interspeech 2022, 2022, pp. 21–25.
[7] R. Luo, X. Tan, R. Wang, T. Qin, J. Li, S. Zhao, E. Chen, and T.-Y. Liu, "LightSpeech: Lightweight and fast text to speech with neural architecture search," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5699–5703.
[8] G. Yang, S. Yang, K. Liu, P. Fang, W. Chen, and L. Xie, "Multi-band MelGAN: Faster waveform generation for high-quality text-to-speech," in 2021 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2021, pp. 492–498.
[9] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in International Conference on Machine Learning. PMLR, 2021, pp. 5530–5540.
[10] D. Lim, S. Jung, and E. Kim, "JETS: Jointly training FastSpeech2 and HiFi-GAN for end to end text to speech," in Proc. Interspeech 2022, 2022, pp. 21–25.
[11] R. J. Weiss, R. Skerry-Ryan, E. Battenberg, S. Mariooryad, and D. P. Kingma, "Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis," in 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 5679–5683.
[12] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," in International Conference on Learning Representations, 2021.
[13] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6199–6203.
[14] X. Tan, J. Chen, H. Liu, J. Cong, C. Zhang, Y. Liu, X. Wang, Y. Leng, Y. Yi, L. He et al., "NaturalSpeech: End-to-end text to speech synthesis with human-level quality," arXiv preprint arXiv:2205.04421, 2022.
[15] J. Vainer and O. Dušek, "SpeedySpeech: Efficient neural speech synthesis," arXiv preprint arXiv:2008.03802, 2020.
[16] X. Zhou, Z. Zhou, and X. Shi, "FCH-TTS: Fast, controllable and high-quality non-autoregressive text-to-speech synthesis," in 2022 International Joint Conference on Neural Networks (IJCNN). IEEE, 2022, pp. 1–8.
[17] H.-K. Nguyen, K. Jeong, S.-Y. Um, M.-J. Hwang, E. Song, and H.-G. Kang, "LiteTTS: A lightweight mel-spectrogram-free text-to-wave synthesizer based on generative adversarial networks," in Interspeech, 2021, pp. 3595–3599.
[18] S. Achanta, A. Antony, L. Golipour, J. Li, T. Raitio, R. Rasipuram, F. Rossi, J. Shi, J. Upadhyay, D. Winarsky et al., "On-device neural speech synthesis," in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 1155–1161.
[19] M. Kawamura, Y. Shirahata, R. Yamamoto, and K. Tachibana, "Lightweight and high-fidelity end-to-end text-to-speech with multi-band generation and inverse short-time Fourier transform," in 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023.
[20] C. Yu, H. Lu, N. Hu, M. Yu, C. Weng, K. Xu, P. Liu, D. Tuo, S. Kang, G. Lei et al., "DurIAN: Duration informed attention network for multimodal synthesis," arXiv preprint arXiv:1909.01700, 2019.
[21] S.-g. Lee, W. Ping, B. Ginsburg, B. Catanzaro, and S. Yoon, "BigVGAN: A universal neural vocoder with large-scale training," arXiv preprint arXiv:2206.04658, 2022.
[22] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., "The Kaldi speech recognition toolkit," in IEEE 2011 Workshop on Automatic Speech Recognition and Understanding. IEEE Signal Processing Society, 2011.
[23] G. Comini, G. Huybrechts, M. S. Ribeiro, A. Gabrys, and J. Lorenzo-Trueba, "Low-data? No problem: Low-resource, language-agnostic conversational text-to-speech via F0-conditioned data augmentation," arXiv preprint arXiv:2207.14607, 2022.
[24] K. Qian, Z. Jin, M. Hasegawa-Johnson, and G. J. Mysore, "F0-consistent many-to-many non-parallel voice conversion via conditional autoencoder," in 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 6284–6288.
[25] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brébisson, Y. Bengio, and A. C. Courville, "MelGAN: Generative adversarial networks for conditional waveform synthesis," Advances in Neural Information Processing Systems, vol. 32, 2019.
[26] K. Ito and L. Johnson, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset/, 2017.
[27] I. Loshchilov and F. Hutter, "Decoupled weight decay regularization," in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=Bkg6RiCqY7
[28] M. Ott, S. Edunov, A. Baevski, A. Fan, S. Gross, N. Ng, D. Grangier, and M. Auli, "fairseq: A fast, extensible toolkit for sequence modeling," arXiv preprint arXiv:1904.01038, 2019.
[29] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
[30] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022.
[31] D. Kingma and J. Ba, "Adam: A method for stochastic optimization," in International Conference on Learning Representations, 2014.
[32] T. Hayashi, R. Yamamoto, T. Yoshimura, P. Wu, J. Shi, T. Saeki, Y. Ju, Y. Yasuda, S. Takamichi, and S. Watanabe, "ESPnet2-TTS: Extending the edge of TTS research," arXiv preprint arXiv:2110.07840, 2021.
