
Low-Resource Text-to-Speech Synthesis

Abstract
Research on low-resource text-to-speech (TTS) synthesis is ongoing, particularly for languages with little available training data. In contrast to high-resource languages such as English, the majority of the world's languages, particularly indigenous or underrepresented ones, lack the data needed to train TTS models. Drawing on recent studies, this literature review examines state-of-the-art methods and open challenges in low-resource TTS. The survey covers methods including knowledge distillation, transfer learning, and dual transformation, along with neural architectures such as Glow-TTS, FastSpeech, and end-to-end models. We also discuss evaluation metrics, multilingual generalization, and vocoder choices. The survey highlights best practices for building scalable TTS systems in low-resource environments, with an emphasis on improving speech quality, training efficiency, and generalization to underrepresented languages.

1. Introduction

Text-to-speech (TTS) technology has evolved substantially over time, moving from concatenative and parametric methods to deep neural network-based systems. Neural TTS systems such as WaveNet and Tacotron have greatly increased the naturalness and quality of synthesized speech, but training them effectively requires large amounts of high-quality paired text and speech data. For most of the world's languages, such resources are scarce, which poses a serious obstacle to building high-quality TTS systems in low-resource environments.
This literature review concentrates on significant advances in low-resource TTS, where data-scarcity issues have been addressed by techniques such as transfer learning, pre-training, knowledge distillation, and dual transformation. Furthermore, neural architectures such as Glow-TTS, FastSpeech, and end-to-end models have shown promise in addressing the one-to-many mapping difficulty common in TTS, where a single text input can correspond to several valid speech realizations.

2. Low-Resource TTS Methods

2.1 Transfer Learning and Fine-Tuning


Transfer learning, in which models are pre-trained on high-resource languages and then fine-tuned on low-resource languages, is one of the most popular approaches in low-resource TTS. It reduces the quantity of data needed for training by leveraging the large datasets available for high-resource languages and transferring that knowledge to underrepresented ones. Xu et al. (2020) demonstrated the efficiency of this technique in their LRSpeech system, which pre-trains automatic speech recognition (ASR) and TTS models on rich-resource languages and fine-tunes them on low-resource languages [2]. By transferring the ability to learn the alignment between text and speech from high-resource languages, this technique improves performance noticeably.
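
To make the recipe concrete, the sketch below loads a (hypothetical) high-resource checkpoint, freezes the text front-end so the learned text-speech alignment is retained, and fine-tunes only the decoder on a small batch. The TinyTTS model, checkpoint path, and dummy data are illustrative assumptions, not LRSpeech's actual architecture.

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Stand-in for a text-to-mel acoustic model (illustrative only)."""

    def __init__(self, vocab_size: int = 256, hidden: int = 128, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_mels)

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.decoder(h)

model = TinyTTS()
# Step 1: load weights pre-trained on a high-resource language
# (hypothetical path; uncomment once a checkpoint exists).
# model.load_state_dict(torch.load("pretrained_high_resource.pt"))

# Step 2: freeze the text front-end, fine-tune the rest on low-resource data.
for param in list(model.embed.parameters()) + list(model.encoder.parameters()):
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.L1Loss()

tokens = torch.randint(0, 256, (4, 32))   # dummy low-resource token ids
target_mels = torch.randn(4, 32, 80)      # dummy target mel frames

for step in range(3):                     # a few illustrative fine-tuning steps
    optimizer.zero_grad()
    loss = criterion(model(tokens), target_mels)
    loss.backward()
    optimizer.step()
```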

2.2 Knowledge Distillation


Knowledge distillation has proven to be a successful technique for enhancing TTS systems with limited resources. In this method, a smaller "student" model learns to generalize more effectively from less data by using the soft targets produced by a larger "teacher" model trained on a larger dataset. The original FastSpeech applied this strategy to the one-to-many mapping problem in non-autoregressive models, using simplified outputs from an autoregressive teacher to train the student. FastSpeech 2 [3] improves on these earlier distillation strategies: by eliminating the intermediate distillation step and training directly on ground-truth data, it improves both voice quality and training efficiency.
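
A minimal distillation sketch in PyTorch follows. The teacher and student are generic stand-in networks rather than the FastSpeech architectures, and the 0.5 loss weights are an arbitrary illustration, not a published setting.

```python
import torch
import torch.nn as nn

n_mels = 80
# Hypothetical stand-ins: a larger "teacher" and a smaller "student".
teacher = nn.GRU(n_mels, n_mels, num_layers=2, batch_first=True)
student = nn.GRU(n_mels, n_mels, num_layers=1, batch_first=True)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
criterion = nn.L1Loss()

inputs = torch.randn(4, 50, n_mels)        # dummy upsampled text features
ground_truth = torch.randn(4, 50, n_mels)  # dummy reference mel frames

with torch.no_grad():                      # teacher provides soft targets
    soft_targets, _ = teacher(inputs)

pred, _ = student(inputs)
# Blend the distillation loss with the ground-truth loss.
loss = 0.5 * criterion(pred, soft_targets) + 0.5 * criterion(pred, ground_truth)
loss.backward()
optimizer.step()
```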

2.3 Dual Transformation and Cycle Consistency


Dual transformation is another cutting-edge strategy for addressing low-resource issues. Exploiting the duality between ASR and TTS, this method iteratively improves both models. In their LRSpeech system, Xu et al. (2020) use dual transformation to increase the accuracy of TTS by employing ASR to convert unpaired speech to text and vice versa [2]. Because unpaired data is much easier to obtain in many low-resource languages, this strategy has proven effective.
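
The loop below sketches one round of dual transformation at the pseudo-label level, following the idea in LRSpeech [2]. The StubASR/StubTTS classes and their transcribe/synthesize/train_on methods are hypothetical stand-ins so the example runs end to end.

```python
# Hypothetical stubs; real systems would be trained neural ASR/TTS models.
class StubASR:
    def transcribe(self, wav):
        return "pseudo transcript"

    def train_on(self, pairs):
        pass  # a real model would run gradient steps here

class StubTTS:
    def synthesize(self, text):
        return [0.0] * 16000  # one second of silent "audio"

    def train_on(self, pairs):
        pass

def dual_transformation_round(asr, tts, unpaired_speech, unpaired_text):
    # ASR -> TTS: transcribe unpaired speech into pseudo (text, speech) pairs.
    pseudo_text = [asr.transcribe(wav) for wav in unpaired_speech]
    tts.train_on(list(zip(pseudo_text, unpaired_speech)))
    # TTS -> ASR: synthesize unpaired text into pseudo (speech, text) pairs.
    pseudo_speech = [tts.synthesize(txt) for txt in unpaired_text]
    asr.train_on(list(zip(pseudo_speech, unpaired_text)))

asr, tts = StubASR(), StubTTS()
unpaired_speech = [[0.0] * 16000 for _ in range(8)]
unpaired_text = ["labas rytas"] * 8
for _ in range(3):  # each round lets the models bootstrap each other
    dual_transformation_round(asr, tts, unpaired_speech, unpaired_text)
```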

3. Neural Architectures for Low-Resource TTS

3.1 Non-Autoregressive Models


FastSpeech 2 [3] is a leading model for non-autoregressive TTS. It fixes the problems autoregressive models have with slow inference and robustness (such as word skipping). By adding variance predictors for duration, pitch, and energy, FastSpeech 2 eases the one-to-many mapping problem and increases model accuracy in situations with sparse data. Furthermore, its FastSpeech 2s variant can generate speech waveforms directly from text without intermediate mel-spectrograms, further streamlining the TTS pipeline and improving performance and scalability in low-resource settings.
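
The variance predictor is the architectural piece that carries the duration/pitch/energy conditioning. Below is a minimal PyTorch sketch loosely following the paper's two-layer Conv1D + LayerNorm design [3]; the layer sizes are illustrative, not the published configuration.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Predicts one scalar (duration, pitch, or energy) per input position."""

    def __init__(self, hidden: int = 256, kernel: int = 3, dropout: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (batch, time, hidden)
        h = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        h = self.dropout(self.norm1(h))
        h = torch.relu(self.conv2(h.transpose(1, 2))).transpose(1, 2)
        h = self.dropout(self.norm2(h))
        return self.proj(h).squeeze(-1)          # (batch, time)

encoder_out = torch.randn(2, 40, 256)            # dummy phoneme encodings
pitch_predictor = VariancePredictor()
predicted_pitch = pitch_predictor(encoder_out)   # one pitch value per phoneme
print(predicted_pitch.shape)                     # torch.Size([2, 40])
```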

3.2 Flow-Based Models: Glow-TTS


Glow-TTS [1] offers a flow-based generative alternative to other non-autoregressive models. By employing a monotonic alignment search procedure, it eliminates the need for an external aligner and lets the model map text inputs directly to mel-spectrograms. As a result, less data is needed for training and higher-quality speech synthesis can be achieved.
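
A simplified NumPy sketch of monotonic alignment search in the spirit of Glow-TTS [1]: dynamic programming finds the most likely monotonic, surjective alignment between text tokens and spectrogram frames given a log-likelihood matrix. The shapes and random input are illustrative, and the official implementation is considerably more optimized.

```python
import numpy as np

def monotonic_alignment_search(log_probs: np.ndarray) -> np.ndarray:
    """log_probs[i, j]: log-likelihood of frame j under text token i."""
    n_text, n_frames = log_probs.shape
    q = np.full((n_text, n_frames), -np.inf)   # best cumulative score
    q[0, 0] = log_probs[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):    # token index cannot exceed frame index
            stay = q[i, j - 1]                               # keep the same token
            advance = q[i - 1, j - 1] if i > 0 else -np.inf  # move to the next token
            q[i, j] = log_probs[i, j] + max(stay, advance)
    # Backtrack from the forced endpoint (last token at the last frame).
    alignment = np.zeros((n_text, n_frames), dtype=np.int64)
    i = n_text - 1
    for j in range(n_frames - 1, 0, -1):
        alignment[i, j] = 1
        if i > 0 and q[i - 1, j - 1] >= q[i, j - 1]:
            i -= 1
    alignment[i, 0] = 1
    return alignment

align = monotonic_alignment_search(np.random.randn(5, 12))
print(align.sum(axis=0))   # every frame maps to exactly one token
```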
3.3 End-to-End Models: VITS
End-to-end models like VITS [1] combine the benefits of flow-based models with neural vocoders such as HiFi-GAN. By synthesizing speech directly from text, VITS reduces the need for intensive feature engineering and enhances speech quality. The model has proven very helpful in settings with limited resources and sparse data, where models must generalize well across speakers and languages.
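
As a usage-level illustration (not from the surveyed papers), the open-source Coqui TTS toolkit distributes a pretrained VITS checkpoint, which shows how an end-to-end model goes from text straight to a waveform with no separate vocoder stage. The package name, model string, and API below are assumptions that may differ across versions.

```python
# pip install TTS   (Coqui TTS; assumed package and model name)
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/vits")
tts.tts_to_file(text="End-to-end synthesis needs no separate vocoder.",
                file_path="vits_sample.wav")
```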

4. Vocoder Selection for Low-Resource TTS

4.1 GAN-Based Vocoders: HiFi-GAN


HiFi-GAN [1] has been found to be a computationally efficient vocoder for TTS systems that can produce high-quality audio with a small amount of training data. By adding a multi-period discriminator, HiFi-GAN outperforms previous vocoders and is able to identify long-term dependencies in the waveform.
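
The multi-period idea is simple to sketch: each sub-discriminator reshapes the 1-D waveform into a 2-D grid whose width equals its period, so ordinary 2-D convolutions see periodic structure. The channel counts and layer depths below are illustrative, not HiFi-GAN's published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodDiscriminator(nn.Module):
    def __init__(self, period: int):
        super().__init__()
        self.period = period
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, (5, 1), stride=(3, 1), padding=(2, 0)),
            nn.LeakyReLU(0.1),
            nn.Conv2d(32, 64, (5, 1), stride=(3, 1), padding=(2, 0)),
            nn.LeakyReLU(0.1),
            nn.Conv2d(64, 1, (3, 1), padding=(1, 0)),
        )

    def forward(self, wav):                    # wav: (batch, 1, samples)
        b, c, t = wav.shape
        if t % self.period:                    # pad so length divides the period
            wav = F.pad(wav, (0, self.period - t % self.period), "reflect")
            t = wav.shape[-1]
        grid = wav.view(b, c, t // self.period, self.period)
        return self.convs(grid)                # real/fake score map

# The full discriminator runs several prime periods in parallel, e.g.:
discriminators = [PeriodDiscriminator(p) for p in (2, 3, 5, 7, 11)]
scores = [d(torch.randn(1, 1, 8000)) for d in discriminators]
```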

4.2 Diffusion-Based Vocoders: WaveGrad


WaveGrad [1] is a diffusion-based vocoder that trades slower inference for higher-quality speech synthesis. Although it requires more computing power, it iteratively refines a noisy input into a high-quality waveform, producing high-fidelity speech. Diffusion-based vocoders are therefore helpful in low-resource situations where output quality matters more than generation speed.
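
The sketch below shows this iterative refinement schematically: starting from Gaussian noise, a denoising network conditioned on the mel-spectrogram is applied repeatedly. The DenoiserStub network and the simplified update rule are placeholders, not WaveGrad's actual noise schedule or architecture.

```python
import torch
import torch.nn as nn

class DenoiserStub(nn.Module):
    """Stands in for WaveGrad's noise-prediction network."""

    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, noisy_wav, mel, step):
        # A real model would condition on `mel` and the noise level `step`.
        return self.net(noisy_wav)

denoiser = DenoiserStub()
mel = torch.randn(1, 80, 32)    # conditioning mel-spectrogram
wav = torch.randn(1, 1, 8000)   # start from pure Gaussian noise

num_steps = 6                   # more steps trade speed for quality
with torch.no_grad():
    for step in reversed(range(num_steps)):
        predicted_noise = denoiser(wav, mel, step)
        wav = wav - 0.5 * predicted_noise   # simplified denoising update
```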

5. Generalization Across Languages and Speakers

Multilingual and multi-speaker models make it possible to transfer learning across languages and speakers in low-resource TTS. Kumar et al. (2023) demonstrated this by training IndicTTS [1] models based on FastPitch on 13 Indian languages, enabling fast and accurate speech generation across a variety of languages. With less training data, this approach improves the scalability of TTS models and enables them to generalize across linguistic variation.
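
A common pattern behind such systems is a single shared backbone conditioned on learned language and speaker embeddings. The sketch below shows this generic pattern; it is not the IndicTTS/FastPitch implementation, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultilingualEncoder(nn.Module):
    def __init__(self, vocab=256, n_langs=13, n_speakers=32, hidden=128):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, hidden)
        self.lang_embed = nn.Embedding(n_langs, hidden)
        self.spk_embed = nn.Embedding(n_speakers, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, tokens, lang_id, spk_id):
        x = self.text_embed(tokens)
        # Broadcast the language and speaker codes over every time step so
        # the shared backbone can specialize its output per language/voice.
        x = x + self.lang_embed(lang_id)[:, None, :] \
              + self.spk_embed(spk_id)[:, None, :]
        out, _ = self.rnn(x)
        return out

enc = MultilingualEncoder()
hidden = enc(torch.randint(0, 256, (2, 20)),   # token ids
             torch.tensor([0, 5]),             # language ids
             torch.tensor([3, 7]))             # speaker ids
```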
6. Metrics for Evaluation

Low-resource TTS systems are evaluated with a combination of subjective and objective indicators (a minimal sketch of the objective computations follows the list):

● The Mean Opinion Score (MOS), a subjective metric, is used to evaluate how natural the synthesized speech sounds [1].
● Mel-Cepstral Distortion (MCD) and Root Mean Square Error (RMSE) objectively measure the acoustic similarity between the generated and ground-truth speech [1][2].
● The Character Error Rate (CER), obtained from automatic speech recognition systems, evaluates the intelligibility of the generated speech [2].
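
As referenced above, here is a minimal sketch of the objective metrics, assuming frame-aligned features as input; the dummy arrays and the MCD constant convention are illustrative assumptions.

```python
import numpy as np

def mcd(ref_mcep: np.ndarray, gen_mcep: np.ndarray) -> float:
    """Mel-cepstral distortion in dB over frame-aligned (T, D) coefficients."""
    diff = ref_mcep - gen_mcep
    return float(np.mean(
        (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

def rmse(ref: np.ndarray, gen: np.ndarray) -> float:
    """Root mean square error, e.g. over aligned F0 contours."""
    return float(np.sqrt(np.mean((ref - gen) ** 2)))

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate via Levenshtein distance (rolling-array DP)."""
    m, n = len(reference), len(hypothesis)
    d = np.arange(n + 1)
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = min(d[j] + 1,                # deletion
                      d[j - 1] + 1,            # insertion
                      prev + (reference[i - 1] != hypothesis[j - 1]))
            prev, d[j] = d[j], cur
    return d[n] / max(m, 1)

print(mcd(np.random.randn(100, 25), np.random.randn(100, 25)))
print(rmse(np.random.rand(100), np.random.rand(100)))
print(cer("labas rytas", "labas ritas"))  # one substitution -> ~0.09
```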

7. Results and Outcomes

Research demonstrates that knowledge distillation, dual transformation, and transfer learning greatly enhance the quality of TTS in low-resource languages. For example, LRSpeech synthesized speech with over 98% intelligibility for Lithuanian, a low-resource language [2]. FastSpeech 2 and VITS surpass older models like Tacotron and WaveNet in lowering training complexity and enhancing speech quality. HiFi-GAN has likewise proven valuable as a vocoder for high-fidelity audio synthesis, even where data is scarce [1].
8. Conclusion

Research on low-resource text-to-speech synthesis remains difficult but important, as it has the potential to democratize access to digital content in underrepresented languages. The reviewed literature identifies a number of promising strategies that address the problem of data scarcity, such as dual transformation, knowledge distillation, and transfer learning. End-to-end systems like VITS and non-autoregressive models like FastSpeech 2 provide high-quality speech synthesis with less data. Future research should focus on improving the scalability and generalization of TTS systems for low-resource languages, as well as investigating the potential of unsupervised and self-supervised learning methods.

9. References

[1] Kumar, G. K., S V, P., Kumar, P., Khapra, M. M., & Nandakumar, K. (2023). "Towards Building Text-to-Speech Systems for the Next Billion Users". ICASSP 2023 – IEEE International Conference on Acoustics, Speech and Signal Processing.

[2] Xu, J., Tan, X., Ren, Y., Qin, T., Li, J., Zhao, S., & Liu, T.-Y. (2020). "LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition". 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2020).

[3] Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. (2022). "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech". arXiv preprint.
