
Low-Resource Text-to-Speech Synthesis

Abstract
Research on low-resource text-to-speech (TTS) synthesis is ongoing, particularly for languages with little available training data. In contrast to high-resource languages such as English, the majority of the world's languages, particularly indigenous or underrepresented ones, lack the data needed to train TTS models. Drawing on recent studies, this literature review examines state-of-the-art methods and open challenges in low-resource TTS. The survey covers methods including knowledge distillation, transfer learning, and dual transformation, along with neural architectures such as Glow-TTS, FastSpeech, and end-to-end models. We also discuss evaluation metrics, multilingual generalization, and vocoder choices. The survey highlights best practices for building scalable TTS systems in low-resource environments, with an emphasis on improving speech quality, training efficiency, and generalization to underrepresented languages.

1. Introduction

Text-to-speech (TTS) technology has evolved substantially over time, moving from concatenative and parametric methods to deep neural network-based systems. Neural TTS systems such as WaveNet and Tacotron have greatly increased the naturalness and quality of synthesized speech, but training them effectively requires large amounts of high-quality paired text and speech data. For most of the world's languages, such resources are scarce, which poses a serious obstacle to building high-quality TTS systems in low-resource environments.
This literature review concentrates on significant advances in low-resource TTS, where data-scarcity issues have been addressed by techniques such as transfer learning, pre-training, knowledge distillation, and dual transformation. Furthermore, neural architectures such as Glow-TTS, FastSpeech, and end-to-end models have shown promise in addressing the one-to-many mapping difficulty common in TTS, where a single text input can correspond to several valid speech realizations.

2. Low-Resource TTS Methods

2.1 Transfer Learning and Fine-Tuning


Transfer learning, in which models are pre-trained on high-resource languages and then fine-tuned on low-resource languages, is one of the most popular approaches in low-resource TTS. It reduces the quantity of data needed for training by leveraging the large datasets available for high-resource languages and transferring that knowledge to underrepresented ones. Xu et al. (2020) demonstrated the efficiency of this technique in their LRSpeech system, which pre-trains automatic speech recognition (ASR) and TTS models on rich-resource languages and fine-tunes them on low-resource languages [2]. By transferring the ability to learn the alignment between text and speech from high-resource languages, this technique improves performance noticeably.
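
To make the recipe concrete, the sketch below loads a (hypothetical) high-resource checkpoint, freezes the text front-end so the learned text-speech alignment is retained, and fine-tunes only the decoder on a small batch. The TinyTTS model, checkpoint path, and dummy data are illustrative assumptions, not LRSpeech's actual architecture.

```python
import torch
import torch.nn as nn

class TinyTTS(nn.Module):
    """Stand-in for a text-to-mel acoustic model (illustrative only)."""

    def __init__(self, vocab_size: int = 256, hidden: int = 128, n_mels: int = 80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.encoder = nn.LSTM(hidden, hidden, batch_first=True)
        self.decoder = nn.Linear(hidden, n_mels)

    def forward(self, tokens):
        h, _ = self.encoder(self.embed(tokens))
        return self.decoder(h)

model = TinyTTS()
# Step 1: load weights pre-trained on a high-resource language
# (hypothetical path; uncomment once a checkpoint exists).
# model.load_state_dict(torch.load("pretrained_high_resource.pt"))

# Step 2: freeze the text front-end, fine-tune the rest on low-resource data.
for param in list(model.embed.parameters()) + list(model.encoder.parameters()):
    param.requires_grad = False

optimizer = torch.optim.Adam(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)
criterion = nn.L1Loss()

tokens = torch.randint(0, 256, (4, 32))   # dummy low-resource token ids
target_mels = torch.randn(4, 32, 80)      # dummy target mel frames

for step in range(3):                     # a few illustrative fine-tuning steps
    optimizer.zero_grad()
    loss = criterion(model(tokens), target_mels)
    loss.backward()
    optimizer.step()
```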

2.2 Knowledge Distillation


Knowledge distillation has proven to be a successful technique for enhancing TTS systems with limited resources. In this method, a smaller "student" model learns to generalize more effectively from less data by using the soft targets produced by a larger "teacher" model trained on a larger dataset. The original FastSpeech applied this strategy to the one-to-many mapping problem in non-autoregressive models, using simplified outputs from an autoregressive teacher to train the student. FastSpeech 2 [3] improves on these earlier distillation strategies: by eliminating the intermediate distillation step and training directly on ground-truth data, it improves both voice quality and training efficiency.
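
A minimal distillation sketch in PyTorch follows. The teacher and student are generic stand-in networks rather than the FastSpeech architectures, and the 0.5 loss weights are an arbitrary illustration, not a published setting.

```python
import torch
import torch.nn as nn

n_mels = 80
# Hypothetical stand-ins: a larger "teacher" and a smaller "student".
teacher = nn.GRU(n_mels, n_mels, num_layers=2, batch_first=True)
student = nn.GRU(n_mels, n_mels, num_layers=1, batch_first=True)

optimizer = torch.optim.Adam(student.parameters(), lr=1e-4)
criterion = nn.L1Loss()

inputs = torch.randn(4, 50, n_mels)        # dummy upsampled text features
ground_truth = torch.randn(4, 50, n_mels)  # dummy reference mel frames

with torch.no_grad():                      # teacher provides soft targets
    soft_targets, _ = teacher(inputs)

pred, _ = student(inputs)
# Blend the distillation loss with the ground-truth loss.
loss = 0.5 * criterion(pred, soft_targets) + 0.5 * criterion(pred, ground_truth)
loss.backward()
optimizer.step()
```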

2.3 Dual Transformation and Cycle Consistency


Dual transformation is another cutting-edge strategy for addressing low-resource issues. Exploiting the duality between ASR and TTS, this method iteratively improves both models. In their LRSpeech system, Xu et al. (2020) use dual transformation to increase the accuracy of TTS by employing ASR to convert unpaired speech to text and vice versa [2]. Because unpaired data is much easier to obtain in many low-resource languages, this strategy has proven effective.
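
The loop below sketches one round of dual transformation at the pseudo-label level, following the idea in LRSpeech [2]. The StubASR/StubTTS classes and their transcribe/synthesize/train_on methods are hypothetical stand-ins so the example runs end to end.

```python
# Hypothetical stubs; real systems would be trained neural ASR/TTS models.
class StubASR:
    def transcribe(self, wav):
        return "pseudo transcript"

    def train_on(self, pairs):
        pass  # a real model would run gradient steps here

class StubTTS:
    def synthesize(self, text):
        return [0.0] * 16000  # one second of silent "audio"

    def train_on(self, pairs):
        pass

def dual_transformation_round(asr, tts, unpaired_speech, unpaired_text):
    # ASR -> TTS: transcribe unpaired speech into pseudo (text, speech) pairs.
    pseudo_text = [asr.transcribe(wav) for wav in unpaired_speech]
    tts.train_on(list(zip(pseudo_text, unpaired_speech)))
    # TTS -> ASR: synthesize unpaired text into pseudo (speech, text) pairs.
    pseudo_speech = [tts.synthesize(txt) for txt in unpaired_text]
    asr.train_on(list(zip(pseudo_speech, unpaired_text)))

asr, tts = StubASR(), StubTTS()
unpaired_speech = [[0.0] * 16000 for _ in range(8)]
unpaired_text = ["labas rytas"] * 8
for _ in range(3):  # each round lets the models bootstrap each other
    dual_transformation_round(asr, tts, unpaired_speech, unpaired_text)
```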

3. Neural Architectures for Low-Resource TTS

3.1 Non-Autoregressive Models


FastSpeech 2 [3] is a leading model for non-autoregressive TTS. It fixes the problems autoregressive models have with slow inference and robustness (such as word skipping). By adding variance predictors for duration, pitch, and energy, FastSpeech 2 eases the one-to-many mapping problem and increases model accuracy in situations with sparse data. Furthermore, its FastSpeech 2s variant can generate speech waveforms directly from text without intermediate mel-spectrograms, further streamlining the TTS pipeline and improving performance and scalability in low-resource settings.
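
The variance predictor is the architectural piece that carries the duration/pitch/energy conditioning. Below is a minimal PyTorch sketch loosely following the paper's two-layer Conv1D + LayerNorm design [3]; the layer sizes are illustrative, not the published configuration.

```python
import torch
import torch.nn as nn

class VariancePredictor(nn.Module):
    """Predicts one scalar (duration, pitch, or energy) per input position."""

    def __init__(self, hidden: int = 256, kernel: int = 3, dropout: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):                        # x: (batch, time, hidden)
        h = torch.relu(self.conv1(x.transpose(1, 2))).transpose(1, 2)
        h = self.dropout(self.norm1(h))
        h = torch.relu(self.conv2(h.transpose(1, 2))).transpose(1, 2)
        h = self.dropout(self.norm2(h))
        return self.proj(h).squeeze(-1)          # (batch, time)

encoder_out = torch.randn(2, 40, 256)            # dummy phoneme encodings
pitch_predictor = VariancePredictor()
predicted_pitch = pitch_predictor(encoder_out)   # one pitch value per phoneme
print(predicted_pitch.shape)                     # torch.Size([2, 40])
```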

3.2 Flow-Based Models: Glow-TTS


Glow-TTS [1] offers a flow-based generative alternative to other non-autoregressive models. By employing a monotonic alignment search procedure, it eliminates the need for an external aligner and lets the model map text inputs directly to mel-spectrograms. As a result, less data is needed for training and higher-quality speech synthesis can be achieved.
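
A simplified NumPy sketch of monotonic alignment search in the spirit of Glow-TTS [1]: dynamic programming finds the most likely monotonic, surjective alignment between text tokens and spectrogram frames given a log-likelihood matrix. The shapes and random input are illustrative, and the official implementation is considerably more optimized.

```python
import numpy as np

def monotonic_alignment_search(log_probs: np.ndarray) -> np.ndarray:
    """log_probs[i, j]: log-likelihood of frame j under text token i."""
    n_text, n_frames = log_probs.shape
    q = np.full((n_text, n_frames), -np.inf)   # best cumulative score
    q[0, 0] = log_probs[0, 0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):    # token index cannot exceed frame index
            stay = q[i, j - 1]                               # keep the same token
            advance = q[i - 1, j - 1] if i > 0 else -np.inf  # move to the next token
            q[i, j] = log_probs[i, j] + max(stay, advance)
    # Backtrack from the forced endpoint (last token at the last frame).
    alignment = np.zeros((n_text, n_frames), dtype=np.int64)
    i = n_text - 1
    for j in range(n_frames - 1, 0, -1):
        alignment[i, j] = 1
        if i > 0 and q[i - 1, j - 1] >= q[i, j - 1]:
            i -= 1
    alignment[i, 0] = 1
    return alignment

align = monotonic_alignment_search(np.random.randn(5, 12))
print(align.sum(axis=0))   # every frame maps to exactly one token
```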
3.3 End-to-End Models: VITS
End-to-end models like VITS [1] combine the benefits of flow-based models with neural vocoders such as HiFi-GAN. By synthesizing speech directly from text, VITS reduces the need for intensive feature engineering and enhances speech quality. The model has proven very helpful in settings with limited resources and sparse data, where models must generalize well across speakers and languages.
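
As a usage-level illustration (not from the surveyed papers), the open-source Coqui TTS toolkit distributes a pretrained VITS checkpoint, which shows how an end-to-end model goes from text straight to a waveform with no separate vocoder stage. The package name, model string, and API below are assumptions that may differ across versions.

```python
# pip install TTS   (Coqui TTS; assumed package and model name)
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/ljspeech/vits")
tts.tts_to_file(text="End-to-end synthesis needs no separate vocoder.",
                file_path="vits_sample.wav")
```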

4. Vocoder Selection for Low-Resource TTS

4.1 GAN-Based Vocoders: HiFi-GAN


HiFi-GAN [1] has been found to be a computationally efficient vocoder for TTS systems that can produce high-quality audio with a small amount of training data. By adding a multi-period discriminator, HiFi-GAN outperforms previous vocoders and is able to identify long-term dependencies in the waveform.
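
The multi-period idea is simple to sketch: each sub-discriminator reshapes the 1-D waveform into a 2-D grid whose width equals its period, so ordinary 2-D convolutions see periodic structure. The channel counts and layer depths below are illustrative, not HiFi-GAN's published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PeriodDiscriminator(nn.Module):
    def __init__(self, period: int):
        super().__init__()
        self.period = period
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, (5, 1), stride=(3, 1), padding=(2, 0)),
            nn.LeakyReLU(0.1),
            nn.Conv2d(32, 64, (5, 1), stride=(3, 1), padding=(2, 0)),
            nn.LeakyReLU(0.1),
            nn.Conv2d(64, 1, (3, 1), padding=(1, 0)),
        )

    def forward(self, wav):                    # wav: (batch, 1, samples)
        b, c, t = wav.shape
        if t % self.period:                    # pad so length divides the period
            wav = F.pad(wav, (0, self.period - t % self.period), "reflect")
            t = wav.shape[-1]
        grid = wav.view(b, c, t // self.period, self.period)
        return self.convs(grid)                # real/fake score map

# The full discriminator runs several prime periods in parallel, e.g.:
discriminators = [PeriodDiscriminator(p) for p in (2, 3, 5, 7, 11)]
scores = [d(torch.randn(1, 1, 8000)) for d in discriminators]
```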

4.2 Diffusion-Based Vocoders: WaveGrad


WaveGrad [1] is a diffusion-based vocoder that trades slower inference for higher-quality speech synthesis. Although it requires more computing power, it iteratively refines a noisy input into a high-quality waveform, producing high-fidelity speech. Diffusion-based vocoders are therefore helpful in low-resource situations where output quality matters more than generation speed.
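
The sketch below shows this iterative refinement schematically: starting from Gaussian noise, a denoising network conditioned on the mel-spectrogram is applied repeatedly. The DenoiserStub network and the simplified update rule are placeholders, not WaveGrad's actual noise schedule or architecture.

```python
import torch
import torch.nn as nn

class DenoiserStub(nn.Module):
    """Stands in for WaveGrad's noise-prediction network."""

    def __init__(self):
        super().__init__()
        self.net = nn.Conv1d(1, 1, kernel_size=9, padding=4)

    def forward(self, noisy_wav, mel, step):
        # A real model would condition on `mel` and the noise level `step`.
        return self.net(noisy_wav)

denoiser = DenoiserStub()
mel = torch.randn(1, 80, 32)    # conditioning mel-spectrogram
wav = torch.randn(1, 1, 8000)   # start from pure Gaussian noise

num_steps = 6                   # more steps trade speed for quality
with torch.no_grad():
    for step in reversed(range(num_steps)):
        predicted_noise = denoiser(wav, mel, step)
        wav = wav - 0.5 * predicted_noise   # simplified denoising update
```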

5. Generalization Across Languages and Speakers

Multilingual and multi-speaker models make it possible to transfer learning across languages and speakers in low-resource TTS. Kumar et al. (2023) demonstrated this by training IndicTTS [1] models based on FastPitch on 13 Indian languages, enabling fast and accurate speech generation across a variety of languages. With less training data, this approach improves the scalability of TTS models and enables them to generalize across linguistic variation.
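
A common pattern behind such systems is a single shared backbone conditioned on learned language and speaker embeddings. The sketch below shows this generic pattern; it is not the IndicTTS/FastPitch implementation, and all sizes are illustrative.

```python
import torch
import torch.nn as nn

class MultilingualEncoder(nn.Module):
    def __init__(self, vocab=256, n_langs=13, n_speakers=32, hidden=128):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, hidden)
        self.lang_embed = nn.Embedding(n_langs, hidden)
        self.spk_embed = nn.Embedding(n_speakers, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)

    def forward(self, tokens, lang_id, spk_id):
        x = self.text_embed(tokens)
        # Broadcast the language and speaker codes over every time step so
        # the shared backbone can specialize its output per language/voice.
        x = x + self.lang_embed(lang_id)[:, None, :] \
              + self.spk_embed(spk_id)[:, None, :]
        out, _ = self.rnn(x)
        return out

enc = MultilingualEncoder()
hidden = enc(torch.randint(0, 256, (2, 20)),   # token ids
             torch.tensor([0, 5]),             # language ids
             torch.tensor([3, 7]))             # speaker ids
```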
6. Metrics for Evaluation

Low-resource TTS systems are evaluated with a combination of subjective and objective indicators (a minimal sketch of the objective computations follows the list):

● The Mean Opinion Score (MOS), a subjective metric, is used to evaluate how natural the synthesized speech sounds [1].
● Mel-Cepstral Distortion (MCD) and Root Mean Square Error (RMSE) objectively measure the acoustic similarity between the generated and ground-truth speech [1][2].
● The Character Error Rate (CER), obtained from automatic speech recognition systems, evaluates the intelligibility of the generated speech [2].
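
As referenced above, here is a minimal sketch of the objective metrics, assuming frame-aligned features as input; the dummy arrays and the MCD constant convention are illustrative assumptions.

```python
import numpy as np

def mcd(ref_mcep: np.ndarray, gen_mcep: np.ndarray) -> float:
    """Mel-cepstral distortion in dB over frame-aligned (T, D) coefficients."""
    diff = ref_mcep - gen_mcep
    return float(np.mean(
        (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))))

def rmse(ref: np.ndarray, gen: np.ndarray) -> float:
    """Root mean square error, e.g. over aligned F0 contours."""
    return float(np.sqrt(np.mean((ref - gen) ** 2)))

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate via Levenshtein distance (rolling-array DP)."""
    m, n = len(reference), len(hypothesis)
    d = np.arange(n + 1)
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            cur = min(d[j] + 1,                # deletion
                      d[j - 1] + 1,            # insertion
                      prev + (reference[i - 1] != hypothesis[j - 1]))
            prev, d[j] = d[j], cur
    return d[n] / max(m, 1)

print(mcd(np.random.randn(100, 25), np.random.randn(100, 25)))
print(rmse(np.random.rand(100), np.random.rand(100)))
print(cer("labas rytas", "labas ritas"))  # one substitution -> ~0.09
```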

7. Results and Outcomes

Research demonstrates that knowledge distillation, dual transformation, and transfer learning greatly enhance the quality of TTS in low-resource languages. For example, LRSpeech synthesized speech with over 98% intelligibility for Lithuanian, a low-resource language [2]. FastSpeech 2 and VITS surpass older models like Tacotron and WaveNet in lowering training complexity and enhancing speech quality. HiFi-GAN has likewise proven valuable as a vocoder for high-fidelity audio synthesis, even where data is scarce [1].
8. Conclusion

Research on low-resource text-to-speech synthesis remains difficult but important, as it has the potential to democratize access to digital content in underrepresented languages. The reviewed literature identifies a number of promising strategies that address the problem of data scarcity, such as dual transformation, knowledge distillation, and transfer learning. End-to-end systems like VITS and non-autoregressive models like FastSpeech 2 provide high-quality speech synthesis with less data. Future research should focus on improving the scalability and generalization of TTS systems for low-resource languages, as well as investigating the potential of unsupervised and self-supervised learning methods.

9. References

[1] Kumar, G. K., S V, P., Kumar, P., Khapra, M. M., & Nandakumar, K. (2023). "Towards Building Text-to-Speech Systems for the Next Billion Users". ICASSP 2023 – IEEE International Conference on Acoustics, Speech and Signal Processing.

[2] Xu, J., Tan, X., Ren, Y., Qin, T., Li, J., Zhao, S., & Liu, T.-Y. (2020). "LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition". 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2020).

[3] Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T.-Y. (2022). "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech". arXiv preprint.
