Low Resource Text To Speech Synthesis
Submitted by
C B Nagarathna 21BCS023
Chandana R 21BDS010
Dr. Nataraj K S
17/11/2024
Certificate

Contents
1 Introduction
2 Related Work
3 Data and Methods
4 Results and Discussions
5 Conclusion
References
List of Figures
1 Mel-spectrogram of Audio 1
2 Mel-spectrogram of Audio 2
3 Mel-spectrogram of Audio 3
1 Introduction
Text-to-speech (TTS) synthesis has become a key technology for improving access to information and interaction in nearly every domain. Much like screen readers for the blind or voice-activated assistants such as Siri and Alexa, TTS systems are changing the way society engages with digital devices. By transforming written text into spoken output, TTS lets users interact with devices more naturally while receiving information and issuing commands.
Conventionally, building accurate TTS models requires large amounts of labelled audio-text data covering a wide range of phonetic, prosodic, and contextual variation. Such datasets are essential for training models that produce natural, expressive, and intelligible speech. However, large and diverse datasets are difficult to source and are often scarce, particularly in low-resource environments or for low-resource languages. This data scarcity makes it hard to build the high-quality TTS systems that real applications demand.
The primary objective of this project is to overcome this challenge by fine-tuning the Tortoise TTS model on a limited English-language dataset, demonstrating that high-quality TTS synthesis is possible even in a constrained data environment. Despite the limited training data, the goal is to produce clear, intelligible, and natural-sounding speech that maintains a high level of user satisfaction.
The Tortoise TTS model is particularly well-suited for this task, as it incorporates state-
of-the-art techniques, including autoregressive and diffusion-based speech synthesis methods.
These techniques allow the model to generate high-quality speech even when trained on smaller
datasets. Furthermore, this project explores advanced strategies such as data augmentation,
model configuration, and optimization techniques to further enhance synthesis quality, enabling
the system to deliver a more natural and expressive voice.
In conclusion, the goal of this project is to show that a TTS system of perceptually good quality and reasonable efficiency can be built under low-resource conditions. This work has implications for a wide range of domains, from assistive technologies to virtual assistants and IVR systems.
2 Related Work
Recent enhancements of TTS synthesis for low resource scenarios have been directed towards
methods of how to increase the ability of a model to generalize, work efficiently with available
resources and produce high quality output given little data. They are: transfer learning, knowl-
edge distillation, dual transformation, and application of method-specific neural counterparts.
The following section sums up these approaches to benchmark the techniques used in the project.
Transfer Learning: Models pre-trained on high-resource datasets can be fine-tuned on smaller datasets to adapt to low-resource languages. Xu et al. (2020) showcased this with LRSpeech, achieving improved clarity and intelligibility by leveraging transfer learning.
Knowledge Distillation: Smaller models benefit from learning "soft" labels generated by larger "teacher" models, enabling better generalization with limited data. FastSpeech notably applied this technique to simplify the one-to-many mapping between text and speech, improving training efficiency and output quality.
Dual Transformation and Cycle Consistency: Dual transformation, a technique that
enhances TTS by utilizing the duality between TTS and ASR (automatic speech recognition)
models, has shown success in low-resource applications. In this approach, ASR converts unpaired
audio to text, while TTS uses this text to improve speech synthesis. This iterative process
allows for improved training using unpaired data, which is often easier to acquire in low-resource
settings.
Neural Architectures for Low-Resource TTS: Among recent models such as FastSpeech 2, Glow-TTS, and VITS, several innovations are noteworthy for low-resource TTS:
• FastSpeech 2: A non-autoregressive model that predicts variations such as duration, pitch, and energy to produce robust, high-quality audio efficiently, even with limited data.
• VITS: End-to-end models like VITS integrate flow-based techniques with GAN-based
vocoders like HiFi-GAN to directly synthesize audio from text, minimizing feature engi-
neering and improving quality in low-resource conditions.
3 Data and Methods
This section describes the dataset, preprocessing techniques, and the training process used to
fine-tune the Tortoise TTS model on a low-resource English dataset.
3.1 Data
The dataset consists of paired audio recordings and their corresponding textual transcriptions,
specifically curated for developing a Text-to-Speech (TTS) system for English, with some
bilingual support for Kannada.
• Source and Collection: The audio files were sourced from the Indic TTS dataset.
• Format: All audio files are in WAV format, ensuring high audio fidelity for TTS model
training.
• Duration: Audio file lengths vary, ranging from short phrases to full sentences, providing
a diverse representation of linguistic patterns, pronunciation, and speaking speeds.
• Sampling Rate: Audio files originally came with varying sampling rates. To standardize
the dataset, all files were resampled to a uniform rate of 22,050 Hz.
• Total Data Volume: The dataset contains [number] audio files, with a total duration of
approximately [total duration], providing a robust foundation for TTS training.
• Transcription Format: Text transcriptions are saved in a structured CSV file in which each row holds the name of a specific audio file and its transcription.
• Language Coverage: The transcriptions are primarily in English, with some Kannada
words included to assess the model’s performance in bilingual scenarios.
3.2 Methods
This section outlines the approach used to implement, train, and evaluate the Text-to-Speech
(TTS) model, detailing the audio processing techniques, model architecture, training methodol-
ogy, and evaluation metrics. The TTS system aims to generate high-quality, natural-sounding
speech from text, supporting both English and Kannada.
• Resampling: All audio files were resampled to a standardized sample rate of 22,050
Hz. This rate provides a balance between computational efficiency and audio fidelity.
Resampling was performed using the Librosa library to ensure accurate conversion across
varying original sample rates.
• Output Management: Preprocessed audio files were saved with a processed prefix for easy tracking and retrieval. This organization allowed for efficient management of the audio dataset throughout the training process (a preprocessing sketch follows this list).
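The following is a minimal sketch of these preprocessing steps using Librosa and SoundFile; the directory names raw_audio and processed_audio are assumptions for illustration, not names taken from the project.

import os
import librosa
import soundfile as sf

TARGET_SR = 22050            # standardized sample rate for all files
RAW_DIR = "raw_audio"        # assumed input directory
OUT_DIR = "processed_audio"  # assumed output directory

os.makedirs(OUT_DIR, exist_ok=True)
for fname in os.listdir(RAW_DIR):
    if not fname.lower().endswith(".wav"):
        continue
    # librosa resamples on load when sr is given, whatever the original rate
    audio, _ = librosa.load(os.path.join(RAW_DIR, fname), sr=TARGET_SR)
    # save with a "processed_" prefix for easy tracking and retrieval
    sf.write(os.path.join(OUT_DIR, f"processed_{fname}"), audio, TARGET_SR)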
To ensure that the text transcriptions were in a suitable format for the model, the following
steps were applied:
• Character Filtering: Non-alphabetic characters, except for Kannada script characters
(\u0C80-\u0CFF), were removed using regular expressions. This step ensured that only
relevant characters (English and Kannada) were included, which reduced unnecessary
noise in the data.
• CSV Storage: The cleaned transcriptions were saved in a structured CSV file for analysis, in which every row pairs an audio file with its transcription. This organization made it easy to connect each audio file to its text (an illustrative cleaning sketch follows this list).
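A minimal sketch of this cleaning step is shown below; the file name metadata.csv and the column names audio_file and transcription are assumptions used only for illustration.

import re
import pandas as pd

# keep English letters, Kannada characters (U+0C80-U+0CFF), and spaces
ALLOWED = re.compile(r"[^A-Za-z\u0C80-\u0CFF ]+")

def clean_transcription(text: str) -> str:
    cleaned = ALLOWED.sub(" ", text)
    return re.sub(r"\s+", " ", cleaned).strip()

df = pd.read_csv("metadata.csv")  # assumed: columns audio_file, transcription
df["transcription"] = df["transcription"].astype(str).map(clean_transcription)
df.to_csv("cleaned_metadata.csv", index=False)  # one row per audio file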
The TTS model was designed to generate high-fidelity speech from text. The architecture of
the chosen model, Tortoise TTS, is as follows:
• Model Type: The Tortoise Text-to-Speech model was selected for its capability to produce high-quality, natural-sounding speech. It uses an autoregressive decoder to generate speech and incorporates a discriminator network to enhance audio quality through adversarial training (an inference sketch follows this list).
• Discriminator Network: The discriminator network evaluates the generated speech for
realism, helping the model refine its output through adversarial training. This network
compares generated speech to real speech and encourages the model to produce more
natural-sounding results.
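For context, inference with the publicly released tortoise-tts package typically follows the pattern sketched below; the voice name, preset, and input text are illustrative, and exact arguments may differ between versions of the library.

import torchaudio
from tortoise.api import TextToSpeech
from tortoise.utils.audio import load_voice

tts = TextToSpeech()  # loads the pretrained Tortoise components
voice_samples, conditioning_latents = load_voice("tom")  # example built-in voice
gen = tts.tts_with_preset(
    "Low-resource text to speech synthesis.",  # example input text
    voice_samples=voice_samples,
    conditioning_latents=conditioning_latents,
    preset="fast",  # trades some quality for generation speed
)
torchaudio.save("generated.wav", gen.squeeze(0).cpu(), 24000)  # Tortoise outputs 24 kHz audio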
The training of the TTS model followed a well-structured methodology, optimized for the specific
needs of our bilingual dataset:
• Fine-Tuning: The pre-trained Tortoise model was fine-tuned on our dataset, which
consisted of paired audio and transcription data. The fine-tuning process involved training
the model for 30 epochs, where each epoch involved processing the entire training set once.
The model learned to adapt to the speech characteristics specific to the dataset, including
both English and Kannada words.
• Loss Function - MSD: The primary loss function used during training was Mel-Spectral Distortion (MSD), which measures the distance between the predicted and ground-truth mel-spectral features; the model aims to minimize this value so that the generated speech matches the reference audio as closely as possible (a sketch of this metric follows the list).
• Optimizer: The model was optimized using the Adam optimizer, which is well-suited
for training deep learning models. Adam’s adaptive learning rate mechanism helped the
model converge quickly and efficiently during training.
• Training Environment: Training was carried out in PyTorch, an open-source deep learning framework that provides the tools commonly used to build neural networks. In addition, NumPy and Librosa were used for numerical computation and audio handling, respectively.
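Since the report does not give the exact formulation of MSD, the sketch below shows one plausible way to compute a mel-spectral-distortion-style score: the mean Euclidean distance between the log-mel frames of a reference recording and a synthesized one. Frame alignment (for example DTW) is omitted for brevity, and the function name and defaults are assumptions.

import numpy as np
import librosa

def mel_spectral_distortion(ref_wav: str, syn_wav: str,
                            sr: int = 22050, n_mels: int = 80) -> float:
    """Mean frame-wise Euclidean distance between log-mel spectrograms."""
    ref, _ = librosa.load(ref_wav, sr=sr)
    syn, _ = librosa.load(syn_wav, sr=sr)

    def log_mel(y):
        mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
        return librosa.power_to_db(mel)  # shape: (n_mels, n_frames)

    ref_m, syn_m = log_mel(ref), log_mel(syn)
    n = min(ref_m.shape[1], syn_m.shape[1])  # naive length matching
    diff = ref_m[:, :n] - syn_m[:, :n]
    return float(np.mean(np.sqrt(np.sum(diff ** 2, axis=0))))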
4 Results and Discussions
Figure 2: Mel-spectrogram of Audio 2.
Audio samples of both the reference speech and the synthesized speech are available in the
appendix for further comparison. A clear improvement in the naturalness and fluency of speech
can be heard in the fine-tuned model, with reduced distortions and more accurate pronunciation.
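For reference, a mel-spectrogram like the ones shown in the figures can be produced with Librosa and Matplotlib as sketched below; the file name is illustrative.

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

y, sr = librosa.load("processed_audio_2.wav", sr=22050)  # assumed file name
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)

plt.figure(figsize=(8, 3))
librosa.display.specshow(mel_db, sr=sr, x_axis="time", y_axis="mel")
plt.colorbar(format="%+2.0f dB")
plt.title("Mel-spectrogram of synthesized audio")
plt.tight_layout()
plt.savefig("mel_spectrogram.png")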
4.4 Discussion
4.4.1 Interpretation of Results
The results from both the quantitative and qualitative evaluations indicate that fine-tuning the
Tortoise TTS model on our custom dataset led to significant improvements in the quality of the
synthesized speech. The MSD scores obtained for the synthesized speech samples (6.70, 6.76,
and 6.20) indicate good quality, close to natural speech, though not within the range classified
as excellent (below 6). These results highlight that while fine-tuning the Tortoise TTS model
improved the quality of speech synthesis, there is still room for further optimization to achieve
even closer resemblance to natural speech. The subjective evaluations further confirm that
the fine-tuned model produced more natural and intelligible speech compared to the baseline,
suggesting that the model successfully adapted to the characteristics of the custom dataset.
Future work could explore additional fine-tuning techniques or enhancements to the dataset to
further reduce the MSD scores and improve the overall quality of synthesized speech.
One of the main challenges during the fine-tuning process was the inconsistency in the audio
quality of the dataset. While most of the audio samples were clear, some had background noise
or low volume levels, which could have impacted the model’s ability to generalize effectively.
Despite these challenges, the model was still able to generate high-quality synthesized speech,
but future work could involve cleaning and augmenting the dataset to reduce noise.
Moreover, training took longer because of the size of the dataset and the complexity of the model. This overhead can be a limitation when scaling TTS systems, but future work could apply techniques such as model pruning or hardware acceleration to mitigate the issue.
5 Conclusion
This project demonstrates the feasibility of achieving high-quality text-to-speech synthesis in
low-resource settings by utilizing advanced neural architectures such as Tortoise TTS. Through
rigorous data preprocessing, effective utilization of a curated dataset, and detailed evaluation
metrics, the synthesized speech achieved a significant improvement in naturalness, clarity, and
intelligibility. Quantitative metrics, such as a reduced Mel-Spectral Distortion (MSD) score, along with qualitative evaluations, indicate the model's ability to generate speech closely resembling human-like delivery.
The work highlights the potential of leveraging state-of-the-art TTS models to overcome
the challenges posed by limited data availability, emphasizing the importance of careful dataset
preparation and evaluation techniques. By focusing on resource efficiency and quality, this
project paves the way for accessible TTS solutions in diverse languages and under-resourced
domains, making it a significant contribution to the development of inclusive and effective
speech synthesis systems.
Beyond the technical aspects of speech synthesis in low-resource conditions, this project highlights the wide range of areas that can benefit from it, including education, accessibility, and content localization. It contributes to language inclusiveness for minority languages for which little technology has been developed in the past. Moreover, the scalable and adaptive techniques adopted here allow the model to be extended to other domains, such as assistive technology for the visually impaired and regional voice-based interfaces. This makes the project a useful foundation for future work in text-to-speech synthesis and language preservation.
References
[1] Jin Xu, Xu Tan, Yi Ren, Tao Qin, Jian Li, Sheng Zhao, and Tie-Yan Liu. 2020. LRSpeech: Extremely Low-Resource Speech Synthesis and Recognition. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '20). Association for Computing Machinery, New York, NY, USA, 2802–2812. https://doi.org/10.1145/3394486.3403331
[2] Kumar, G. K., S. V, P., Kumar, P., Khapra, M. M., and Nandakumar, K. (2023). "Towards Building Text-to-Speech Systems for the Next Billion Users". ICASSP 2023 – IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE.
[3] Ren, Y., Hu, C., Tan, X., Qin, T., Zhao, S., Zhao, Z., and Liu, T. Y. (2022). "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech".
[4] Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu, “A survey on neural speech synthesis,”
arXiv preprint arXiv:2106.15561, 2021.
[5] Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4779–4783.
[6] Jaehyeon Kim, Sungwon Kim, Jungil Kong, and Sungroh Yoon, “Glow-TTS: A genera-
tive flow for text-to-speech via monotonic alignment search,” Advances in Neural Information
Processing Systems, vol. 33, pp. 8067–8077, 2020.
[7] Yi Ren, Yangjun Ruan, Xu Tan, Tao Qin, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu,
“FastSpeech: Fast, robust, and controllable text to speech,” Advances in Neural Information
Processing Systems, vol. 32, 2019.
[8] “Text to speech model training - vakyansh,” https://github.com/Open-Speech-EkStep/vakyansh-
modelstts-models. Open Speech EkStep.
[9] Stan Salvador and Philip Chan, “Toward accurate dynamic time warping in linear time
and space,” Intelligent Data Analysis, vol. 11, no. 5, pp. 561–580, 2007.