When Whisper Meets TTS: Domain Adaptation Using Only Synthetic Speech Data
1 Introduction
In recent years, there have been significant advances in Automatic Speech Recog-
nition (ASR) using End-to-End (E2E) models [1]. One of the most notable
outcomes in this area is the development of Transformer-based architectures,
such as Wav2Vec2.0 [2], the Conformer networks [3], and more recently, fully
supervised models like Whisper [4]. These models have achieved state-of-the-
art performance on a variety of speech recognition benchmarks, including those
that involve noisy or accented speech. Although the results observed to date
are impressive, several challenges still remain for the research community. For
instance, these E2E architectures still require large amounts of transcribed
speech data to be trained and to reach good performance. Furthermore, spe-
cific domains such as health care, forensics, multimedia, or government, among
others, may face limited data availability due to privacy concerns or difficulties in
data collection. These scenarios pose challenges due to non-controlled acoustic
conditions and domain-specific vocabulary. Addressing these problems is par-
ticularly challenging for low-resource languages where large labeled corpora are
scarce for training ASR systems. Therefore, training robust ASR models for spe-
cific domains, low resource languages, and non-controlled acoustic conditions
becomes a difficult task.
One strategy to adapt ASR systems to specific domains, especially under low
resource settings, is to use data augmentation strategies to artificially increase
the size of the training data. Methods like SpecAugment [5] or those based on
speed perturbation and noise injection have been shown to be helpful in adapting ASR
models to specific domains. However, these methods only focus on adapting the
ASR system to the specific acoustic conditions of the target domain, overlooking
the challenge of domain-specific vocabulary.
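As a brief illustration of such spectrogram-level augmentation, the following sketch applies SpecAugment-style frequency and time masking with torchaudio; it is a minimal example with assumed mask widths, not the configuration used in [5].

# A minimal sketch of SpecAugment-style masking; mask widths are illustrative.
import torch
import torchaudio

log_mel = torch.randn(1, 80, 300)  # placeholder log-Mel spectrogram (batch, mels, frames)

augment = torch.nn.Sequential(
    torchaudio.transforms.FrequencyMasking(freq_mask_param=15),
    torchaudio.transforms.TimeMasking(time_mask_param=35),
)

augmented = augment(log_mel)
print(augmented.shape)  # same shape, with random frequency and time bands zeroed out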
Recent studies have demonstrated that it is possible to perform data augmen-
tation, or even full training of ASR systems, using synthetic data obtained from
Text-To-Speech (TTS) systems [6–9]. Given the improvements in neural TTS
models, such as Tacotron-2 with Global Style Tokens [10] and more recently
VALL-E [11], it is possible to generate high quality speech with varying prosody
that can be used also to train and adapt E2E ASR models. The use of synthetic
speech to fine-tune ASR systems can be particularly helpful to deal with the
issue of out-of-vocabulary words and to expand the vocabulary of E2E systems
during training [6,12,13]. In [14], the authors use synthetic speech to teach med-
ication names to an E2E ASR system based on a Recurrent Neural Networks
Transducer (RNN-T). The training process involved mixing real and synthetic
samples. The fine-tuned model achieved a relative word error rate (WER) reduction
of up to 65% when recognizing out-of-vocabulary words related to medication
names. Additional studies have shown that combining real and synthetic speech
data during training and fine-tuning can reduce the WER of an ASR system [6–
9,15,16]. However, there are important considerations to address in those cases
to achieve accurate results. For instance, when using synthetic speech for ASR
training, it is important to deal with the mismatch in acoustic characteristics
between real and synthetic audio. Synthetic speech may contain artifacts that
do not exist in real data, such as unrealistic speaking styles and the absence of
background noise. Some studies have mitigated this issue by implementing reg-
ularization strategies [13,17] and freezing the encoder of the E2E model during
fine-tuning [12,13]. The process of fine-tuning only the decoder of the network is
similar to adapting a language model. Therefore, the model only learns the token
representation of out-of-vocabulary words, rather than the acoustic properties
of synthetic speech [18].
These previous works have proven the benefits of increased acoustic and lex-
ical diversity in synthetic data for ASR training. Nevertheless, most of them are
evaluated using standard benchmark corpora, such as LibriSpeech [19]. Moreover,
in the majority of cases, TTS-derived data is used only for data augmentation,
rather than to adapt the model to new, unseen domains, particularly in low
resource languages, where there is a real need to adapt E2E ASR models.
Finally, most of the previous studies have focused on mixing real and synthetic
audio data, and have not shown reliable results using only synthetic speech [7,17].
This study extends all previous research by using only synthetic data for
domain adaptation of E2E ASR models. Our approach is motivated by the recent
release of Whisper [4], which was pre-trained with large amounts of labeled data
from the Internet (up to 680k h). This leads us to believe that it is now possible
to fine-tune models using only synthetic speech, making the domain adapta-
tion tasks more feasible, especially in low resource languages. We considered a
state-of-the-art TTS system to generate realistic speech signals in order to create
adapted Whisper models for a variety of domains, including forensics, broad-
cast media, and parliamentary. Our proposed methodology was also evaluated
in different languages to test the effect of using synthetic speech to adapt pre-
trained models with large, intermediate, and low resource languages such as
English, Spanish, and Basque, respectively. To the best of our knowledge, this
is one of the first studies to consider the effect of using only synthetic speech in
ASR model adaptation for non-English data. An additional contribution of this
paper relies on the evaluation and comparison of different Parameter Efficient
Fine-Tuning (PEFT) methods [20] when training large Transformer-based mod-
els. PEFT-based approaches focus on fine-tuning only a small number of model
parameters, thereby greatly decreasing the computational and storage costs. The
use of these strategies has not been extensively explored for speech-based mod-
els. However, this is an important aspect to be considered when training large
models such as Whisper.
The rest of the paper is organized as follows. Section 2 describes the methods
and strategies considered to adapt an ASR system based on Whisper to new
unknown domains in different languages. Section 3 describes the different corpora
considered in this study to train and evaluate the proposed approach. Section 4
shows the main results obtained and discusses the main insights derived from the
performed experiments. Finally, Sect. 5 draws the main conclusions and presents
further perspectives to be addressed.
2 Methods
2.1 Whisper
Whisper is an encoder-decoder Transformer network recently introduced by Ope-
nAI [4]. The model is trained in a fully supervised manner, using up to 680k h
of labeled speech data from multiple sources. The encoder is fed with 80-channel log-Mel spectrogram features extracted from the input audio.
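As an illustration of the model interface, the following sketch loads a small Whisper checkpoint through the Hugging Face transformers library and inspects the 80-channel input representation; it is a minimal example, not the training setup of [4].

# A minimal sketch, assuming the Hugging Face "transformers" port of Whisper.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# A 30 s placeholder waveform at 16 kHz stands in for a real utterance.
audio = torch.zeros(16_000 * 30).numpy()
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
print(inputs.input_features.shape)  # torch.Size([1, 80, 3000]): 80 Mel channels

predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))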
2.2 Parameter Efficient Fine-Tuning
Due to the large number of parameters to fine-tune, especially for the Large version
of Whisper (1550 M), a set of PEFT strategies was applied. These methods
aimed to fine-tune a small number of model parameters, decreasing the compu-
tational and storage costs [20]. In general, PEFT methods have been shown to be
comparable to a full parameter fine-tuning despite the substantial reduction of
tunable parameters [20].
We compared three different PEFT methods: (1) the Low-Rank Adaptation
(LoRA) [21], which freezes the pretrained model weights and injects trainable
rank decomposition matrices into each layer of the Whisper decoder. We con-
sider a rank r = 32 and a re-scaling factor α = 64 for the matrix factorization
in LoRA [21]. (2) AdaLoRA [22], where the rank decomposition of the weight
matrices is performed adaptively. Critical incremental matrices are assigned a
high rank such that they can capture more fine-grained and task-specific infor-
mation, while less important ones are pruned to a lower rank to prevent over-
fitting and save the computational budget [22]. (3) Finally, in addition to the
rank decomposition-based approaches, we considered the Bias-terms Fine-tuning
(BitFit) strategy [23]. BitFit updates the bias terms in the pre-trained model,
while freezing the remaining parameters of the Whisper decoder. The authors
in [23] showed that fine-tuning only a subset of bias parameters in a Transformer
network is comparable to a full fine-tuning of the model.
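As a rough illustration of these strategies, the sketch below configures LoRA with the hyperparameters stated above (r = 32, α = 64) using the Hugging Face peft library and emulates BitFit with a bias-only freeze; the regular expression restricting the adapters to the decoder attention projections reflects our assumption about module naming in the transformers implementation, not a detail reported in this work.

# A hedged sketch, not the authors' exact training code.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

# LoRA on the decoder attention projections only (module names are an assumption
# about the transformers implementation of Whisper).
lora_config = LoraConfig(r=32, lora_alpha=64,
                         target_modules=r".*decoder.*(q_proj|v_proj)")
lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()

# BitFit-style alternative: freeze everything except the decoder bias terms.
bitfit_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
for name, param in bitfit_model.named_parameters():
    param.requires_grad = name.startswith("model.decoder") and name.endswith("bias")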
2.3 Speech Synthesis
The generation of realistic synthetic speech data was performed through a state-
of-the-art TTS system composed of a Tacotron-2 [24] acoustic model followed by
a HiFi-GAN [25] neural vocoder. Tacotron-2 consists of a sequence-to-sequence
model, which includes an encoder, a decoder, and a final post-processing convolu-
tional neural network. The encoder is fed with embedding representations of the
input characters, generated by a 1D convolutional-recurrent network that is
trained simultaneously with the whole TTS system. The Tacotron-2 models for
English, Spanish, and Basque were trained on pairs of text and their corresponding
acoustic information, represented by audio sampled at 22,050 Hz, using 80-
channel Mel-spectrograms, a frame length of 1024, a time-shift of 256 samples,
and a 1024-resolution Fourier transform. During training, this network learned
to generalize and generate new spectrograms from unseen texts using the exam-
ples given for training. The model was trained using an Adam optimizer [26],
and a learning rate of 10^-3 that exponentially decays to 10^-5 after 50k steps. We
also applied L2 regularization with a weight of 10^-7 and a batch size of 32. The
final number of training steps differed slightly for each model, ranging
between 170k and 190k steps. The English model was trained with
the LJ Speech Dataset [27] composed of 13,100 short audio clips from a single
speaker (23 h and 54 min). The Spanish and Basque models were trained using
mono-speaker proprietary datasets, containing 11,650 (20 h and 46 min) and
11,640 (19 h and 4 min) short audio clips for Spanish and Basque, respectively.
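The acoustic feature configuration described above can be reproduced, for illustration, with torchaudio as in the sketch below; it only mirrors the stated parameters (22,050 Hz, 80 Mel channels, 1024-sample frames, 256-sample shift, 1024-point FFT) and is not the authors' feature-extraction pipeline.

# A minimal sketch of the stated Mel-spectrogram configuration, using torchaudio.
import torch
import torchaudio

mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=22_050,
    n_fft=1024,          # 1024-resolution Fourier transform
    win_length=1024,     # frame length of 1024 samples
    hop_length=256,      # time-shift of 256 samples
    n_mels=80,           # 80-channel Mel-spectrogram
)

waveform = torch.randn(1, 22_050 * 3)                 # placeholder: 3 s of noise
log_mel = torch.log(mel_extractor(waveform).clamp(min=1e-5))
print(log_mel.shape)                                  # (1, 80, n_frames)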
2.4 Methodology
The proposed methodology is shown in Fig. 2. The text data for each domain was
crawled from the Internet to obtain the target vocabulary for recognition. The
crawled corpora were then preprocessed and used as input for the TTS system.
After generating the synthetic speech data, different versions of Whisper were
fine-tuned to obtain domain-specific ASRs. Only the decoder of Whisper was
adapted to learn the target vocabulary and not the acoustic characteristics of
synthetic speech. The evaluation was performed using real acoustic data from
the specific domains.
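A minimal sketch of this decoder-only adaptation, assuming the Hugging Face implementation of Whisper, is shown below; freezing the encoder keeps the acoustic front-end untouched so that fine-tuning on synthetic speech only updates the decoder.

# A minimal sketch: freeze the Whisper encoder so that fine-tuning on synthetic
# speech only updates the decoder (the "language model" part of the network).
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

for param in model.model.encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({trainable / total:.1%})")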
3 Data Description
We considered data in English, Spanish, and Basque with the aim of covering
different scenarios where the original Whisper model was trained with large,
intermediate, and low resource data. For each language, the data used were
specific to a particular domain of application, which included forensics, broadcast
media, and parliament. Table 1 summarizes the main characteristics for each
corpus. Further details about each scenario are found in the following sections.
3.1 Forensic Domain (English)
The experiments for this scenario were performed with the GRACE corpus [28].
This is a multilingual dataset that comprises audio recordings from multiple
sources from the research community. Audios from different public databases
were compiled and filtered according to the presence of 86 keywords related to
child abuse. This study considered only the English version of the dataset, which
comprises 9.2 h of audio recordings from the Spoken Wikipedia corpus [29], the
Debating technology corpus [30], and TEDLIUMv2 [31]. This corpus is available
online1 to be used as a benchmark corpus for speech recognition under forensic
domains.
The text data used for synthesis and fine-tuning of the Whisper model
included crawled documents from EUROPOL2 , UNICEF3 , and Wikipedia arti-
cles related to child abuse. The crawled corpus is composed of 55,059 words
(without stop words) from which 4,571 audio utterances were created (11.7 h).
3.2 Broadcast Media (Spanish)
The data for this experiment considered the test set of the IberSPEECH-RTVE
2022 Speech to Text Transcription Challenge [32]4 . The database is a collection
of 54 h of audio materials from the Spanish national TV (RTVE) archive in
various genres. The corpus covers a wide variety of scenarios of read and spon-
taneous speech, including material from scripted content to live broadcasts.
1 https://shorturl.at/dfjx2.
2 https://www.europol.europa.eu/media-press/newsroom?q=child%20abuse.
3 https://www.unicef.org/search?force=0&query=child+abuse&created%5Bmin%5D=&created%5Bmax%5D=.
4 http://catedrartve.unizar.es/rtvedatabase.html.
4 Results and Discussion
Table 2. Results obtained by fine-tuning each Whisper version using synthetic speech
in English, Spanish, and Basque. The performance is measured in terms of total WER.
For Spanish and Basque, the fine-tuned models reduce the WER between 6.2 and
31 points depending on the model size. For both languages, the
WER reduction is more evident for the case of the smallest models (tiny and
base). For the English language, the fine-tuning process using synthetic speech
does not result in a reduction in WER with respect to the original models,
which is contrary to the results obtained for Spanish and Basque. In some cases,
the fine-tuned version even produces a higher WER than the original one (base
and large). This behavior can be explained by two reasons: (1) the amount of
data used to train the original Whisper models for English is much greater than
the amount considered for Spanish and Basque (see Table 1). (2) The test data
used in English is a compilation of several corpora from the literature, including
TEDLIUMv2 and the Spoken Wikipedia corpus. Information from such corpora
may already be available within the original Whisper weights. Therefore, the
information added to the model via synthetic speech does not contribute to new
knowledge, as in the case of Spanish and Basque.
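For reference, the WER values and absolute reductions discussed here can be computed as in the sketch below, which uses the jiwer package on illustrative placeholder transcripts rather than examples from the evaluated corpora.

# A small sketch of the metric: word error rate (WER) and the absolute
# reduction in points between the original and the fine-tuned model.
# The transcripts are placeholders, not examples from the evaluated corpora.
import jiwer

reference = ["the committee approved the new regulation this morning"]
hyp_original = ["the committee approved a new relegation this morning"]
hyp_finetuned = ["the committee approved the new regulation this morning"]

wer_original = jiwer.wer(reference, hyp_original)
wer_finetuned = jiwer.wer(reference, hyp_finetuned)

print(f"WER (original):   {100 * wer_original:.1f}%")
print(f"WER (fine-tuned): {100 * wer_finetuned:.1f}%")
print(f"Absolute reduction: {100 * (wer_original - wer_finetuned):.1f} points")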
With the aim of comparing different PEFT methods when fine-tuning Whisper,
Table 3 shows results comparing LoRA [21], AdaLoRA [22], and BitFit [23].
The comparison is performed by fine-tuning the medium Whisper model (769 M
parameters). The fine-tuning process for all methods is performed under the
same conditions and using the same hyperparameters for training.
Similar results are observed when using either LoRA or AdaLoRA. Both
approaches are able to reduce the WER compared to the original Whisper
model, especially for Spanish and Basque. This is explained considering that
both approaches rely on the same principle of weight decomposition into low
rank matrices. The main difference is that AdaLoRA is able to achieve the same
results, but fine-tuning less than half the parameters fine-tuned by LoRA. This
leads to less memory consumption (see Fig. 3). BitFit helps to reduce the train-
Table 3. Comparison between different PEFT methods for fine-tuning the medium
version of Whisper. Results are presented in terms of WER and the normalized WER
(nWER) with respect to the original Whisper model.
ing time and the memory costs even further, by fine-tuning only 0.08% of the
weights, but at the cost of sacrificing performance, especially in Basque.
Fig. 3. GPU memory and training speed for the different PEFT methods. All evaluations
are conducted on an NVIDIA GeForce RTX-3090 GPU (24 GB VRAM). Full fine-tuning
is not possible for batch sizes larger than 1 due to memory constraints.
Training speed for BitFit is significantly higher than for the other two meth-
ods (p-value < 0.005 in all cases). The differences in training speed between
LoRA, AdaLoRA, and the full fine-tuning are not significant (p-value > 0.005
in all cases). The statistical comparisons were performed using an ANOVA with
a Tukey Post-Hoc test. Although there are no significant differences in train-
ing speed between LoRA, AdaLoRA, and the full fine-tuning, the memory con-
sumption of the PEFT methods is much lower than that observed for the full
fine-tuning. This makes it possible to use larger batch sizes (either directly or
through gradient accumulation steps), which ultimately translates into a
significant improvement in training speed [20].
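The statistical procedure described above can be reproduced along the lines of the following sketch, which applies a one-way ANOVA and a Tukey HSD post-hoc test to per-run training-speed measurements; the numbers are random placeholders, not the measurements behind Fig. 3.

# A hedged sketch of the statistical comparison: one-way ANOVA followed by a
# Tukey HSD post-hoc test over training-speed measurements per PEFT method.
# The data below are random placeholders, not the paper's measurements.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
speeds = {
    "full":    rng.normal(3.0, 0.2, 30),
    "lora":    rng.normal(3.1, 0.2, 30),
    "adalora": rng.normal(3.0, 0.2, 30),
    "bitfit":  rng.normal(4.5, 0.2, 30),
}

print(f_oneway(*speeds.values()))

values = np.concatenate(list(speeds.values()))
groups = np.concatenate([[name] * len(v) for name, v in speeds.items()])
print(pairwise_tukeyhsd(values, groups))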
5 Conclusion
This paper proposes a methodology for adapting the vocabulary of Whisper-
based speech recognizers to new domains in different languages using only TTS-
derived data. The proposed approach was tested on data from large, interme-
diate, and low resource scenarios in English, Spanish, and Basque languages,
respectively. In addition, we compared different PEFT-based methods to per-
form the fine-tuning process due to the large number of parameters to train.
This study demonstrated that it is possible to improve the performance of an
E2E ASR system using only synthetic data, which is a novel approach. Previous
studies relied on the combination of synthetic and real speech, which still requires
data annotation procedures that can slow down the training process. Using only
synthetic speech data can be a more efficient and cost-effective approach, espe-
cially when there is limited time or resources available for data collection.
The results indicated that using only synthetic data for domain adaptation of
Whisper-based ASRs leads to performance improvements, particularly in low-
resource scenarios. The proposed methodology was successful in reducing the
WER between 6.2 and 31 points, depending on the language and model version.
In addition, we confirm the utility of using PEFT methods to train large models,
which would be difficult to achieve under limited hardware resources. PEFT
methods helped to reduce memory consumption, giving the possibility to use
larger batch sizes, which ultimately lead to more generalized models.
For future work, we will incorporate data augmentation techniques such as
SpecAugment to increase the acoustic variability of the synthetic audios. Addi-
tionally, using more synthetic samples can also help reduce WERs, as the model
can learn from a larger vocabulary. Overall, these approaches can lead to even
better performance and robustness of the ASR system in different domains and
languages. Additional PEFT methods such as those based on Prompt and Pre-
fix tuning [37,38] can also be considered and adapted to fine-tune the decoder of
large acoustic models such as Whisper.
References
1. Li, J., et al.: Recent advances in end-to-end automatic speech recognition. APSIPA
Trans. Sign. Inf. Proc. 11(1) (2022)
2. Baevski, A., et al.: Wav2Vec 2.0: a framework for self-supervised learning of speech
representations. In: NEURIPS, vol. 33, pp. 12449–12460 (2020)
3. Gulati, A., et al.: Conformer: convolution-augmented transformer for speech recog-
nition. In: Proceedings of the INTERSPEECH, pp. 5036–5040 (2020)
4. Radford, A., et al.: Robust speech recognition via large-scale weak supervision.
Technical report, OpenAI (2022)
5. Park, D.S., et al.: SpecAugment: a simple data augmentation method for automatic
speech recognition. In: Proceedings of the INTERSPEECH, pp. 2613–2617 (2019)
6. Li, J., et al.: Training neural speech recognition systems with synthetic speech
augmentation. arXiv preprint arXiv:1811.00707 (2018)
7. Rosenberg, A., et al.: Speech recognition with augmented synthesized speech. In:
Proceedings of the ASRU, pp. 996–1002. IEEE (2019)
8. Laptev, A., et al.: You do not need more data: improving end-to-end speech recog-
nition by text-to-speech data augmentation. In: Proceedings of the CISP-BMEI,
pp. 439–444. IEEE (2020)
9. Rossenbach, N., et al.: Generating synthetic audio data for attention-based speech
recognition systems. In: Proceedings of the ICASSP, pp. 7069–7073. IEEE (2020)
10. Wang, Y., et al.: Style tokens: unsupervised style modeling, control and transfer in
end-to-end speech synthesis. In: Proceedings of the ICML, pp. 5180–5189. PMLR
(2018)
11. Wang, C., et al.: Neural codec language models are zero-shot text to speech syn-
thesizers. arXiv preprint arXiv:2301.02111 (2023)
12. Ueno, S., et al.: Multi-speaker sequence-to-sequence speech synthesis for data aug-
mentation in acoustic-to-word speech recognition. In: Proceedings of the ICASSP,
pp. 6161–6165. IEEE (2019)
13. Zheng, X., Liu, Y., Gunceler, D., Willett, D.: Using synthetic audio to improve the
recognition of out-of-vocabulary words in end-to-end ASR systems. In: Proceedings
of the ICASSP, pp. 5674–5678. IEEE (2021)
14. Fazel, A., et al.: SynthASR: unlocking synthetic data for speech recognition. arXiv
preprint arXiv:2106.07803 (2021)
15. Ueno, S., et al.: Data augmentation for ASR using TTS via a discrete representa-
tion. In: Proceedings of the ASRU, pp. 68–75. IEEE (2021)
16. Qu, L., Weber, C., Wermter, S.: Emphasizing unseen words: new vocabulary acqui-
sition for end-to-end speech recognition. Neural Netw. 161, 494–504 (2023)
17. Hu, T.Y., et al.: Synt++: utilizing imperfect synthetic data to improve speech
recognition. In: Proceedings of the ICASSP, pp. 7682–7686. IEEE (2022)
18. Mimura, M., et al.: Leveraging sequence-to-sequence speech synthesis for enhancing
acoustic-to-word speech recognition. In: Proceedings of the SLT, pp. 477–484. IEEE
(2018)
19. Panayotov, V., et al.: LibriSpeech: an ASR corpus based on public domain audio
books. In: Proceedings of the ICASSP, pp. 5206–5210 (2015)
20. Ding, N., et al.: Parameter-efficient fine-tuning of large-scale pre-trained language
models. Nature Mach. Intell. 5, 1–16 (2023)
21. Hu, E.J., Shen, Y., et al.: LoRA: low-rank adaptation of large language models.
arXiv preprint arXiv:2106.09685 (2021)
22. Zhang, Q., et al.: Adaptive budget allocation for parameter-efficient fine-tuning.
arXiv preprint arXiv:2303.10512 (2023)
23. Zaken, E.B., et al.: BitFit: simple parameter-efficient fine-tuning for transformer-
based masked language-models. arXiv preprint arXiv:2106.10199 (2021)
24. Shen, et al.: Natural TTS synthesis by conditioning WaveNet on MEL spectrogram
predictions. In: Proceedings of the ICASSP, pp. 4779–4783. IEEE (2018)
25. Kong, J., et al.: HiFi-GAN: generative adversarial networks for efficient and high
fidelity speech synthesis. In: Proceedings of the NEURIPS, vol. 33, pp. 17022–17033
(2020)
26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. 2015 ICLR.
arXiv preprint arXiv:1412.6980 (2015)
27. Ito, K., Johnson, L.: The LJ speech dataset (2017). http://keithito.com/LJ-Speech-Dataset/
28. Vásquez-Correa, J.C., Álvarez Muniain, A.: Novel speech recognition systems
applied to forensics within child exploitation: Wav2Vec 2.0 vs. Whisper. Sensors
23(4), 1843 (2023)
29. Baumann, T., et al.: The spoken Wikipedia corpus collection: harvesting, alignment
and an application to hyperlistening. Lang. Resour. Eval. 53(2), 303–329 (2019)
30. Mirkin, S., et al.: A recorded debating dataset. In: Proceedings of the LREC, pp.
250–254 (2017)
31. Rousseau, A., et al.: Enhancing the TED-LIUM corpus with selected data for
language modeling and more TED talks. In: Proceedings of the LREC, pp. 3935–
3939 (2014)
32. Lleida, E., et al.: Albayzin evaluation: IberSPEECH-RTVE 2022 speech to text
transcription challenge (2022)
33. Dinkel, H., et al.: Voice activity detection in the wild: a data-driven approach
using teacher-student training. IEEE/ACM Trans. Audio, Speech Lang. Process.
29, 1542–1555 (2021)
34. Gemmeke, J., et al.: Audio set: an ontology and human-labeled dataset for audio
events. In: Proceedings of the ICASSP, pp. 776–780 (2017)
35. Arzelus, H., et al.: The Vicomtech-UPM speech transcription systems for the
albayzın-rtve 2022 speech to text transcription challenge. In: Proceedings of the
IberSPEECH, pp. 266–270 (2022)
36. Etchegoyhen, T., et al.: mintzai-ST: corpus and baselines for Basque-Spanish speech
translation. In: Proceedings of the IberSPEECH, pp. 1–5 (2021)
37. Liu, X., et al.: P-tuning v2: prompt tuning can be comparable to fine-tuning uni-
versally across scales and tasks. arXiv preprint arXiv:2110.07602 (2021)
38. Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation.
In: Proceedings of the ACL, pp. 4582–4597 (2021)