
When Whisper Meets TTS: Domain Adaptation Using Only Synthetic Speech Data

Juan Camilo Vásquez-Correa1, Haritz Arzelus1, Juan M. Martin-Doñas1, Joaquin Arellano1, Ander Gonzalez-Docasal1,2, and Aitor Álvarez1

1 Fundacion Vicomtech, Basque Research and Technology Alliance (BRTA), Mikeletegi 57, 20009 Donostia - San Sebastian, Spain
[email protected]
2 University of Zaragoza, Department of Electronics, Engineering and Communications, Pedro Cerbuna 12, 50009 Zaragoza, Spain

Abstract. Automatic Speech Recognition is among the most important areas of Artificial Intelligence research today. One of the most notable advances in this area is the development of end-to-end models, which have shown state-of-the-art performance in many benchmark scenarios. In spite of these recent improvements, such architectures still require large amounts of transcribed speech data for training, which can be challenging to obtain in low-resource languages or in specific domains due to privacy concerns. This study proposes a methodology to fine-tune Whisper-based models using only synthetic speech. The aim is to enable training robust systems for specific domains and low-resource languages, where large labeled corpora are difficult to collect. Our approach is based on language model adaptation: only the decoder of the model is fine-tuned, so the network is able to learn specific vocabulary that is not initially available. The proposed methodology is evaluated with data from different languages and domains. In addition, Parameter Efficient Fine-Tuning strategies were used to efficiently adapt the large pre-trained Whisper models. This is one of the first studies that considers the effect of using only synthetic speech for domain adaptation of speech recognition systems on non-English data, providing word error rate reductions in low-resource languages of between 2 and 30 points, depending on the Whisper version.

Keywords: Speech Recognition · Whisper · Text to Speech · Domain Adaptation · Parameter Efficient Fine-Tuning

1 Introduction

In recent years, there have been significant advances in Automatic Speech Recog-
nition (ASR) using End-to-End (E2E) models [1]. One of the most notable
outcomes in this area is the development of Transformer-based architectures,
such as Wav2Vec2.0 [2], the Conformer networks [3], and more recently, fully
supervised models like Whisper [4]. These models have achieved state-of-the-
art performance on a variety of speech recognition benchmarks, including those
that involve noisy or accented speech. Although the results observed to date
are impressive, several challenges still remain for the research community. For
instance, these E2E architectures still require large amounts of transcribed
speech data to be trained and to reach good performance. Furthermore, spe-
cific domains such as health care, forensics, multimedia, or government, among
others, may face limited data availability due to privacy concerns or difficulties in
data collection. These scenarios pose challenges due to non-controlled acoustic
conditions and domain-specific vocabulary. Addressing these problems is par-
ticularly challenging for low-resource languages where large labeled corpora are
scarce for training ASR systems. Therefore, training robust ASR models for spe-
cific domains, low resource languages, and in non-controlled acoustic conditions
becomes a difficult task.
One strategy to adapt ASR systems to specific domains, especially under low
resource settings, is to use data augmentation strategies to artificially increase
the size of the training data. Methods like SpecAugment [5] or those based on
speed perturbation and noise injection have been shown to help in adapting ASR
models to specific domains. However, these methods only focus on adapting the
ASR system to the specific acoustic conditions of the target domain, overlooking
the challenge of domain-specific vocabulary.
Recent studies have demonstrated that it is possible to perform data augmen-
tation, or even full training of ASR systems, using synthetic data obtained from
Text-To-Speech (TTS) systems [6–9]. Given the improvements in neural TTS
models, such as Tacotron-2 with Global State Tokens [10] and more recently
VALL-E [11], it is possible to generate high quality speech with varying prosody
that can be used also to train and adapt E2E ASR models. The use of synthetic
speech to fine-tune ASR systems can be particularly helpful to deal with the
issue of out-of-vocabulary words and to expand the vocabulary of E2E systems
during training [6,12,13]. In [14], the authors use synthetic speech to teach med-
ication names to an E2E ASR system based on a Recurrent Neural Networks
Transducer (RNN-T). The training process involved mixing real and synthetic
samples. The fine-tuned model reduced the word error rate (WER) by up to 65% relative when recognizing out-of-vocabulary words related to medication
names. Additional studies have shown that combining real and synthetic speech
data during training and fine-tuning can reduce the WER of an ASR system [6–
9,15,16]. However, there are important considerations to address in those cases
to achieve accurate results. For instance, when using synthetic speech for ASR
training, it is important to deal with the mismatch in acoustic characteristics
between real and synthetic audio. Synthetic speech may contain artifacts that
do not exist in real data, such as unrealistic speaking styles and the absence of
background noise. Some studies have mitigated this issue by implementing reg-
ularization strategies [13,17] and freezing the encoder of the E2E model during
fine-tuning [12,13]. The process of fine-tuning only the decoder of the network is
similar to adapting a language model. Therefore, the model only learns the token
representation of out-of-vocabulary words, rather than the acoustic properties
of synthetic speech [18].
These previous works have proven the benefits of increased acoustic and lex-
ical diversity in synthetic data for ASR training. Nevertheless, most of them are
evaluated using standard benchmark corpora, such as Librispeech [19]. Moreover,
in the majority of cases, TTS-derived data is used only for data augmentation,
rather than to adapt the model to new, unseen domains, particularly in low
resource languages, where there is a real need to adapt E2E ASR models.
Finally, most of the previous studies have focused on mixing real and synthetic
audio data, and have not shown reliable results using only synthetic speech [7,17].
This study extends all previous research by using only synthetic data for
domain adaptation of E2E ASR models. Our approach is motivated by the recent
release of Whisper [4], which was pre-trained with large amounts of labeled data
from the Internet (up to 680k h). This leads us to believe that it is now possible
to fine-tune models using only synthetic speech, making the domain adapta-
tion tasks more feasible, especially in low resource languages. We considered a
state-of-the-art TTS system to generate realistic speech signals in order to build adapted Whisper models for a variety of domains, including forensics, broadcast media, and parliamentary proceedings. Our proposed methodology was also evaluated
in different languages to test the effect of using synthetic speech to adapt pre-
trained models with large, intermediate, and low resource languages such as
English, Spanish, and Basque, respectively. To the best of our knowledge, this
is one of the first studies to consider the effect of using only synthetic speech in
ASR model adaptation for non-English data. An additional contribution of this paper lies in the evaluation and comparison of different Parameter Efficient Fine-Tuning (PEFT) methods [20] when training large Transformer-based models. PEFT-based approaches focus on fine-tuning only a small number of model parameters, thereby greatly decreasing the computational and storage costs. The use of these strategies has not been extensively explored for speech-based models, but it is an important aspect to consider when training large models such as Whisper.
The rest of the paper is organized as follows. Section 2 describes the methods
and strategies considered to adapt an ASR system based on Whisper to new
unknown domains in different languages. Section 3 describes the different corpora
considered in this study to train and evaluate the proposed approach. Section 4
shows the main results obtained and discusses the main insights derived from the
performed experiments. Finally, Sect. 5 draws the main conclusions and presents
further perspectives to be addressed.

2 Methods
2.1 Whisper
Whisper is an encoder-decoder Transformer network recently introduced by Ope-
nAI [4]. The model is trained in a fully supervised manner, using up to 680k h
of labeled speech data from multiple sources. The encoder is fed by 80-channel
log-Mel spectrograms and consists of two convolution layers (kernel size of 3), followed by sinusoidal positional encoding, and a stacked set of Transformer blocks. The decoder uses learned positional embeddings and the same number of Transformer blocks as the encoder (see Fig. 1).

Fig. 1. Whisper architecture representation. The log Mel-spectrograms are encoded by a Transformer network. Encoded representations are transformed into character outputs and non-speech tokens via the Transformer decoder. Figure inspired by [4].

Five pre-trained versions of Whisper are available, with variations in the
number of layers (ranging from 4 to 32) and attention heads (ranging from 6 to
20). These configurations yield models with 39 M to 1550 M parameters. The con-
ducted experiments involved fine-tuning the five versions of the model to evaluate
their capability to be adapted to the target domain. The process is performed by freezing the weights of the encoder network, similar to previous studies [13,18].
Freezing the encoder aims to mitigate the mismatch in acoustic characteristics
between real and synthetic audio, which can be problematic for model train-
ing and fine-tuning. Hence, the learning process focuses only on adapting the
vocabulary to the E2E system, similar to a language model adaptation.
The hyperparameters for fine-tuning included a learning rate of 5 × 10⁻⁵, warmed up during the initial 10% of training, and a batch size of 16 (using gradient accumulation steps due to memory constraints). Decoding was performed using a beam search strategy with 5 beams, an array of temperature weights of [0.2, 0.4, 0.6, 0.8, 1], and a no-repeat 3-gram strategy to avoid loops [4].
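As an illustration, the following minimal sketch shows how such a decoder-only fine-tuning setup could be configured with the Hugging Face transformers library; the checkpoint name, the training-loop arguments, and the way the decoding options are grouped are illustrative assumptions rather than the authors' exact code.

from transformers import WhisperForConditionalGeneration

# Load a pre-trained checkpoint (the "medium" size is used here as an example).
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

# Freeze the encoder so that only the decoder adapts to the new vocabulary,
# avoiding the acoustic mismatch between real and synthetic speech.
for param in model.model.encoder.parameters():
    param.requires_grad = False

# Training hyperparameters reported above (batch size 16 reached through
# gradient accumulation); these values feed any standard training loop.
training_config = dict(
    learning_rate=5e-5,
    warmup_ratio=0.10,                # warm-up over the first 10% of steps
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,    # effective batch size of 16
)

# Decoding configuration used for evaluation.
generation_config = dict(
    num_beams=5,
    temperature=(0.2, 0.4, 0.6, 0.8, 1.0),  # temperature fallback schedule
    no_repeat_ngram_size=3,                 # avoid repetition loops
)
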
2.2 Parameter Efficient Fine-Tuning

Due to the large number of parameters to fine-tune, especially for the Large version of Whisper (1550 M), a set of PEFT strategies was applied. These methods aim to fine-tune only a small number of model parameters, decreasing the computational and storage costs [20]. In general, PEFT methods have been shown to be comparable to full parameter fine-tuning despite the substantial reduction of tunable parameters [20].
We compared three different PEFT methods: (1) the Low-Rank Adaptation
(LoRA) [21], which freezes the pretrained model weights and injects trainable
rank decomposition matrices into each layer of the Whisper decoder. We con-
sider a rank r = 32 and a re-scaling factor α = 64 for the matrix factorization
in LoRA [21]. (2) AdaLoRA [22], where the rank decomposition of the weight
matrices is performed adaptively. Critical incremental matrices are assigned with
high rank such that they can capture more fine-grained and task-specific infor-
mation. Less important ones are pruned to have lower rank to prevent over-
fitting and save the computational budget [22]. (3) Finally, in addition to the
rank decomposition-based approaches, we considered the Bias-terms Fine-tuning
(BitFit) strategy [23]. BitFit updates the bias terms in the pre-trained model,
while freezing the remaining parameters of the Whisper decoder. The authors
in [23] showed that fine-tuning only a subset of bias parameters in a Transformer
network is comparable to a full fine-tuning of the model.
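As a rough illustration, the sketch below shows how these strategies could be set up with the Hugging Face peft library; the checkpoint name, the module-name regex used to restrict LoRA to the decoder, and the manual BitFit emulation are our assumptions, not details taken from the paper.

from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")

# (1) LoRA with rank r = 32 and re-scaling factor alpha = 64. The regex below
# restricts the adapters to the decoder attention projections; the module
# naming pattern is an assumption about the Whisper implementation.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=r".*decoder.*(q_proj|v_proj)",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # only ~1% of the weights remain trainable

# (2) AdaLoRA follows the same pattern via peft's AdaLoraConfig, which
# re-allocates the rank budget across the injected matrices during training.

# (3) BitFit has no dedicated config in peft; on a freshly loaded model it can
# be emulated by unfreezing only the bias terms of the decoder.
bitfit_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-large-v2")
for name, param in bitfit_model.named_parameters():
    param.requires_grad = name.startswith("model.decoder") and name.endswith("bias")
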

2.3 Text to Speech Models

The generation of realistic synthetic speech data was performed through a state-of-the-art TTS system composed of a Tacotron-2 [24] acoustic model followed by
a HiFi-GAN [25] neural vocoder. Tacotron-2 consists of a sequence-to-sequence
model, which includes an encoder, a decoder, and a final post-processing convolu-
tional neural network. The encoder is fed with embedding representations of the input characters, generated by a 1D convolutional-recurrent network that is trained simultaneously with the whole TTS system. The Tacotron-2 models for English, Spanish, and Basque were trained on pairs of text and corresponding acoustic information, represented by audio sampled at 22,050 Hz, using 80-channel Mel-spectrograms, a frame length of 1024, a time-shift of 256 samples,
and a 1024-resolution Fourier transform. During training, this network learned
to generalize and generate new spectrograms from unseen texts using the exam-
ples given for training. The model was trained using an Adam optimizer [26],
and a learning rate of 10−3 that exponentially decays to 10−5 after 50k steps. We
also applied L2 regularization with a weight of 10−7 and a batch-size of 32. The
final training steps were slightly different for each model, although they were
established between 170k and 190k steps. The English model was trained with
the LJ Speech Dataset [27], composed of 13,100 short audio clips from a single speaker (23 h and 54 min). The Spanish and Basque models were trained using mono-speaker proprietary datasets, containing 11,650 (20 h and 46 min) and 11,640 (19 h and 4 min) short audio clips for Spanish and Basque, respectively.

The Tacotron-2 model is combined with a HiFi-GAN [25] neural vocoder that receives the spectrograms generated by the acoustic model and produces the final waveform. Each model was trained and adapted to each target voice by using a set of ground-truth aligned Mel spectrograms, compatible with the Tacotron-2 model. The vocoder was trained using a learning rate of 2 × 10⁻⁴ with a decaying factor of 0.999, whilst the batch size was set to 16.
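For reference, a minimal sketch of this spectral front end using torchaudio is given below; the file name is a placeholder and the window and normalization choices are assumptions, since the paper only specifies the sampling rate, mel channels, frame length, time shift, and FFT size.

import torchaudio
import torchaudio.transforms as T

waveform, sr = torchaudio.load("synthetic_utterance.wav")  # placeholder file name
assert sr == 22050

mel_transform = T.MelSpectrogram(
    sample_rate=22050,
    n_fft=1024,       # 1024-point Fourier transform
    win_length=1024,  # frame length of 1024 samples
    hop_length=256,   # time shift of 256 samples
    n_mels=80,        # 80 mel channels
)
mel = mel_transform(waveform)  # shape: (channels, 80, frames)
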

2.4 Methodology
The proposed methodology is shown in Fig. 2. The text data for each domain was
crawled from the Internet to obtain the target vocabulary for recognition. The
crawled corpora were then preprocessed and used as input for the TTS system.
After generating the synthetic speech data, different versions of Whisper were
fine-tuned to obtain domain-specific ASRs. Only the decoder of Whisper was
adapted to learn the target vocabulary and not the acoustic characteristics of
synthetic speech. The evaluation was performed using real acoustic data from
the specific domains.
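At a high level, the pipeline can be summarized with the following sketch, in which all the crawling, cleaning, synthesis, and fine-tuning callables are supplied by the caller; none of them are APIs described in the paper.

def adapt_whisper_to_domain(seed_urls, crawl, clean, synthesize, finetune_decoder):
    """Sketch of Fig. 2: every callable argument is a hypothetical placeholder."""
    sentences = clean(crawl(seed_urls))                         # crawl and preprocess in-domain text
    synthetic_corpus = [(synthesize(s), s) for s in sentences]  # (audio, transcript) training pairs
    return finetune_decoder(synthetic_corpus)                   # decoder-only fine-tuning of Whisper
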

Fig. 2. Methodology to fine-tune Whisper ASR systems using synthetic data.

Three scenarios were considered to cover specific domains in forensics, broadcast media, and parliamentary proceedings. These scenarios often include specialized terminology that a general-purpose ASR system may not be able to recognize accurately. At the same time, each domain contains data in a different language. The considered languages were selected to cover large, intermediate, and low resource scenarios [4].

3 Data Description
We considered data in English, Spanish, and Basque with the aim to cover different scenarios where the original Whisper model was trained with large, intermediate, and low resource data. For each language, the data used were specific to a particular domain of application, which included forensics, broadcast media, and parliament. Table 1 summarizes the main characteristics for each corpus. Further details about each scenario are found in the following sections.

Table 1. Data distribution for each corpus.

                               English    Spanish          Basque
Domain                         Forensics  Broadcast media  Parliament
Hours for pre-training [4]     438,200    11,100           21
Hours for fine-tuning          11.7       20.4             26.7
Hours for test                 10         54               7.2
Tokens for fine-tuning         99,470     141,236          187,975
Unique tokens for fine-tuning  7,473      18,393           30,471

3.1 English Data. Forensic Child Abuse Analysis Domain

The experiments for this scenario were performed with the GRACE corpus [28].
This is a multilingual dataset that comprises audio recordings from multiple
sources from the research community. Audios from different public databases
were compiled and filtered according to the presence of 86 keywords related to
child abuse. This study considered only the English version of the dataset, which
comprises 9.2 h of audio recordings from the Spoken Wikipedia corpus [29], the
Debating technology corpus [30], and TEDLIUMv2 [31]. This corpus is available
online1 to be used as a benchmark corpus for speech recognition under forensic
domains.
The text data used for synthesis and fine-tuning of the Whisper model
included crawled documents from EUROPOL2, UNICEF3, and Wikipedia arti-
cles related to child abuse. The crawled corpus is composed of 55,059 words
(without stop words) from which 4,571 audio utterances were created (11.7 h).

3.2 Spanish Data. Broadcast Media Domain

The data for this experiment considered the test set of the IberSPEECH-RTVE
2022 Speech to Text Transcription Challenge [32]4. The database is a collection
of 54 h of audio materials from the Spanish national TV (RTVE) archive in
various genres. The corpus covers a wide variety of scenarios of read and spon-
taneous speech, including material from scripted content to live broadcasts. The corpus incorporates a diverse range of content, such as fiction series, contest shows, social and cultural documentaries, unedited live interviews, and newscasts, among others. This corpus is not directly segmented and contains challenging acoustic conditions (e.g., background music and noise). Therefore, we first applied a Voice Activity Detection (VAD) module based on the GPVAD convolutional-recurrent architecture [33], trained with 5k h from the Google AudioSet database [34]. The GPVAD raw output is further processed by joining small speech regions separated by less than two seconds of non-speech, thus generating speech segments longer than five seconds when possible, as sketched below.
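A minimal sketch of this merging rule, assuming the VAD output is a sorted list of (start, end) times in seconds; the exact post-processing code used in the paper may differ.

def merge_vad_segments(regions, max_gap=2.0):
    """Bridge non-speech gaps shorter than max_gap seconds between speech regions,
    which in practice yields segments longer than ~5 s whenever the raw VAD
    output allows it."""
    merged = []
    for start, end in regions:
        if merged and start - merged[-1][1] < max_gap:
            merged[-1] = (merged[-1][0], end)  # bridge the short non-speech gap
        else:
            merged.append((start, end))
    return merged

print(merge_vad_segments([(0.0, 3.1), (4.2, 9.8), (15.0, 16.2)]))
# -> [(0.0, 9.8), (15.0, 16.2)]
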
The training text material to generate synthetic data for this domain covered
TV subtitles from the RTVE Play web portal5 and news crawled from the RTVE
website6 [35]. We also collected news gathered from digital newspapers in the
Internet in order to generalize to other news formats and improve generaliza-
tion. A total of 17,524 sentences covering different news topics were synthesized,
forming a corpus with 20.4 h duration.

1 https://shorturl.at/dfjx2.
2 https://www.europol.europa.eu/media-press/newsroom?q=child%20abuse.
3 https://www.unicef.org/search?force=0&query=child+abuse&created%5Bmin%5D=&created%5Bmax%5D=.
4 http://catedrartve.unizar.es/rtvedatabase.html.

3.3 Basque Data. Parliament Domain


The Mintzai corpus7 was considered for this scenario. This dataset consists of
parliamentary sessions of the Basque government between 2011 and 2018. The
corpus was originally designed for speech translation studies, and contains par-
allel utterances in Basque and Spanish [36]. The considered experiments cover
only Basque data. The test set includes audio from 127 speakers (7.2 h).
The training corpus to be synthesized for this domain was obtained by crawling the websites where the official plenary sessions of the parliament are available8. Texts from the sessions between 2012 and 2020 were downloaded as PDF files and converted to plain text using the pdftotext Linux tool. A set of 13,910 sentences was generated, forming a training corpus of 26.7 h duration.
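As an illustration, the extraction step could look like the following sketch; it assumes the standard pdftotext command-line tool is installed, and the file name and the naive sentence splitter are placeholders rather than the actual preprocessing used in the paper.

import subprocess
from pathlib import Path

def pdf_to_sentences(pdf_path):
    txt_path = Path(pdf_path).with_suffix(".txt")
    # Equivalent to: pdftotext input.pdf output.txt
    subprocess.run(["pdftotext", str(pdf_path), str(txt_path)], check=True)
    text = txt_path.read_text(encoding="utf-8")
    # Naive sentence split; a language-specific tokenizer would be preferable.
    return [s.strip() for s in text.replace("\n", " ").split(".") if s.strip()]

sentences = pdf_to_sentences("plenary_session_2015.pdf")  # placeholder file name
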

4 Results and Discussion


Table 2 shows the results obtained by fine-tuning the different Whisper models
using synthetic speech. The results include those obtained for large (English),
intermediate (Spanish), and low (Basque) resource languages. The fine-tuning
process in this case was performed using LoRA as the PEFT method.

5 https://www.rtve.es/play/.
6 https://www.rtve.es/noticias.
7 https://github.com/Vicomtech/mintzai-ST.
8 http://www.legebiltzarra.eus.

Table 2. Results obtained by fine-tuning each Whisper version using synthetic speech in English, Spanish, and Basque. The performance is measured in terms of total WER.

Model version   Fine-tuning   WER English   WER Spanish   WER Basque
tiny            No            21.8          38.6          94.8
tiny            Yes           22.4          34.2          69.8
base            No            20.0          30.7          91.5
base            Yes           21.7          24.5          60.4
small           No            18.6          24.3          73.3
small           Yes           19.4          23.9          55.4
medium          No            18.0          22.1          61.3
medium          Yes           17.9          20.5          53.8
large           No            17.6          16.1          59.7
large           Yes           17.8          14.9          50.9

The experiments confirm the reliability of the proposed methodology to adapt Whisper to new domains using synthetic data, especially when the amount of pre-training data is intermediate or low, as is the case for Spanish and Basque. For Spanish, the fine-tuning process reduces the WER between 0.4 and 6.2 points compared to the original models, depending on the Whisper version. For Basque, the fine-tuning process has a greater impact, reducing the WER
between 6.2 and 31 points depending on the model size. For both languages, the
WER reduction is more evident for the case of the smallest models (tiny and
base). For the English language, the fine-tuning process using synthetic speech
does not result in a reduction in WER with respect to the original models,
which is contrary to the results obtained for Spanish and Basque. In some cases,
the fine-tuned version even produces a higher WER than the original one (base
and large). This behavior can be explained by two reasons: (1) the amount of
data used to train the original Whisper models for English is much greater than
the amount considered for Spanish and Basque (see Table 1). (2) The test data
used in English is a compilation of several corpora from the literature, including
TEDLIUMv2 and the Spoken Wikipedia corpus. Information from such corpora
may already be available within the original Whisper weights. Therefore, the
information added to the model via synthetic speech does not contribute new knowledge, as it does in the case of Spanish and Basque.
To compare different PEFT methods when fine-tuning Whisper, Table 3 shows results for LoRA [21], AdaLoRA [22], and BitFit [23]. The comparison is performed by fine-tuning the medium Whisper model (769 M parameters). The fine-tuning process for all methods is performed under the same conditions and using the same hyperparameters.

Table 3. Comparison between different PEFT methods for fine-tuning the medium version of Whisper. Results are presented in terms of WER and the normalized WER (nWER) with respect to the original Whisper model.

PEFT method      Trainable params.   Trainable %   English WER / nWER   Spanish WER / nWER   Basque WER / nWER
Not fine-tuned   -                   -             18.0 / 100.0         22.1 / 100.0         61.3 / 100.0
LoRA [21]        9.4M                1.22          17.9 / 99.4          20.5 / 92.8          53.8 / 87.8
AdaLoRA [22]     3.5M                0.46          17.8 / 98.9          20.1 / 91.0          52.9 / 86.6
BitFit [23]      0.6M                0.08          17.9 / 99.4          21.8 / 98.6          59.8 / 97.6

Similar results are observed when using either LoRA or AdaLoRA. Both approaches are able to reduce the WER compared to the original Whisper model, especially for Spanish and Basque. This is explained by the fact that both approaches rely on the same principle of weight decomposition into low-rank matrices. The main difference is that AdaLoRA achieves the same results while fine-tuning less than half the parameters fine-tuned by LoRA, which leads to lower memory consumption (see Fig. 3). BitFit helps to reduce the train-
ing time and the memory costs even further, by fine-tuning only 0.08% of the
weights, but at the cost of sacrificing performance, especially in Basque.
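As a small illustration of the nWER column in Table 3, which expresses the fine-tuned WER as a percentage of the not-fine-tuned WER, the sketch below also shows one common way to compute WER itself with the jiwer package; the example strings are purely illustrative.

import jiwer

def nwer(wer_finetuned, wer_original):
    # WER of the adapted model relative to the original (not fine-tuned) Whisper.
    return 100.0 * wer_finetuned / wer_original

print(round(nwer(53.8, 61.3), 1))  # LoRA on Basque: 87.8, as reported in Table 3

# WER is computed from reference transcripts and ASR hypotheses, e.g.:
print(jiwer.wer("the quick brown fox", "the quick fox"))  # 0.25 (one deletion out of four words)
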

Fig. 3. GPU memory and train speed for the different PEFT methods. All evaluations
are conducted on a NVIDIA GeForce RTX-3090 GPU (24GB VRAM). Full fine-tuning
is not possible for batch sizes larger than 1 due to memory constraints.

Training speed for BitFit is significantly higher than for the other two methods (p-value < 0.005 in all cases). The differences in training speed between LoRA, AdaLoRA, and full fine-tuning are not significant (p-value > 0.005 in all cases). The statistical comparisons were performed using an ANOVA with a Tukey post-hoc test. Although there are no significant differences in training speed between LoRA, AdaLoRA, and full fine-tuning, the memory consumption of the PEFT methods is much lower than that observed for full fine-tuning. This makes it possible to use larger batch sizes (either directly or using gradient accumulation steps), which ultimately translates into a significant improvement in training speed [20].
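A minimal sketch of this statistical comparison (one-way ANOVA followed by a Tukey post-hoc test) using scipy and statsmodels; the speed measurements below are purely illustrative, not the values obtained in the experiments.

import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

speeds = {  # training-speed samples per method (illustrative numbers only)
    "LoRA":    [3.1, 3.0, 3.2, 3.1],
    "AdaLoRA": [3.0, 3.1, 3.0, 3.2],
    "BitFit":  [4.4, 4.5, 4.3, 4.6],
}

f_stat, p_value = f_oneway(*speeds.values())
print(f"ANOVA p-value: {p_value:.4g}")

values = np.concatenate([np.asarray(v) for v in speeds.values()])
labels = np.repeat(list(speeds.keys()), [len(v) for v in speeds.values()])
print(pairwise_tukeyhsd(values, labels, alpha=0.005).summary())
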

5 Conclusion
This paper proposes a methodology for adapting the vocabulary of Whisper-
based speech recognizers to new domains in different languages using only TTS-
derived data. The proposed approach was tested on data from large, interme-
diate, and low resource scenarios in English, Spanish, and Basque languages,
respectively. In addition, we compared different PEFT-based methods to per-
form the fine-tuning process due to the large number of parameters to train.
This study demonstrated that it is possible to improve the performance of an
E2E ASR system using only synthetic data, which is a novel approach. Previous
studies relied on the combination of synthetic and real speech, which still requires
data annotation procedures that can slow down the training process. Using only
synthetic speech data can be a more efficient and cost-effective approach, espe-
cially when there is limited time or resources available for data collection.
The results indicated that using only synthetic data for domain adaptation of
Whisper-based ASRs leads to performance improvements, particularly in low-
resource scenarios. The proposed methodology was successful in reducing the
WER between 6.2 and 31 points, depending on the language and model version.
In addition, we confirm the utility of using PEFT methods to train large models,
which would be difficult to achieve under limited hardware resources. PEFT
methods helped to reduce memory consumption, making it possible to use larger batch sizes, which ultimately leads to more generalized models.
For future work, we will incorporate data augmentation techniques such as
SpecAugment to increase the acoustic variability of the synthetic audios. Addi-
tionally, using more synthetic samples can also help reduce WERs, as the model
can learn from a larger vocabulary. Overall, these approaches can lead to even
better performance and robustness of the ASR system in different domains and
languages. Additional PEFT methods, such as those based on Prompt and Prefix tuning [37,38], can also be considered and adapted to fine-tune the decoder of large acoustic models such as Whisper.

References
1. Li, J., et al.: Recent advances in end-to-end automatic speech recognition. APSIPA
Trans. Sign. Inf. Proc. 11(1) (2022)
2. Baevski, A., et al.: Wav2Vec 2.0: a framework for self-supervised learning of speech
representations. In: NEURIPS, vol. 33, pp. 12449–12460 (2020)
3. Gulati, A., et al.: Conformer: convolution-augmented transformer for speech recog-
nition. In: Proceedings of the INTERSPEECH, pp. 5036–5040 (2020)
4. Radford, A., et al.: Robust speech recognition via large-scale weak supervision.
Technical report, OpenAI (2022)
5. Park, D.S., et al.: SpecAugment: a simple data augmentation method for automatic
speech recognition. In: Proceedings of the INTERSPEECH, pp. 2613–2617 (2019)
6. Li, J., et al.: Training neural speech recognition systems with synthetic speech
augmentation. arXiv preprint arXiv:1811.00707 (2018)
7. Rosenberg, A., et al.: Speech recognition with augmented synthesized speech. In:
Proceedings of the ASRU, pp. 996–1002. IEEE (2019)
8. Laptev, A., et al.: You do not need more data: improving end-to-end speech recog-
nition by text-to-speech data augmentation. In: Proceedings of the CISP-BMEI,
pp. 439–444. IEEE (2020)
9. Rossenbach, N., et al.: Generating synthetic audio data for attention-based speech
recognition systems. In: Proceedings of the ICASSP, pp. 7069–7073. IEEE (2020)
10. Wang, Y., et al.: Style tokens: unsupervised style modeling, control and transfer in
end-to-end speech synthesis. In Proceedings of the ICML, pp. 5180–5189. PMLR
(2018)
11. Wang, C., et al.: Neural codec language models are zero-shot text to speech syn-
thesizers. arXiv preprint arXiv:2301.02111 (2023)
12. Ueno, S., et al.: Multi-speaker sequence-to-sequence speech synthesis for data aug-
mentation in acoustic-to-word speech recognition. In: Proceedings of the ICASSP,
pp. 6161–6165. IEEE (2019)
13. Zheng, X., Liu, Y., Gunceler, D., Willett, D.: Using synthetic audio to improve the
recognition of out-of-vocabulary words in end-to-end ASR systems. In: Proceedings
of the ICASSP, pp. 5674–5678. IEEE (2021)
14. Fazel, A., et al.: SynthASR: unlocking synthetic data for speech recognition. arXiv
preprint arXiv:2106.07803 (2021)
15. Ueno, S., et al.: Data augmentation for ASR using TTS via a discrete representa-
tion. In: Proceedings of the ASRU, pp. 68–75. IEEE (2021)
16. Qu, L., Weber, C., Wermter, S.: Emphasizing unseen words: new vocabulary acqui-
sition for end-to-end speech recognition. Neural Netw. 161, 494–504 (2023)
17. Hu, T.Y., et al.: Synt++: utilizing imperfect synthetic data to improve speech
recognition. In: Proceedings of the ICASSP, pp. 7682–7686. IEEE (2022)
18. Mimura, M., et al.: Leveraging sequence-to-sequence speech synthesis for enhancing
acoustic-to-word speech recognition. In: Proceedings of the SLT, pp. 477–484. IEEE
(2018)
19. Panayotov, V., et al.: LibriSpeech: an ASR corpus based on public domain audio
books. In: Proceedings of the ICASSP, pp. 5206–5210 (2015)
20. Ding, N., et al.: Parameter-efficient fine-tuning of large-scale pre-trained language
models. Nature Mach. Intell. 5, 1–16 (2023)
21. Hu, E.J., Shen, Y., et al.: LoRA: low-rank adaptation of large language models.
arXiv preprint arXiv:2106.09685 (2021)
22. Zhang, Q., et al.: Adaptive budget allocation for parameter-efficient fine-tuning.
arXiv preprint arXiv:2303.10512 (2023)
23. Zaken, E.B., et al.: BitFit: simple parameter-efficient fine-tuning for transformer-
based masked language-models. arXiv preprint arXiv:2106.10199 (2021)
24. Shen, et al.: Natural TTS synthesis by conditioning WaveNet on MEL spectrogram
predictions. In: Proceedings of the ICASSP, pp. 4779–4783. IEEE (2018)
25. Kong, J., et al.: HiFi-GAN: generative adversarial networks for efficient and high-fidelity speech synthesis. In: Proceedings of the NEURIPS, vol. 33, pp. 17022–17033
(2020)
26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. 2015 ICLR.
arXiv preprint arXiv:1412.6980 (2015)
27. Ito, K., Johnson, L.: The LJ Speech Dataset (2017). https://keithito.com/LJ-Speech-Dataset/
28. Vásquez-Correa, J.C., Álvarez Muniain, A.: Novel speech recognition systems
applied to forensics within child exploitation: Wav2Vec 2.0 vs. Whisper. Sensors
23(4), 1843 (2023)
29. Baumann, T., et al.: The spoken Wikipedia corpus collection: harvesting, alignment
and an application to hyperlistening. Lang. Resour. Eval. 53(2), 303–329 (2019)
30. Mirkin, S., et al.: A recorded debating dataset. In: Proceedings of the LREC, pp.
250–254 (2017)
31. Rousseau, A., et al.: Enhancing the TED-LIUM corpus with selected data for
language modeling and more TED talks. In: Proceedings of the LREC, pp. 3935–
3939 (2014)
32. Lleida, E., et al.: Albayzin evaluation: IberSPEECH-RTVE 2022 speech to text
transcription challenge (2022)
33. Dinkel, H., et al.: Voice activity detection in the wild: a data-driven approach
using teacher-student training. IEEE/ACM Trans. Audio, Speech Lang. Process.
29, 1542–1555 (2021)
34. Gemmeke, J., et al.: Audio set: an ontology and human-labeled dataset for audio
events. In: Proceedings of the ICASSP, pp. 776–780 (2017)
35. Arzelus, H., et al.: The Vicomtech-UPM speech transcription systems for the
albayzın-rtve 2022 speech to text transcription challenge. In: Proceedings of the
IberSPEECH, pp. 266–270 (2022)
36. Etchegoyhen, T., et al.: Mintzai-ST: corpus and baselines for Basque-Spanish speech translation. In: Proceedings of the IberSPEECH, pp. 1–5 (2021)
37. Liu, X., et al.: P-tuning v2: prompt tuning can be comparable to fine-tuning uni-
versally across scales and tasks. arXiv preprint arXiv:2110.07602 (2021)
38. Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation.
In: Proceedings of the ACL, pp. 4582–4597 (2021)
