When Whisper Meets TTS: Domain Adaptation Using Only Synthetic Speech Data
1 Introduction
In recent years, there have been significant advances in Automatic Speech Recog-
nition (ASR) using End-to-End (E2E) models [1]. One of the most notable
outcomes in this area is the development of Transformer-based architectures,
such as Wav2Vec2.0 [2], the Conformer networks [3], and more recently, fully
supervised models like Whisper [4]. These models have achieved state-of-the-
art performance on a variety of speech recognition benchmarks, including those
that involve noisy or accented speech. Although the results observed to date
are impressive, several challenges still remain for the research community. For
instance, these E2E architectures still require large amounts of transcribed
speech data to be trained and to reach good performance. Furthermore, spe-
cific domains such as health care, forensics, multimedia, or government, among
others, may face limited data availability due to privacy concerns or difficulties in
data collection. These scenarios pose challenges due to non-controlled acoustic
conditions and domain-specific vocabulary. Addressing these problems is par-
ticularly challenging for low-resource languages where large labeled corpora are
scarce for training ASR systems. Therefore, training robust ASR models for spe-
cific domains, low resource languages, and non-controlled acoustic conditions
becomes a difficult task.
One strategy to adapt ASR systems to specific domains, especially under low
resource settings, is to use data augmentation strategies to artificially increase
the size of the training data. Methods like SpecAugment [5] or those based on
speed perturbation and noise injection have been shown to be helpful in adapting ASR
models to specific domains. However, these methods only focus on adapting the
ASR system to the specific acoustic conditions of the target domain, overlooking
the challenge of domain-specific vocabulary.
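As a brief illustration of such spectrogram-level augmentation, the following sketch applies SpecAugment-style frequency and time masking with torchaudio; it is a minimal example with assumed mask widths, not the configuration used in [5].

# A minimal sketch of SpecAugment-style masking; mask widths are illustrative.
import torch
import torchaudio

log_mel = torch.randn(1, 80, 300)  # placeholder log-Mel spectrogram (batch, mels, frames)

augment = torch.nn.Sequential(
    torchaudio.transforms.FrequencyMasking(freq_mask_param=15),
    torchaudio.transforms.TimeMasking(time_mask_param=35),
)

augmented = augment(log_mel)
print(augmented.shape)  # same shape, with random frequency and time bands zeroed out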
Recent studies have demonstrated that it is possible to perform data augmen-
tation, or even full training of ASR systems, using synthetic data obtained from
Text-To-Speech (TTS) systems [6–9]. Given the improvements in neural TTS
models, such as Tacotron-2 with Global Style Tokens [10] and more recently
VALL-E [11], it is possible to generate high quality speech with varying prosody
that can be used also to train and adapt E2E ASR models. The use of synthetic
speech to fine-tune ASR systems can be particularly helpful to deal with the
issue of out-of-vocabulary words and to expand the vocabulary of E2E systems
during training [6,12,13]. In [14], the authors use synthetic speech to teach med-
ication names to an E2E ASR system based on a Recurrent Neural Networks
Transducer (RNN-T). The training process involved mixing real and synthetic
samples. The fine-tuned model achieved a relative word error rate (WER) reduction
of up to 65% when recognizing out-of-vocabulary words related to medication
names. Additional studies have shown that combining real and synthetic speech
data during training and fine-tuning can reduce the WER of an ASR system [6–
9,15,16]. However, there are important considerations to address in those cases
to achieve accurate results. For instance, when using synthetic speech for ASR
training, it is important to deal with the mismatch in acoustic characteristics
between real and synthetic audio. Synthetic speech may contain artifacts that
do not exist in real data, such as unrealistic speaking styles and the absence of
background noise. Some studies have mitigated this issue by implementing reg-
ularization strategies [13,17] and freezing the encoder of the E2E model during
fine-tuning [12,13]. The process of fine-tuning only the decoder of the network is
similar to adapting a language model. Therefore, the model only learns the token
representation of out-of-vocabulary words, rather than the acoustic properties
of synthetic speech [18].
These previous works have proven the benefits of increased acoustic and lex-
ical diversity in synthetic data for ASR training. Nevertheless, most of them are
evaluated using standard benchmark corpora, such as LibriSpeech [19]. Moreover,
in the majority of cases, TTS-derived data is used only for data augmentation,
rather than to adapt the model to new, unseen domains, particularly in low
resource languages, where there is a real need to adapt E2E ASR models.
Finally, most of the previous studies have focused on mixing real and synthetic
audio data, and have not shown reliable results using only synthetic speech [7,17].
This study extends all previous research by using only synthetic data for
domain adaptation of E2E ASR models. Our approach is motivated by the recent
release of Whisper [4], which was pre-trained with large amounts of labeled data
from the Internet (up to 680k h). This leads us to believe that it is now possible
to fine-tune models using only synthetic speech, making the domain adapta-
tion tasks more feasible, especially in low resource languages. We considered a
state-of-the-art TTS system to generate realistic speech signals in order to create
adapted Whisper models for a variety of domains, including forensics, broad-
cast media, and parliamentary. Our proposed methodology was also evaluated
in different languages to test the effect of using synthetic speech to adapt pre-
trained models with large, intermediate, and low resource languages such as
English, Spanish, and Basque, respectively. To the best of our knowledge, this
is one of the first studies to consider the effect of using only synthetic speech in
ASR model adaptation for non-English data. An additional contribution of this
paper relies on the evaluation and comparison of different Parameter Efficient
Fine-Tuning (PEFT) methods [20] when training large Transformer-based mod-
els. PEFT-based approaches focus on fine-tuning only a small number of model
parameters, thereby greatly decreasing the computational and storage costs. The
use of these strategies has not been extensively explored for speech-based mod-
els. However, this is an important aspect to be considered when training large
models such as Whisper.
The rest of the paper is organized as follows. Section 2 describes the methods
and strategies considered to adapt an ASR system based on Whisper to new
unknown domains in different languages. Section 3 describes the different corpora
considered in this study to train and evaluate the proposed approach. Section 4
shows the main results obtained and discusses the main insights derived from the
performed experiments. Finally, Sect. 5 draws the main conclusions and presents
further perspectives to be addressed.
2 Methods
2.1 Whisper
Whisper is an encoder-decoder Transformer network recently introduced by Ope-
nAI [4]. The model is trained in a fully supervised manner, using up to 680k h
of labeled speech data from multiple sources. The encoder is fed with 80-channel log-Mel spectrogram features extracted from the input audio.
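As an illustration of the model interface, the following sketch loads a small Whisper checkpoint through the Hugging Face transformers library and inspects the 80-channel input representation; it is a minimal example, not the training setup of [4].

# A minimal sketch, assuming the Hugging Face "transformers" port of Whisper.
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# A 30 s placeholder waveform at 16 kHz stands in for a real utterance.
audio = torch.zeros(16_000 * 30).numpy()
inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
print(inputs.input_features.shape)  # torch.Size([1, 80, 3000]): 80 Mel channels

predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True))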
2.2 Parameter Efficient Fine-Tuning
Due to the large number of parameters to fine-tune, especially for the Large version
of Whisper (1550 M), a set of PEFT strategies was applied. These methods
aimed to fine-tune a small number of model parameters, decreasing the compu-
tational and storage costs [20]. In general, PEFT methods have been shown to be
comparable to a full parameter fine-tuning despite the substantial reduction of
tunable parameters [20].
We compared three different PEFT methods: (1) the Low-Rank Adaptation
(LoRA) [21], which freezes the pretrained model weights and injects trainable
rank decomposition matrices into each layer of the Whisper decoder. We con-
sider a rank r = 32 and a re-scaling factor α = 64 for the matrix factorization
in LoRA [21]. (2) AdaLoRA [22], where the rank decomposition of the weight
matrices is performed adaptively. Critical incremental matrices are assigned a
high rank such that they can capture more fine-grained and task-specific infor-
mation, while less important ones are pruned to a lower rank to prevent over-
fitting and save the computational budget [22]. (3) Finally, in addition to the
rank decomposition-based approaches, we considered the Bias-terms Fine-tuning
(BitFit) strategy [23]. BitFit updates the bias terms in the pre-trained model,
while freezing the remaining parameters of the Whisper decoder. The authors
in [23] showed that fine-tuning only a subset of bias parameters in a Transformer
network is comparable to a full fine-tuning of the model.
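As a rough illustration of these strategies, the sketch below configures LoRA with the hyperparameters stated above (r = 32, α = 64) using the Hugging Face peft library and emulates BitFit with a bias-only freeze; the regular expression restricting the adapters to the decoder attention projections reflects our assumption about module naming in the transformers implementation, not a detail reported in this work.

# A hedged sketch, not the authors' exact training code.
from transformers import WhisperForConditionalGeneration
from peft import LoraConfig, get_peft_model

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

# LoRA on the decoder attention projections only (module names are an assumption
# about the transformers implementation of Whisper).
lora_config = LoraConfig(r=32, lora_alpha=64,
                         target_modules=r".*decoder.*(q_proj|v_proj)")
lora_model = get_peft_model(model, lora_config)
lora_model.print_trainable_parameters()

# BitFit-style alternative: freeze everything except the decoder bias terms.
bitfit_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")
for name, param in bitfit_model.named_parameters():
    param.requires_grad = name.startswith("model.decoder") and name.endswith("bias")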
2.3 Speech Synthesis
The generation of realistic synthetic speech data was performed through a state-
of-the-art TTS system composed of a Tacotron-2 [24] acoustic model followed by
a HiFi-GAN [25] neural vocoder. Tacotron-2 consists of a sequence-to-sequence
model, which includes an encoder, a decoder, and a final post-processing convolu-
tional neural network. The encoder is fed with embedding representations of the
input characters, generated by a 1D convolutional-recurrent network that is
trained simultaneously with the whole TTS system. The Tacotron-2 models for
English, Spanish, and Basque were trained on pairs of text and their corresponding
acoustic information, represented by audio sampled at 22,050 Hz, using 80-
channel Mel-spectrograms, a frame length of 1024, a time-shift of 256 samples,
and a 1024-resolution Fourier transform. During training, this network learned
to generalize and generate new spectrograms from unseen texts using the exam-
ples given for training. The model was trained using an Adam optimizer [26],
and a learning rate of 10^-3 that exponentially decays to 10^-5 after 50k steps. We
also applied L2 regularization with a weight of 10^-7 and a batch size of 32. The
final number of training steps differed slightly for each model, ranging
between 170k and 190k steps. The English model was trained with
the LJ Speech Dataset [27] composed of 13,100 short audio clips from a single
speaker (23 h and 54 min). The Spanish and Basque models were trained using
mono-speaker proprietary datasets, containing 11,650 (20 h and 46 min) and
11,640 (19 h and 4 min) short audio clips for Spanish and Basque, respectively.
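The acoustic feature configuration described above can be reproduced, for illustration, with torchaudio as in the sketch below; it only mirrors the stated parameters (22,050 Hz, 80 Mel channels, 1024-sample frames, 256-sample shift, 1024-point FFT) and is not the authors' feature-extraction pipeline.

# A minimal sketch of the stated Mel-spectrogram configuration, using torchaudio.
import torch
import torchaudio

mel_extractor = torchaudio.transforms.MelSpectrogram(
    sample_rate=22_050,
    n_fft=1024,          # 1024-resolution Fourier transform
    win_length=1024,     # frame length of 1024 samples
    hop_length=256,      # time-shift of 256 samples
    n_mels=80,           # 80-channel Mel-spectrogram
)

waveform = torch.randn(1, 22_050 * 3)                 # placeholder: 3 s of noise
log_mel = torch.log(mel_extractor(waveform).clamp(min=1e-5))
print(log_mel.shape)                                  # (1, 80, n_frames)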
2.4 Methodology
The proposed methodology is shown in Fig. 2. The text data for each domain was
crawled from the Internet to obtain the target vocabulary for recognition. The
crawled corpora were then preprocessed and used as input for the TTS system.
After generating the synthetic speech data, different versions of Whisper were
fine-tuned to obtain domain-specific ASRs. Only the decoder of Whisper was
adapted to learn the target vocabulary and not the acoustic characteristics of
synthetic speech. The evaluation was performed using real acoustic data from
the specific domains.
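A minimal sketch of this decoder-only adaptation, assuming the Hugging Face implementation of Whisper, is shown below; freezing the encoder keeps the acoustic front-end untouched so that fine-tuning on synthetic speech only updates the decoder.

# A minimal sketch: freeze the Whisper encoder so that fine-tuning on synthetic
# speech only updates the decoder (the "language model" part of the network).
from transformers import WhisperForConditionalGeneration

model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-medium")

for param in model.model.encoder.parameters():
    param.requires_grad = False

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,} ({trainable / total:.1%})")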
3 Data Description
We considered data in English, Spanish, and Basque with the aim of covering
different scenarios where the original Whisper model was trained with large,
intermediate, and low resource data. For each language, the data used were
specific to a particular domain of application, which included forensics, broadcast
media, and parliament. Table 1 summarizes the main characteristics for each
corpus. Further details about each scenario are found in the following sections.
3.1 Forensic Domain (English)
The experiments for this scenario were performed with the GRACE corpus [28].
This is a multilingual dataset that comprises audio recordings from multiple
sources from the research community. Audios from different public databases
were compiled and filtered according to the presence of 86 keywords related to
child abuse. This study considered only the English version of the dataset, which
comprises 9.2 h of audio recordings from the Spoken Wikipedia corpus [29], the
Debating technology corpus [30], and TEDLIUMv2 [31]. This corpus is available
online1 to be used as a benchmark corpus for speech recognition under forensic
domains.
The text data used for synthesis and fine-tuning of the Whisper model
included crawled documents from EUROPOL2 , UNICEF3 , and Wikipedia arti-
cles related to child abuse. The crawled corpus is composed of 55,059 words
(without stop words) from which 4,571 audio utterances were created (11.7 h).
3.2 Broadcast Media (Spanish)
The data for this experiment considered the test set of the IberSPEECH-RTVE
2022 Speech to Text Transcription Challenge [32]4 . The database is a collection
of 54 h of audio materials from the Spanish national TV (RTVE) archive in
various genres. The corpus covers a wide variety of scenarios of read and spon-
taneous speech, including material from scripted content to live broadcasts.
1 https://shorturl.at/dfjx2.
2 https://www.europol.europa.eu/media-press/newsroom?q=child%20abuse.
3 https://www.unicef.org/search?force=0&query=child+abuse&created%5Bmin%5D=&created%5Bmax%5D=.
4 http://catedrartve.unizar.es/rtvedatabase.html.
4 Results and Discussion
Table 2. Results obtained by fine-tuning each Whisper version using synthetic speech
in English, Spanish, and Basque. The performance is measured in terms of total WER.
For Spanish and Basque, the fine-tuned models reduce the WER between 6.2 and
31 points depending on the model size. For both languages, the
WER reduction is more evident for the case of the smallest models (tiny and
base). For the English language, the fine-tuning process using synthetic speech
does not result in a reduction in WER with respect to the original models,
which is contrary to the results obtained for Spanish and Basque. In some cases,
the fine-tuned version even produces a higher WER than the original one (base
and large). This behavior can be explained by two reasons: (1) the amount of
data used to train the original Whisper models for English is much greater than
the amount considered for Spanish and Basque (see Table 1). (2) The test data
used in English is a compilation of several corpora from the literature, including
TEDLIUMv2 and the Spoken Wikipedia corpus. Information from such corpora
may already be available within the original Whisper weights. Therefore, the
information added to the model via synthetic speech does not contribute to new
knowledge, as in the case of Spanish and Basque.
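For reference, the WER values and absolute reductions discussed here can be computed as in the sketch below, which uses the jiwer package on illustrative placeholder transcripts rather than examples from the evaluated corpora.

# A small sketch of the metric: word error rate (WER) and the absolute
# reduction in points between the original and the fine-tuned model.
# The transcripts are placeholders, not examples from the evaluated corpora.
import jiwer

reference = ["the committee approved the new regulation this morning"]
hyp_original = ["the committee approved a new relegation this morning"]
hyp_finetuned = ["the committee approved the new regulation this morning"]

wer_original = jiwer.wer(reference, hyp_original)
wer_finetuned = jiwer.wer(reference, hyp_finetuned)

print(f"WER (original):   {100 * wer_original:.1f}%")
print(f"WER (fine-tuned): {100 * wer_finetuned:.1f}%")
print(f"Absolute reduction: {100 * (wer_original - wer_finetuned):.1f} points")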
With the aim of comparing different PEFT methods when fine-tuning Whisper,
Table 3 shows results comparing LoRA [21], AdaLoRA [22], and BitFit [23].
The comparison is performed by fine-tuning the medium Whisper model (769 M
parameters). The fine-tuning process for all methods is performed under the
same conditions and using the same hyperparameters for training.
Similar results are observed when using either LoRA or AdaLoRA. Both
approaches are able to reduce the WER compared to the original Whisper
model, especially for Spanish and Basque. This is explained considering that
both approaches rely on the same principle of weight decomposition into low
rank matrices. The main difference is that AdaLoRA is able to achieve the same
results, but fine-tuning less than half the parameters fine-tuned by LoRA. This
leads to less memory consumption (see Fig. 3). BitFit helps to reduce the train-
Table 3. Comparison between different PEFT methods for fine-tuning the medium
version of Whisper. Results are presented in terms of WER and the normalized WER
(nWER) with respect to the original Whisper model.
ing time and the memory costs even further, by fine-tuning only 0.08% of the
weights, but at the cost of sacrificing performance, especially in Basque.
Fig. 3. GPU memory and training speed for the different PEFT methods. All evaluations
are conducted on an NVIDIA GeForce RTX-3090 GPU (24 GB VRAM). Full fine-tuning
is not possible for batch sizes larger than 1 due to memory constraints.
Training speed for BitFit is significantly higher than for the other two meth-
ods (p-value < 0.005 in all cases). The differences in training speed between
LoRA, AdaLoRA, and the full fine-tuning are not significant (p-value > 0.005
in all cases). The statistical comparisons were performed using an ANOVA with
a Tukey Post-Hoc test. Although there are no significant differences in train-
ing speed between LoRA, AdaLoRA, and the full fine-tuning, the memory con-
sumption of the PEFT methods is much lower than that observed for the full
fine-tuning. This makes it possible to use larger batch sizes (either directly or
through gradient accumulation steps), which ultimately translates into a
significant improvement in training speed [20].
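The statistical procedure described above can be reproduced along the lines of the following sketch, which applies a one-way ANOVA and a Tukey HSD post-hoc test to per-run training-speed measurements; the numbers are random placeholders, not the measurements behind Fig. 3.

# A hedged sketch of the statistical comparison: one-way ANOVA followed by a
# Tukey HSD post-hoc test over training-speed measurements per PEFT method.
# The data below are random placeholders, not the paper's measurements.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
speeds = {
    "full":    rng.normal(3.0, 0.2, 30),
    "lora":    rng.normal(3.1, 0.2, 30),
    "adalora": rng.normal(3.0, 0.2, 30),
    "bitfit":  rng.normal(4.5, 0.2, 30),
}

print(f_oneway(*speeds.values()))

values = np.concatenate(list(speeds.values()))
groups = np.concatenate([[name] * len(v) for name, v in speeds.items()])
print(pairwise_tukeyhsd(values, groups))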
5 Conclusion
This paper proposes a methodology for adapting the vocabulary of Whisper-
based speech recognizers to new domains in different languages using only TTS-
derived data. The proposed approach was tested on data from large, interme-
diate, and low resource scenarios in English, Spanish, and Basque languages,
respectively. In addition, we compared different PEFT-based methods to per-
form the fine-tuning process due to the large number of parameters to train.
This study demonstrated that it is possible to improve the performance of an
E2E ASR system using only synthetic data, which is a novel approach. Previous
studies relied on the combination of synthetic and real speech, which still requires
data annotation procedures that can slow down the training process. Using only
synthetic speech data can be a more efficient and cost-effective approach, espe-
cially when there is limited time or resources available for data collection.
The results indicated that using only synthetic data for domain adaptation of
Whisper-based ASRs leads to performance improvements, particularly in low-
resource scenarios. The proposed methodology was successful in reducing the
WER between 6.2 and 31 points, depending on the language and model version.
In addition, we confirm the utility of using PEFT methods to train large models,
which would be difficult to achieve under limited hardware resources. PEFT
methods helped to reduce memory consumption, giving the possibility to use
larger batch sizes, which ultimately lead to more generalized models.
For future work, we will incorporate data augmentation techniques such as
SpecAugment to increase the acoustic variability of the synthetic audios. Addi-
tionally, using more synthetic samples can also help reduce WERs, as the model
can learn from a larger vocabulary. Overall, these approaches can lead to even
better performance and robustness of the ASR system in different domains and
languages. Additional PEFT methods such as those based on Prompt and Pre-
fix tuning [37,38] can also be considered and adapted to fine-tune the decoder of
large acoustic models such as Whisper.
References
1. Li, J., et al.: Recent advances in end-to-end automatic speech recognition. APSIPA
Trans. Sign. Inf. Proc. 11(1) (2022)
2. Baevski, A., et al.: Wav2Vec 2.0: a framework for self-supervised learning of speech
representations. In: NEURIPS, vol. 33, pp. 12449–12460 (2020)
3. Gulati, A., et al.: Conformer: convolution-augmented transformer for speech recog-
nition. In: Proceedings of the INTERSPEECH, pp. 5036–5040 (2020)
4. Radford, A., et al.: Robust speech recognition via large-scale weak supervision.
Technical report, OpenAI (2022)
5. Park, D.S., et al.: SpecAugment: a simple data augmentation method for automatic
speech recognition. In: Proceedings of the INTERSPEECH, pp. 2613–2617 (2019)
6. Li, J., et al.: Training neural speech recognition systems with synthetic speech
augmentation. arXiv preprint arXiv:1811.00707 (2018)
7. Rosenberg, A., et al.: Speech recognition with augmented synthesized speech. In:
Proceedings of the ASRU, pp. 996–1002. IEEE (2019)
8. Laptev, A., et al.: You do not need more data: improving end-to-end speech recog-
nition by text-to-speech data augmentation. In: Proceedings of the CISP-BMEI,
pp. 439–444. IEEE (2020)
9. Rossenbach, N., et al.: Generating synthetic audio data for attention-based speech
recognition systems. In: Proceedings of the ICASSP, pp. 7069–7073. IEEE (2020)
10. Wang, Y., et al.: Style tokens: unsupervised style modeling, control and transfer in
end-to-end speech synthesis. In: Proceedings of the ICML, pp. 5180–5189. PMLR
(2018)
11. Wang, C., et al.: Neural codec language models are zero-shot text to speech syn-
thesizers. arXiv preprint arXiv:2301.02111 (2023)
12. Ueno, S., et al.: Multi-speaker sequence-to-sequence speech synthesis for data aug-
mentation in acoustic-to-word speech recognition. In: Proceedings of the ICASSP,
pp. 6161–6165. IEEE (2019)
13. Zheng, X., Liu, Y., Gunceler, D., Willett, D.: Using synthetic audio to improve the
recognition of out-of-vocabulary words in end-to-end ASR systems. In: Proceedings
of the ICASSP, pp. 5674–5678. IEEE (2021)
14. Fazel, A., et al.: SynthASR: unlocking synthetic data for speech recognition. arXiv
preprint arXiv:2106.07803 (2021)
15. Ueno, S., et al.: Data augmentation for ASR using TTS via a discrete representa-
tion. In: Proceedings of the ASRU, pp. 68–75. IEEE (2021)
16. Qu, L., Weber, C., Wermter, S.: Emphasizing unseen words: new vocabulary acqui-
sition for end-to-end speech recognition. Neural Netw. 161, 494–504 (2023)
17. Hu, T.Y., et al.: Synt++: utilizing imperfect synthetic data to improve speech
recognition. In: Proceedings of the ICASSP, pp. 7682–7686. IEEE (2022)
18. Mimura, M., et al.: Leveraging sequence-to-sequence speech synthesis for enhancing
acoustic-to-word speech recognition. In: Proceedings of the SLT, pp. 477–484. IEEE
(2018)
19. Panayotov, V., et al.: LibriSpeech: an ASR corpus based on public domain audio
books. In: Proceedings of the ICASSP, pp. 5206–5210 (2015)
20. Ding, N., et al.: Parameter-efficient fine-tuning of large-scale pre-trained language
models. Nature Mach. Intell. 5, 1–16 (2023)
21. Hu, E.J., Shen, Y., et al.: LoRA: low-rank adaptation of large language models.
arXiv preprint arXiv:2106.09685 (2021)
22. Zhang, Q., et al.: Adaptive budget allocation for parameter-efficient fine-tuning.
arXiv preprint arXiv:2303.10512 (2023)
23. Zaken, E.B., et al.: BitFit: simple parameter-efficient fine-tuning for transformer-
based masked language-models. arXiv preprint arXiv:2106.10199 (2021)
24. Shen, et al.: Natural TTS synthesis by conditioning WaveNet on MEL spectrogram
predictions. In: Proceedings of the ICASSP, pp. 4779–4783. IEEE (2018)
25. Kong, J., et al.: HiFi-GAN: generative adversarial networks for efficient and high
fidelity speech synthesis. In: Proceedings of the NEURIPS, vol. 33, pp. 17022–17033
(2020)
26. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. 2015 ICLR.
arXiv preprint arXiv:1412.6980 (2015)
27. Ito, K., Johnson, L.: The LJ speech dataset (2017). http://keithito.com/LJ-Speech-Dataset/
28. Vásquez-Correa, J.C., Álvarez Muniain, A.: Novel speech recognition systems
applied to forensics within child exploitation: Wav2Vec 2.0 vs. Whisper. Sensors
23(4), 1843 (2023)
29. Baumann, T., et al.: The spoken Wikipedia corpus collection: harvesting, alignment
and an application to hyperlistening. Lang. Resour. Eval. 53(2), 303–329 (2019)
30. Mirkin, S., et al.: A recorded debating dataset. In: Proceedings of the LREC, pp.
250–254 (2017)
31. Rousseau, A., et al.: Enhancing the TED-LIUM corpus with selected data for
language modeling and more TED talks. In: Proceedings of the LREC, pp. 3935–
3939 (2014)
32. Lleida, E., et al.: Albayzin evaluation: IberSPEECH-RTVE 2022 speech to text
transcription challenge (2022)
33. Dinkel, H., et al.: Voice activity detection in the wild: a data-driven approach
using teacher-student training. IEEE/ACM Trans. Audio, Speech Lang. Process.
29, 1542–1555 (2021)
34. Gemmeke, J., et al.: Audio set: an ontology and human-labeled dataset for audio
events. In: Proceedings of the ICASSP, pp. 776–780 (2017)
35. Arzelus, H., et al.: The Vicomtech-UPM speech transcription systems for the
albayzın-rtve 2022 speech to text transcription challenge. In: Proceedings of the
IberSPEECH, pp. 266–270 (2022)
36. Etchegoyhen, T., et al.: mintzai-ST: corpus and baselines for Basque-Spanish speech
translation. In: Proceedings of the IberSPEECH, pp. 1–5 (2021)
37. Liu, X., et al.: P-tuning v2: prompt tuning can be comparable to fine-tuning uni-
versally across scales and tasks. arXiv preprint arXiv:2110.07602 (2021)
38. Li, X.L., Liang, P.: Prefix-tuning: optimizing continuous prompts for generation.
In: Proceedings of the ACL, pp. 4582–4597 (2021)