Improving Accented Speech Recognition using Data Augmentation based on Unsupervised Text-to-Speech Synthesis

Cong-Thanh Do (Toshiba Research Europe, Cambridge, UK), Shuhei Imai (Tohoku University, Sendai, Japan), Rama Doddipatla (Toshiba Research Europe, Cambridge, UK), Thomas Hain (University of Sheffield, Sheffield, UK)

Abstract—This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. The TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions, and are hence unsupervised. This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition. Synthetic accented speech data, generated from text prompts by using the TTS systems, are then combined with available non-accented speech data to train automatic speech recognition (ASR) systems. ASR experiments are performed in a self-supervised learning framework using a Wav2vec2.0 model which was pre-trained on a large amount of unsupervised accented speech data. The accented speech data for training the unsupervised TTS are read speech, selected from the L2-ARCTIC and British Isles corpora, while spontaneous conversational speech from the Edinburgh international accents of English corpus is used as the evaluation data. Experimental results show that Wav2vec2.0 models which are fine-tuned to the downstream ASR task with synthetic accented speech data, generated by the unsupervised TTS, yield up to 6.1% relative word error rate reductions compared to a Wav2vec2.0 baseline which is fine-tuned with the non-accented speech data from the Librispeech corpus.

Index Terms—Accented speech recognition, text-to-speech synthesis, data augmentation, self-supervised learning, Wav2vec2.0

I. INTRODUCTION

Accented speech recognition is an important research topic in automatic speech recognition (ASR). Because of its importance, this topic has been receiving attention and has been addressed with various research approaches. In general, these approaches can be classified into accent-agnostic approaches, in which the modeling of accents inside the ASR systems is not made specific, and accent-aware approaches, in which additional information about the accents of the input speech is used [1]. Among accent-agnostic approaches, adversarial learning was used for accent classification and accent relabeling, which led to performance improvements [2], [3], [4]. In addition, similarity losses such as cosine or contrastive losses were used to build accent-neutral models [5]. In accent-aware approaches, multi-domain training [6], accent embeddings [7], and accent information fusion [8] are among the approaches that have been investigated.

Text-to-speech synthesis (TTS) is a useful technology which can be used to improve ASR in a number of ways, for instance to improve the pre-training of self-supervised learning (SSL) models [9] or to improve the recognition of out-of-vocabulary words in end-to-end ASR [10]. TTS has also been used as a data augmentation method to improve speech recognition on the Librispeech task [11] and in low-resource speech recognition [12], [13], [14]. More specifically, synthetic data were used for data augmentation in low-resource ASR with a conventional hybrid structure [13] and to augment the training of an RNN-T (recurrent neural network transducer) ASR model [15]. In [14], cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion were applied to data augmentation for ASR. The authors showed that it is possible to achieve promising results for ASR model training with just a single-speaker dataset in a target language, making the approach viable for low-resource scenarios [14].

While TTS has been widely used in various ASR tasks, unsupervised TTS in particular, which is trained with unsupervised audio data [16], has not been extensively studied as a data augmentation method for accented speech recognition. In a recent study on using TTS as data augmentation for accented speech recognition [17], accented speech was generated by passing English text prompts through a TTS system for a language corresponding to the target accent. For example, English text prompts passed through a Spanish TTS will approximate Spanish-accented English. The study in [17] used commercial TTS systems whose training data were not accessible to users.

In this paper, we investigate the use of unsupervised TTS as a data augmentation method to improve accented speech recognition. In our approach, we make use of a small amount of accented speech data which do not have manual transcriptions to train TTS systems. This enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition. Indeed, from a small amount of unsupervised accented speech data used to train the TTS systems, we can generate a larger amount of synthetic accented speech data once the TTS systems are trained. In this paper, from 58 hours of accented speech data, selected from two read-speech corpora, L2-ARCTIC [18] and British Isles [19], we train unsupervised TTS and generate 250 more hours of synthetic accented speech data, which help to achieve better gains on the evaluation data of spontaneous conversational speech from the Edinburgh international accents of English corpus (EdAcc) [20].

Fig. 1: Unsupervised accented speech training data and their pseudo-labels are used to train unsupervised TTS. The pseudo-labels are generated by decoding the unsupervised accented speech training data with the baseline ASR model obtained from the supervised fine-tuning of the SSL pre-trained model on the supervised non-accented speech data. The unsupervised accented speech training data may be included in the semi-supervised fine-tuning for ASR, and the non-accented speech data may be used to train a TTS system.

Fig. 2: Supervised accented speech training data are used to train supervised TTS. These data may be included in the supervised fine-tuning for ASR, and the non-accented speech training data may be used to train a TTS system. The fine-tuned ASR model is used in the ASR inference, with an external language model (LM), to decode test speech.

The paper is organized as follows. In section II, the data augmentation for accented speech recognition based on unsupervised TTS is introduced. The training and inference of the TTS systems are presented in section III. Section IV introduces the data used in the experiments, the experimental results, and a discussion. Finally, section V concludes the paper.

II. DATA AUGMENTATION FOR ACCENTED SPEECH RECOGNITION BASED ON UNSUPERVISED TTS

We use the Wav2vec2.0 SSL framework [21] for our experiments with accented speech recognition. Assuming that a Wav2vec2.0 model has been pre-trained via SSL on a large amount of unsupervised speech data covering various English accents and speakers, we can fine-tune this pre-trained model to the downstream ASR task using available non-accented speech data. The non-accented speech data could be any available data which can be used to train ASR systems, for instance the Librispeech training data [22]. When using publicly available Wav2vec2.0 pre-trained models, we assume that only the models are available and their training data are not.

In addition to the non-accented speech data, we assume that a small amount of accented speech training data, named AccD, is available. These accented speech training data will be used to train the TTS systems. In accented speech recognition, it is not practical to find accented speech training data spoken in the same speaking styles and by the same speakers as in the evaluation data. It is more viable to find accented speech training data spoken by speakers whose first languages are similar to those of the speakers in the evaluation data. Using these speech data to train TTS systems and generate more accented speech data for ASR training should create more accent variability and, hence, improve accented speech recognition performance.

A. Unsupervised scenario

Fig. 1 shows the unsupervised scenario, where the manual transcriptions of the accented speech training data AccD are not available. Hence, pseudo-labels for the unsupervised accented speech training data are generated by decoding these data with the baseline ASR model obtained by fine-tuning the SSL pre-trained model on the supervised non-accented speech data. The unsupervised accented speech training data AccD and their pseudo-labels are then used to train a TTS model, a Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS) model [23], to generate synthetic accented speech data for data augmentation. The unsupervised accented speech training data and their pseudo-labels may also be included in the semi-supervised fine-tuning of the SSL pre-trained model. The ASR model obtained after the semi-supervised fine-tuning is used in the ASR inference to decode input test speech. The ASR inference uses an external language model (LM) when decoding audio data.
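The paper does not specify the tooling for this pseudo-labelling step. As an illustration only, a CTC-fine-tuned Wav2vec2.0 checkpoint from HuggingFace Transformers could be used to produce greedy CTC transcripts; the checkpoint name below is an assumed stand-in for the baseline model fine-tuned on the Librispeech data, not the authors' actual model.

```python
# Illustrative pseudo-labelling pass (a sketch, not the authors' exact tooling).
import torch
import torchaudio
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC

# Assumed stand-in for the Librispeech-fine-tuned baseline ASR model.
CKPT = "facebook/wav2vec2-large-960h-lv60-self"
processor = Wav2Vec2Processor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT).eval()

def pseudo_label(wav_path: str) -> str:
    """Decode one accented utterance with the baseline ASR model (greedy CTC)."""
    waveform, sr = torchaudio.load(wav_path)
    if sr != 16000:  # Wav2vec2.0 expects 16 kHz input
        waveform = torchaudio.functional.resample(waveform, sr, 16000)
    inputs = processor(waveform.squeeze(0).numpy(), sampling_rate=16000,
                       return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]

# The returned transcript becomes the TTS training label for this utterance.
```

In the experiments, these pseudo-labels reach 10.7% and 10.2% WER on the L2-ARCTIC and British Isles training data, respectively (see Section IV-A).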
A word-level 4-gram LM is used as the external LM during ASR inference. This 4-gram LM is trained on the manual transcriptions of the Librispeech training data. The pre-training and fine-tuning of the Wav2vec2.0 models, as well as the inference, follow the same settings used for the LARGE Wav2vec2.0 models in [21]. These large models consist of 6 convolutional neural network (CNN) layers and 24 transformer layers, and have 350 million parameters.

Fig. 3: Training of the VITS parallel end-to-end TTS system. The input text can be either manual transcriptions or pseudo-labels.

Fig. 4: TTS inference using the VITS model, where a text prompt and a speaker ID are used as input.

B. Supervised scenario

For comparison, we also examine the supervised scenario, where the manual transcriptions of the accented speech training data AccD are available. In Fig. 2, the accented speech training data AccD and their manual transcriptions can be directly used to train a supervised TTS model. They may also be included in the training data used for the supervised fine-tuning of the SSL pre-trained model. Once the TTS model is trained, it can be used during TTS inference to generate synthetic accented speech data from independent text prompts. Both the synthetic accented speech data and the text prompts can then be used for data augmentation in the supervised fine-tuning of the SSL pre-trained model.
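A sketch of this generation step is shown below. The VITS training and inference code itself is described in [23]; load_vits and synthesize are hypothetical wrappers around a trained multi-speaker checkpoint, and the file names and manifest layout are assumptions for illustration, not the authors' pipeline.

```python
# Hypothetical batch-synthesis loop: independent text prompts are turned into
# synthetic accented utterances with a trained multi-speaker VITS model.
# load_vits() and synthesize() are assumed wrappers, not a published API.
import csv
import random
import soundfile as sf

tts = load_vits("vits_accd.pth")   # hypothetical loader for a trained AccD checkpoint
speaker_ids = list(range(144))     # 24 L2-ARCTIC + 120 British Isles speakers

with open("text_prompts.txt") as prompts, \
        open("tts_accd_manifest.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["audio_path", "text"])
    for i, prompt in enumerate(line.strip() for line in prompts):
        spk = random.choice(speaker_ids)                     # pick a TTS training speaker
        wav, sample_rate = synthesize(tts, text=prompt, speaker_id=spk)
        path = f"tts_accd/{i:06d}.wav"
        sf.write(path, wav, sample_rate)
        writer.writerow([path, prompt])                      # the prompt doubles as the ASR label
```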
III. TEXT-TO-SPEECH SYNTHESIS

VITS is an end-to-end multi-speaker TTS system which can generate high-quality waveforms [23]. During the training of VITS (see Fig. 3), a Posterior Encoder encodes the linear spectrogram of natural speech into a latent variable z [24], which is then used by a Decoder to restore the waveform. HiFi-GAN [26], a neural vocoder based on generative adversarial networks (GANs) [25], is used in the decoder to synthesize high-fidelity speech. The latent variable z is also fed into the Flow f, whose output is used to compute the Kullback-Leibler divergence with the Text Encoder outputs. The Flow f is trained to remove speaker information and reduce posterior complexity [27]. During training, speaker identities (IDs) are used to extract speaker embeddings for training the multi-speaker TTS.

During TTS inference (see Fig. 4), the inverse transform f^{-1} of the Flow f is used to synthesize speech. The output of the Text Encoder is stretched by the Length Regulator based on the predicted durations, and the sampled latent variable z' ~ N(z'; µ_θ(text), σ_θ(text)) is then transformed by the inverse Flow f^{-1} together with the speaker information. Speech is subsequently generated by the decoder. All the VITS systems in this paper use the same architecture and are trained with the same number of training iterations, i.e. 300K. We observe that after 300K training iterations the quality of the synthesized waveforms saturates. Further details of the VITS models and their implementation can be found in [23].
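As a structural sketch only, the inference path just described can be summarized as follows; the module handles mirror the blocks in Fig. 4 and are placeholders rather than the actual implementation of [23], shapes assume a single unbatched utterance, and the noise scale is an illustrative value.

```python
# Structural sketch of the VITS inference path (placeholder modules named
# after the blocks in Fig. 4; see [23] for the real implementation).
import torch

def length_regulate(mu, sigma, durations):
    """Repeat each phoneme's prior statistics according to its predicted duration."""
    reps = durations.long().clamp(min=1)                 # frames per phoneme, shape (T_text,)
    return (mu.repeat_interleave(reps, dim=-1),
            sigma.repeat_interleave(reps, dim=-1))

def vits_infer(phonemes, speaker_id, text_encoder, duration_predictor,
               flow, decoder, speaker_table, noise_scale=0.667):
    h_text, mu_theta, sigma_theta = text_encoder(phonemes)      # prior statistics per phoneme
    g = speaker_table(speaker_id)                               # speaker embedding
    durations = torch.ceil(duration_predictor(h_text, g))       # stochastic duration predictor + ceil
    mu, sigma = length_regulate(mu_theta, sigma_theta, durations)
    z_prime = mu + sigma * noise_scale * torch.randn_like(mu)   # z' ~ N(mu_theta, sigma_theta)
    z = flow.inverse(z_prime, g)                                # inverse flow f^{-1}
    return decoder(z, g)                                        # HiFi-GAN decoder -> waveform
```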
IV. EXPERIMENTS

The Wav2vec2.0 model is pre-trained on more than 60K hours of unsupervised speech data from the Libri-Light, CommonVoice, Switchboard, and Fisher corpora. These speech training data were spoken in various English accents. The non-accented speech data consist of 960 hours of training data from the Librispeech corpus [22], which include 2200 speakers. Although these speakers spoke US English, we consider Librispeech as non-accented in the context of this study because US English is only one of the English accents in the evaluation data. Subsequent experimental results confirm this assumption.

A. Data

1) Accented speech training data (AccD): We combine data from the L2-ARCTIC corpus [18] and the British Isles corpus [19] as accented speech training data. These are corpora of read speech which were recorded in controlled environments. The L2-ARCTIC corpus is a corpus of non-native English which contains 26,867 utterances from 24 non-native English speakers, with an equal number of speakers per accent. The total duration of the corpus is 27.1 hours, with an average of 67.7 minutes of speech per speaker. On average, each utterance is 3.6 seconds long. The utterances in L2-ARCTIC are spoken in 6 non-native accents: Arabic, Chinese, Hindi, Korean, Spanish, and Vietnamese.

TABLE I: Word error rates (WERs, %) on the development and test sets of the Edinburgh international accents of English corpus (EdAcc), and on the test-clean and test-other sets of the Librispeech (LS) corpus.

Fine-tuning data                                          | EdAcc dev-set | EdAcc test-set | LS test-clean | LS test-other
LS 960h (M) (baseline in [20])                            | 33.4          | 36.1           | 2.9           | 5.6
LS 960h (M) (our baseline)                                | 32.8          | 35.1           | 2.2           | 4.2
LS 960h (M) + AccD (P)                                    | 32.4          | 34.6           | 2.1           | 4.1
LS 960h (M) + AccD (M)                                    | 31.1          | 33.4           | 2.1           | 4.0
LS 960h (M) + TTS-LS 960h (M)                             | 31.4          | 33.8           | 2.1           | 4.0
LS 960h (M) + TTS-AccD (P)                                | 31.0          | 33.2           | 2.1           | 4.1
LS 960h (M) + TTS-AccD (M)                                | 30.8          | 33.0           | 2.1           | 4.1
LS 960h (M) + AccD (P) + TTS-AccD (P) + TTS-LS 960h (M)   | 30.8          | 33.2           | 2.1           | 4.2
LS 960h (M) + AccD (M) + TTS-AccD (M) + TTS-LS 960h (M)   | 30.4          | 32.7           | 2.1           | 4.1

The British Isles corpus includes speech utterances recorded by volunteers speaking with different accents of the British Isles, namely Ireland, Scotland, Wales, the Midlands, and Northern and Southern England. The corpus consists of 17,877 utterances spoken by 120 speakers, of which 49 are female and 71 are male. The total duration of the corpus is 31 hours. Together, the two corpora provide approximately 58 hours of accented speech from 144 speakers. When decoded in the unsupervised scenario, the WERs of the pseudo-labels obtained on the L2-ARCTIC and British Isles training data are 10.7% and 10.2%, respectively.

2) Evaluation data: We use the development and test sets from the Edinburgh international accents of English corpus (EdAcc) [20], which consist of spontaneous conversational speech, as evaluation data. The corpus includes a wide range of first- and second-language varieties of English in the form of dyadic video call conversations between friends. The conversations range in duration from 20 to 60 minutes. They are segmented into shorter utterances based on manual annotations and are then separated into development and test sets, which consist of 9079 and 8494 utterances, respectively. In total, the development set contains 14 hours and the test set contains 15 hours of speech. There are more than 40 self-reported English accents from 51 different first languages. The statistics and analyses show that EdAcc is linguistically diverse and challenging for current English ASR systems [20]. With more than 40 English accents, the EdAcc corpus covers English accents from four continents: Africa, America, Asia, and Europe. The conversations were manually transcribed by professional transcribers to obtain the manual transcriptions used in the evaluation.

3) Synthetic speech data: Synthetic speech data are generated using the TTS systems and English text prompts. The text prompts used in the TTS inference are selected from the manual transcriptions of the training data of three speech corpora: LJSpeech [28], TED-LIUM [29], and VCTK [30]. The objective of selecting text prompts from independent TTS and ASR corpora is to ensure that these prompts are not related to the evaluation data and are phonetically balanced, since they were designed for TTS and ASR applications. In total, there are 120K text prompts, resulting in 250 hours of synthetic speech data spoken by the speakers present in the training data of the TTS systems.

B. Results & Discussion

Experimental results, in terms of WERs, are shown in Table I. The table reports the WERs computed on the EdAcc development & test sets and the Librispeech (LS) test-clean & test-other sets. The ASR models in Table I are fine-tuned from one Wav2vec2.0 pre-trained model, which was pre-trained on the unsupervised training data of Libri-Light, Common Voice, Switchboard, and Fisher, using different fine-tuning data. The abbreviations used in Table I have the following meaning:

• LS 960h (M): 960 hours of training speech from Librispeech; manual (M) transcriptions are used as labels.
• AccD (P), AccD (M): 58 hours of accented speech training data, using either pseudo-labels (P) or manual (M) transcriptions as labels.
• TTS-LS 960h (M): 250 hours of synthetic non-accented speech data generated by the TTS system trained on the LS 960h (M) data. The speakers are from the LS 960h data.
• TTS-AccD (P), TTS-AccD (M): 250 hours of synthetic accented speech data generated by the TTS systems trained on either the AccD (P) or the AccD (M) data, with speakers from the AccD data.
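A minimal sketch of how such fine-tuning mixtures could be assembled is shown below; the per-corpus CSV manifests and their layout (audio path plus label text) are an assumption for illustration, not the authors' actual data pipeline.

```python
# Assumed layout: one CSV manifest per data source with columns (audio_path, text),
# where "text" holds either manual transcriptions (M) or pseudo-labels (P).
import csv

def load_manifest(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def build_mixture(manifest_paths, out_path="finetune_mixture.csv"):
    rows = [row for p in manifest_paths for row in load_manifest(p)]
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["audio_path", "text"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path

# Unsupervised configuration from the penultimate row of Table I:
# LS 960h (M) + AccD (P) + TTS-AccD (P) + TTS-LS 960h (M)
# build_mixture(["ls960_manual.csv", "accd_pseudo.csv",
#                "tts_accd_pseudo.csv", "tts_ls960_manual.csv"])
```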
We build a baseline model by fine-tuning the Wav2vec2.0 pre-trained model with the LS 960h (M) data. The Wav2vec2.0 pre-trained model and the fine-tuning data that we use are the same as those used to train the baseline model in [20]. We compare the results with our baseline model, which has lower WERs than the Wav2vec2.0 model reported in [20] on the development and test data of both EdAcc and Librispeech (see Table I). Combining the unsupervised accented speech training data AccD (P) with the non-accented speech data LS 960h (M) to fine-tune the pre-trained model yields 1.2% and 1.4% relative WER reductions on the EdAcc dev and test sets, respectively, while the respective relative WER reductions on these sets are 5.2% and 4.8% when the supervised accented speech training data AccD (M) are used.

When the synthetic non-accented speech data TTS-LS 960h (M), which are generated by the supervised TTS system trained on the non-accented speech data LS 960h (M) with manual transcriptions, are included in the fine-tuning, 4.3% and 3.7% relative WER reductions are obtained on the EdAcc dev and test sets, respectively. Since the synthetic non-accented speech data TTS-LS 960h (M) are spoken by the same speakers as the LS 960h (M) data, these relative WER reductions come mainly from the additional acoustic realizations, based on the independent text prompts, that are added to the fine-tuning data through the synthetic non-accented speech data. Larger gains are obtained when the synthetic accented speech data TTS-AccD (P) and TTS-AccD (M) are used, even though the amount of data and the number of speakers in the AccD data used to train the TTS systems are much smaller than those of the non-accented speech data LS 960h (M): 58 hours compared to 960 hours, and 144 speakers compared to 2200 speakers. More specifically, the TTS-AccD (P) data generated by the unsupervised TTS help to achieve 5.5% and 5.4% relative WER reductions on the EdAcc dev and test sets, respectively, while the TTS-AccD (M) data generated by the supervised TTS help to achieve 6.1% and 6.0% relative WER reductions on the EdAcc dev and test sets, respectively.
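For reference, these relative reductions follow directly from the absolute WERs in Table I; a quick check in Python:

```python
# Relative WER reduction against our baseline, using the Table I values.
def relative_werr(baseline_wer: float, new_wer: float) -> float:
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

print(round(relative_werr(32.8, 31.0), 1))  # TTS-AccD (P), EdAcc dev:  5.5
print(round(relative_werr(32.8, 30.8), 1))  # TTS-AccD (M), EdAcc dev:  6.1
print(round(relative_werr(35.1, 33.0), 1))  # TTS-AccD (M), EdAcc test: 6.0
```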

When the accented speech training data AccD and all the synthetic speech data are combined with the non-accented speech data to fine-tune the pre-trained model, further gains are obtained. In the unsupervised scenario, 6.1% and 5.4% relative WER reductions are obtained on the EdAcc dev and test sets, respectively, while the respective relative WER reductions obtained on these sets in the supervised scenario are 7.3% and 6.8%. Using natural accented speech training data and synthetic accented speech data improves the performance on the EdAcc dev and test sets while not harming, and in some cases slightly improving, the ASR performance on the Librispeech test sets. This confirms that considering the Librispeech training data as non-accented speech data in our experiments is reasonable.

V. CONCLUSION

Unsupervised TTS, trained on unsupervised accented speech training data, was used to generate synthetic accented speech data for data augmentation in accented speech recognition. Experiments showed that the Wav2vec2.0 models which used the synthetic accented speech data yielded up to 6.1% relative WER reductions compared to a large Wav2vec2.0 baseline. These gains are close to those obtained in the supervised scenario. The results demonstrate that unsupervised accented speech data, even when available in limited quantities and spoken in different styles by speakers who differ from those in the evaluation data, can be effectively used to train TTS systems for data augmentation. This approach improves accented speech recognition, particularly when the speakers in the unsupervised accented speech data and those in the evaluation data have some overlap in their first languages.

REFERENCES

[1] D. Prabhu, P. Jyothi, S. Ganapathy, and V. Unni, "Accented speech recognition with accent-specific codebooks," in Proc. 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[2] Y.-C. Chen, Z. Yang, C.-F. Yeh, M. Jain, and M. Seltzer, "AIPNet: generative adversarial pre-training of accent-invariant networks for end-to-end speech recognition," in Proc. 2020 IEEE ICASSP, pp. 6979-6983.
[3] N. Das, S. Bodapati, M. Sunkara, S. Srinivasan, and D. Horng Chau, "Best of both worlds: robust accented speech recognition with adversarial transfer learning," in Proc. INTERSPEECH 2021, pp. 1314-1318.
[4] H. Hu et al., "REDAT: accent-invariant representation for end-to-end ASR by domain adversarial training with relabeling," in Proc. 2021 IEEE ICASSP, pp. 6408-6412.
[5] K. Deng, S. Cao, and L. Ma, "Improving accent identification and accented speech recognition under a framework of self-supervised learning," in Proc. INTERSPEECH 2021, pp. 1504-1508.
[6] M. Lucas and Y. Estève, "Improving accented speech recognition with multi-domain training," in Proc. 2023 IEEE ICASSP, pp. 1-5.
[7] A. Jain, M. Upreti, and P. Jyothi, "Improved accented speech recognition using accent embeddings and multi-task learning," in Proc. INTERSPEECH 2018, pp. 2454-2458.
[8] X. Wang, Y. Long, Y. Li, and H. Wei, "Multi-pass training and cross-information fusion for low-resource end-to-end accented speech recognition," in Proc. INTERSPEECH 2023, pp. 2923-2927.
[9] Z. Chen, Y. Zhang, A. Rosenberg, B. Ramabhadran, G. Wang, and P. Moreno, "Injecting text in self-supervised speech pretraining," in Proc. 2021 IEEE ASRU Workshop, pp. 251-258.
[10] X. Zheng, Y. Liu, D. Gunceler, and D. Willett, "Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end ASR systems," in Proc. 2021 IEEE ICASSP, pp. 5674-5678.
[11] A. Rosenberg, Y. Zhang, B. Ramabhadran, Y. Jia, P. Moreno, Y. Wu, and Z. Wu, "Speech recognition with augmented synthesized speech," in Proc. 2019 IEEE ASRU Workshop, pp. 996-1002.
[12] S. Ueno, M. Mimura, S. Sakai, and T. Kawahara, "Data augmentation for ASR using TTS via discrete representation," in Proc. 2021 IEEE ASRU Workshop, pp. 68-75.
[13] G. Zhong, H. Song, R. Wang, L. Sun, D. Liu, J. Pan, X. Fang, J. Du, J. Zhang, and L. Dai, "External text based data augmentation for low-resource speech recognition in the constrained condition of OpenASR21 challenge," in Proc. INTERSPEECH 2022, pp. 4860-4864.
[14] E. Casanova, C. Shulby, A. Korolev, A. C. Junior, A. d. S. Soares, S. Aluísio, and M. A. Ponti, "ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion," in Proc. INTERSPEECH 2023, pp. 1244-1248.
[15] A. Fazel, W. Yang, Y. Liu, R. Barra-Chicote, Y. Meng, R. Maas, and J. Droppo, "SynthASR: unlocking synthetic data for speech recognition," in Proc. INTERSPEECH 2021, pp. 896-900.
[16] J. Ni, L. Wang, H. Gao, K. Qian, Y. Zhang, S. Chang, and M. Hasegawa-Johnson, "Unsupervised text-to-speech synthesis by unsupervised automatic speech recognition," in Proc. INTERSPEECH 2022, pp. 461-465.
[17] G. Karakasidis, N. Robinson, Y. Getman, A. Ogayo, R. Al-Ghezi, A. Ayasi, S. Watanabe, D. R. Mortensen, and M. Kurimo, "Multilingual TTS accent impressions for accented ASR," in Proc. 2023 International Conference on Text, Speech, and Dialogue (TSD), pp. 317-327.
[18] G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "L2-ARCTIC: a non-native English speech corpus," in Proc. INTERSPEECH 2018, pp. 2783-2787.
[19] I. Demirsahin, O. Kjartansson, A. Gutkin, and C. Rivera, "Open-source multi-speaker corpora of the English accents in the British Isles," in Proc. 2020 Conference on Language Resources and Evaluation (LREC).
[20] R. Sanabria, N. Bogoychev, N. Markl, A. Carmantini, O. Klejch, and P. Bell, "The Edinburgh international accents of English corpus: towards the democratization of English ASR," in Proc. 2023 IEEE ICASSP.
[21] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "Wav2vec 2.0: a framework for self-supervised learning of speech representations," in Proc. 2020 Advances in Neural Information Processing Systems (NeurIPS).
[22] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in Proc. 2015 IEEE ICASSP, pp. 5206-5210.
[23] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proc. 2021 International Conference on Machine Learning (ICML).
[24] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. 2014 International Conference on Learning Representations (ICLR).
[25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. 2014 Advances in Neural Information Processing Systems (NIPS).
[26] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. 2020 Advances in Neural Information Processing Systems (NeurIPS).
[27] D. Rezende and S. Mohamed, "Variational inference with normalizing flows," in Proc. 2015 International Conference on Machine Learning (ICML).
[28] K. Ito and L. Johnson, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset, 2017.
[29] A. Rousseau, P. Deléglise, and Y. Estève, "TED-LIUM: an automatic speech recognition dedicated corpus," in Proc. 2012 Conference on Language Resources and Evaluation (LREC), pp. 125-129.
[30] J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," https://doi.org/10.7488/ds/2645, 2017.
