Improving Accented Speech Recognition Using Data Augmentation Based On Unsupervised Text-to-Speech Synthesis
Abstract—This paper investigates the use of unsupervised text-to-speech synthesis (TTS) as a data augmentation method to improve accented speech recognition. TTS systems are trained with a small amount of accented speech training data and their pseudo-labels rather than manual transcriptions, and are hence unsupervised. This approach enables the use of accented speech data without manual transcriptions to perform data augmentation for accented speech recognition. Synthetic accented speech data, generated from text prompts by using the TTS systems, are then combined with available non-accented speech data to train automatic speech recognition (ASR) systems. ASR experiments are performed in a self-supervised learning framework using a Wav2vec2.0 model which was pre-trained on a large amount of unsupervised accented speech data. The accented speech data for training the unsupervised TTS are read speech, selected from the L2-ARCTIC and British Isles corpora, while spontaneous conversational speech from the Edinburgh international accents of English corpus is used as the evaluation data. Experimental results show that Wav2vec2.0 models which are fine-tuned to the downstream ASR task with synthetic accented speech data, generated by the unsupervised TTS, yield up to 6.1% relative word error rate reductions compared to a Wav2vec2.0 baseline which is fine-tuned with the non-accented speech data from the Librispeech corpus.

Index Terms—Accented speech recognition, text-to-speech synthesis, data augmentation, self-supervised learning, Wav2vec2.0

I. INTRODUCTION

Accented speech recognition is an important research topic in automatic speech recognition (ASR). Because of its importance, this topic has been receiving attention and has been addressed with various research approaches. In general, these approaches can be classified as accent-agnostic approaches, in which accents are not modeled explicitly inside the ASR system, and accent-aware approaches, in which additional information about the accent of the input speech is used [1]. Among accent-agnostic approaches, adversarial learning was used to build accent classifiers and perform accent relabeling, which led to performance improvements [2], [3], [4]. In addition, similarity losses such as cosine or contrastive losses were used to build accent-neutral models [5]. In accent-aware approaches, multi-domain training [6], accent embeddings [7], and accent information fusion [8] are among the approaches that have been investigated.

Text-to-speech synthesis (TTS) is a useful technology which can be used to improve ASR in a number of ways, for instance to improve the pre-training of self-supervised learning (SSL) models [9] or to improve the recognition of out-of-vocabulary words in end-to-end ASR [10]. TTS has also been used as a data augmentation method to improve speech recognition on the Librispeech task [11] and in low-resource speech recognition [12], [13], [14]. More specifically, synthetic data were used for data augmentation in the context of low-resource ASR using a conventional hybrid structure [13] and to augment the training of an RNN-T (recurrent neural network transducer) ASR model [15]. In [14], cross-lingual multi-speaker speech synthesis and cross-lingual voice conversion were applied to data augmentation for ASR. The authors showed that it is possible to achieve promising results for ASR model training with just a single-speaker dataset in a target language, making the approach viable for low-resource scenarios [14].

While TTS has been widely used in various ASR tasks, unsupervised TTS, which is trained with unsupervised audio data [16], has not been extensively studied as a data augmentation method in accented speech recognition. In a recent study on using TTS as data augmentation for accented speech recognition [17], accented speech was generated by passing English text prompts through a TTS system for the language corresponding to the target accent. For example, English text prompts passed through a Spanish TTS system will approximate Spanish-accented English. The study in [17] used commercial TTS systems whose training data were not accessible to users.

In this paper, we investigate the use of unsupervised TTS as a data augmentation method to improve accented speech recognition. In our approach, we make use of a small amount of accented speech data which do not have manual transcriptions to train TTS systems. This enables the use of untranscribed accented speech data to perform data augmentation for accented speech recognition. Indeed, from a small amount of unsupervised accented speech data used to train the TTS systems, we can generate a larger amount of synthetic accented speech data once the TTS systems are trained. In this paper, from 58 hours of accented speech data, selected from two read-speech corpora, L2-ARCTIC [18] and British Isles [19], we train unsupervised TTS and generate 250 more hours of synthetic accented speech data, which help to achieve better gains on the evaluation data of spontaneous conversational speech from the Edinburgh international accents of English corpus (EdAcc) [20].
Fig. 1: Unsupervised accented speech training data and their pseudo-labels are used to train unsupervised TTS. The pseudo-labels are generated by decoding the unsupervised accented speech training data using the baseline ASR model obtained from the supervised fine-tuning of the SSL pre-trained model with the supervised non-accented speech data. The unsupervised accented speech training data may be included in the semi-supervised fine-tuning for ASR, and the non-accented speech data may be used to train a TTS system.

Fig. 2: Supervised accented speech training data are used to train supervised TTS. These data may be included in the supervised fine-tuning for ASR, and the non-accented speech training data may be used to train a TTS system. The fine-tuned ASR model is used in the ASR inference, with an external language model (LM), to decode test speech.

… evaluation data. It is actually more viable to find accented speech training data which are spoken by speakers whose first languages are similar to those of the speakers in the evaluation data. Using these speech data to train TTS systems and generate more accented speech data for ASR training should create more accent variability and, hence, improve accented speech recognition performance.
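The pseudo-labeling step in Fig. 1 amounts to decoding the untranscribed accented recordings with the baseline fine-tuned ASR model and keeping the hypotheses as TTS training labels. Below is a minimal sketch of this step using the HuggingFace transformers implementation of a CTC fine-tuned Wav2vec2.0 model; the checkpoint name, file paths, and greedy decoding (no external LM) are illustrative assumptions rather than the exact setup used in this work.

```python
# Sketch: generate pseudo-labels for untranscribed accented speech with a
# CTC fine-tuned Wav2vec2.0 model (greedy decoding, no external LM).
# The checkpoint name and data layout below are illustrative assumptions.
import glob
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

CKPT = "facebook/wav2vec2-large-robust-ft-libri-960h"  # assumed baseline checkpoint
processor = Wav2Vec2Processor.from_pretrained(CKPT)
model = Wav2Vec2ForCTC.from_pretrained(CKPT).eval()

with open("accented_pseudo_labels.tsv", "w") as out:
    for wav_path in sorted(glob.glob("AccD/*.wav")):        # hypothetical data layout
        wave, sr = torchaudio.load(wav_path)
        if sr != 16_000:                                     # Wav2vec2.0 expects 16 kHz audio
            wave = torchaudio.functional.resample(wave, sr, 16_000)
        inputs = processor(wave.squeeze().numpy(), sampling_rate=16_000,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(inputs.input_values).logits
        ids = torch.argmax(logits, dim=-1)                   # greedy CTC decoding
        hypothesis = processor.batch_decode(ids)[0]
        out.write(f"{wav_path}\t{hypothesis}\n")             # pseudo-label for TTS training
```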
Fig. 3: Training of the VITS parallel end-to-end TTS system. The input text can be either manual transcriptions or pseudo-labels.

Fig. 4: TTS inference using the VITS model, where a text prompt and a speaker ID are used as input.
B. Supervised scenario

For comparison, we also examine a supervised scenario where the manual transcriptions of the accented speech training data AccD are available. In Fig. 2, the accented speech training data AccD and their manual transcriptions can be directly used to train a supervised TTS model. They may also be included in the training data used for the supervised fine-tuning of the SSL pre-trained model. Once the TTS model is trained, it can be used during TTS inference to generate synthetic accented speech data from independent text prompts. Both the synthetic accented speech data and the text prompts can then be used for data augmentation in the supervised fine-tuning of the SSL pre-trained model, as sketched below.
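Once synthetic accented speech has been generated, the augmentation itself reduces to pooling the non-accented (audio, transcription) pairs with the synthetic audio labeled by its text prompts before fine-tuning. A minimal sketch of assembling such a pooled fine-tuning list is shown below; the tab-separated inputs, JSON-lines manifest format, and field names are assumptions for illustration, not the toolkit actually used in this work.

```python
# Sketch: pool real (audio, transcription) pairs with synthetic (audio, prompt)
# pairs into one fine-tuning manifest. The file formats and field names are
# illustrative assumptions.
import json

def read_tsv(path):
    """Yield (audio_path, text) pairs from a tab-separated file."""
    with open(path) as f:
        for line in f:
            audio_path, text = line.rstrip("\n").split("\t", 1)
            yield audio_path, text

def write_manifest(pairs, path):
    with open(path, "w") as f:
        for audio_path, text in pairs:
            f.write(json.dumps({"audio": audio_path, "text": text}) + "\n")

# Hypothetical inputs: LS 960h with manual transcriptions, plus synthetic
# accented speech whose labels are simply the TTS text prompts.
librispeech = list(read_tsv("librispeech_960h.tsv"))
synthetic_accented = list(read_tsv("tts_accd_250h.tsv"))

write_manifest(librispeech + synthetic_accented, "finetune_manifest.jsonl")
```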
III. TEXT-TO-SPEECH SYNTHESIS

VITS is an end-to-end multi-speaker TTS system which can generate high-quality waveforms [23]. During the training of VITS (see Fig. 3), a Posterior Encoder encodes the linear spectrogram of natural speech into a latent variable z [24], which is then used by a Decoder to restore the waveform. HiFi-GAN [26], a neural vocoder based on generative adversarial networks (GANs) [25], is used in the decoder to synthesize high-fidelity speech. The latent variable z is also fed into the Flow f, which computes the Kullback-Leibler divergence with the Text Encoder outputs. The Flow f is trained to remove speaker information and reduce the posterior complexity [27]. During training, speaker identities (IDs) are used to extract speaker embeddings for training the multi-speaker TTS.

During TTS inference (see Fig. 4), the inverse transform f⁻¹ of the Flow f is used to synthesize speech. The output of the Text Encoder is stretched by the Length Regulator based on the predicted durations, and the sampled latent variable z′ ∼ N(z′; µθ(text), σθ(text)) is then transformed by the inverse Flow f⁻¹ together with the speaker information. Speech is subsequently generated by the decoder. All the VITS systems in this paper use the same architecture and are trained for the same number of training iterations, i.e. 300K. We observe that after 300K training iterations, the quality of the synthesized waveforms saturates. Further details of the VITS models and their implementation can be found in [23].
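To make the notation above concrete, the snippet below shows, in schematic PyTorch, the reparameterized sampling of z′ from the text-conditioned prior N(µθ(text), σθ(text)) used at inference, together with a diagonal-Gaussian KL term of the kind that couples the posterior and the prior during training. This is a toy illustration of the equations only: the actual VITS objective evaluates the prior on the flow-mapped posterior sample f(z) with a log-determinant correction, and adds adversarial, reconstruction, and duration losses, none of which are reproduced here.

```python
# Schematic illustration of VITS-style latent sampling and the Gaussian KL term.
# Not the actual VITS implementation: the flow f, its log-determinant, the
# adversarial loss, and the duration predictor are all omitted.
import torch

def sample_prior(mu_p, logs_p, noise_scale=1.0):
    """Reparameterized draw z' = mu + sigma * eps from N(mu_p, sigma_p^2)."""
    eps = torch.randn_like(mu_p)
    return mu_p + torch.exp(logs_p) * noise_scale * eps

def diag_gaussian_kl(mu_q, logs_q, mu_p, logs_p):
    """KL( N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2) ) for diagonal Gaussians."""
    kl = logs_p - logs_q - 0.5
    kl = kl + 0.5 * (torch.exp(2 * logs_q) + (mu_q - mu_p) ** 2) * torch.exp(-2 * logs_p)
    return kl.sum(dim=1).mean()

# Toy shapes: batch of 2 utterances, 192-dim latent, 100 expanded text frames.
mu_p, logs_p = torch.zeros(2, 192, 100), torch.zeros(2, 192, 100)   # from text encoder
mu_q, logs_q = torch.randn(2, 192, 100), torch.zeros(2, 192, 100)   # from posterior encoder

z_prime = sample_prior(mu_p, logs_p, noise_scale=0.667)  # inference-time sample
kl_term = diag_gaussian_kl(mu_q, logs_q, mu_p, logs_p)   # training-time regularizer
```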
IV. EXPERIMENTS

The Wav2vec2.0 model is pre-trained with more than 60K hours of unsupervised speech data from the Libri-Light, CommonVoice, Switchboard, and Fisher corpora. These speech training data were spoken in various English accents. The non-accented speech data consist of 960 hours of training data from the Librispeech corpus [22], which include 2,200 speakers. Although these speakers spoke US-English, we consider Librispeech as non-accented in the context of this study because US-English is only one of the English accents in the evaluation data. Subsequent experimental results confirm our assumption.

TABLE I: Word error rates (WERs, %) on the development and test sets of the Edinburgh international accents of English corpus (EdAcc), and on the test-clean and test-other sets of the Librispeech (LS) corpus.

Fine-tuning data                                          | EdAcc dev-set | EdAcc test-set | LS test-clean | LS test-other
LS 960h (M) (baseline in [20])                            |     33.4      |      36.1      |      2.9      |      5.6
LS 960h (M) (our baseline)                                |     32.8      |      35.1      |      2.2      |      4.2
LS 960h (M) + AccD (P)                                    |     32.4      |      34.6      |      2.1      |      4.1
LS 960h (M) + AccD (M)                                    |     31.1      |      33.4      |      2.1      |      4.0
LS 960h (M) + TTS-LS 960h (M)                             |     31.4      |      33.8      |      2.1      |      4.0
LS 960h (M) + TTS-AccD (P)                                |     31.0      |      33.2      |      2.1      |      4.1
LS 960h (M) + TTS-AccD (M)                                |     30.8      |      33.0      |      2.1      |      4.1
LS 960h (M) + AccD (P) + TTS-AccD (P) + TTS-LS 960h (M)   |     30.8      |      33.2      |      2.1      |      4.2
LS 960h (M) + AccD (M) + TTS-AccD (M) + TTS-LS 960h (M)   |     30.4      |      32.7      |      2.1      |      4.1
A. Data

1) Accented speech training data (AccD): We combine data from the L2-ARCTIC corpus [18] and the British Isles corpus [19] as accented speech training data. These are corpora of read speech which were recorded in controlled environments. The L2-ARCTIC corpus is a speech corpus of non-native English which contains 26,867 utterances from 24 non-native English speakers, with an equal number of speakers per accent. The total duration of the corpus is 27.1 hours, with an average of 67.7 minutes of speech per speaker. On average, each utterance is 3.6 seconds in duration. The utterances in L2-ARCTIC are spoken in 6 non-native accents: Arabic, Chinese, Hindi, Korean, Spanish, and Vietnamese.

The British Isles corpus includes speech utterances recorded by volunteers speaking with different accents of the British Isles, namely Ireland, Scotland, Wales, the Midlands, and Northern and Southern England. The corpus consists of 17,877 utterances spoken by 120 speakers, of which 49 are female and 71 are male. The total duration of the corpus is 31 hours. When decoded in the unsupervised scenario, the WERs of the pseudo-labels obtained on the L2-ARCTIC and British Isles training data are 10.7% and 10.2%, respectively.

2) Evaluation data: We use the development and test sets of the Edinburgh international accents of English corpus (EdAcc) [20], which consist of spontaneous conversational speech, as evaluation data. The corpus includes a wide range of first- and second-language varieties of English in the form of dyadic video-call conversations between friends. The conversations range in duration from 20 to 60 minutes. These conversations are segmented into shorter utterances based on manual annotations and are then separated into development and test sets which consist of 9,079 and 8,494 utterances, respectively. In total, the development set contains 14 hours and the test set contains 15 hours of speech. There are more than 40 self-reported English accents from 51 different first languages. The statistics and analyses show that EdAcc is linguistically diverse and challenging for current English ASR systems [20]. With more than 40 English accents, the EdAcc corpus covers English accents from four continents: Africa, America, Asia, and Europe. The conversations were manually transcribed by professional transcribers to obtain the manual transcriptions which are used in the evaluation.
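The pseudo-label WERs quoted above (10.7% and 10.2%) and the evaluation scores in Table I are standard edit-distance word error rates. A minimal way to reproduce such a corpus-level score from parallel reference/hypothesis lists is sketched below using the jiwer package; the package choice and file names are our own illustrative assumptions, not necessarily the scoring tool used in this work.

```python
# Sketch: compute a corpus-level WER from parallel reference/hypothesis lists.
# The jiwer package and the file names are illustrative assumptions.
import jiwer

def load_lines(path):
    with open(path) as f:
        return [line.strip() for line in f if line.strip()]

references = load_lines("edacc_dev.ref.txt")   # manual transcriptions
hypotheses = load_lines("edacc_dev.hyp.txt")   # ASR output, one line per utterance

wer = jiwer.wer(references, hypotheses)        # aggregated over all utterances
print(f"WER = {100 * wer:.1f}%")
```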
3) Synthetic speech data: Synthetic speech data are generated using the TTS systems and English text prompts. The text prompts used in the TTS inference are selected from the manual transcriptions of the training data of three speech corpora: LJSpeech [28], TED-LIUM [29], and VCTK [30]. The objective of selecting text prompts from independent TTS and ASR corpora is to ensure that these prompts are not related to the evaluation data and are phonetically balanced, since they were designed for TTS and ASR applications. In total, there are 120K text prompts, resulting in 250 hours of synthetic speech data spoken by the speakers present in the training data of the TTS systems.
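A batch TTS-inference loop of the kind used to turn the 120K text prompts into roughly 250 hours of synthetic speech can be sketched as follows. The snippet uses the Coqui TTS toolkit and its released multi-speaker VCTK VITS model purely as a stand-in for the VITS systems trained in this work; the model name, speaker assignment, and file layout are assumptions.

```python
# Sketch: synthesize speech for a list of text prompts with a multi-speaker
# VITS model, cycling through the available speaker IDs. The Coqui model name,
# speaker IDs, and paths are illustrative stand-ins for the systems in the paper.
import itertools
import os
from TTS.api import TTS

tts = TTS(model_name="tts_models/en/vctk/vits")   # pretrained multi-speaker VITS
speakers = itertools.cycle(tts.speakers)          # e.g. "p225", "p226", ...

with open("text_prompts.txt") as f:
    prompts = [line.strip() for line in f if line.strip()]

os.makedirs("synthetic", exist_ok=True)
for i, (prompt, speaker) in enumerate(zip(prompts, speakers)):
    tts.tts_to_file(text=prompt,
                    speaker=speaker,
                    file_path=f"synthetic/utt_{i:06d}.wav")
```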
B. Results & Discussion

Experimental results, in terms of WERs, are shown in Table I: the WERs are computed on the EdAcc development & test sets and on the Librispeech (LS) test-clean & test-other sets. The ASR models in Table I are fine-tuned from one Wav2vec2.0 pre-trained model, which was pre-trained on the unsupervised training data of Libri-Light, CommonVoice, Switchboard, and Fisher, using different fine-tuning data. The abbreviations used in Table I have the following meanings:

• LS 960h (M): 960 hours of training speech from Librispeech; manual (M) transcriptions are used as labels.
• AccD (P), AccD (M): 58 hours of accented speech training data, using either pseudo-labels (P) or manual (M) transcriptions as labels.
• TTS-LS 960h (M): 250 hours of synthetic non-accented speech data generated by the TTS system trained on the LS 960h (M) data. The speakers are from the LS 960h data.
• TTS-AccD (P), TTS-AccD (M): 250 hours of synthetic accented speech data generated by the TTS systems trained on either the AccD (P) or the AccD (M) data, with speakers from the AccD data.

We build a baseline model by fine-tuning the Wav2vec2.0 pre-trained model with the LS 960h (M) data. The Wav2vec2.0 pre-trained model and the fine-tuning data that we use are the same as those used to train the baseline model in [20]. We compare the results with our baseline model, which has lower WERs than the Wav2vec2.0 model reported in [20] on the development and test data of both EdAcc and Librispeech (see Table I). Combining the unsupervised accented speech training data AccD (P) with the non-accented speech data LS 960h (M) to fine-tune the pre-trained model yields 1.2% and 1.4% relative WER reductions on the EdAcc dev and test sets, respectively, while the respective relative WER reductions on these sets are 5.2% and 4.8% when the supervised accented speech training data AccD (M) are used. When the synthetic non-accented speech data TTS-LS 960h (M), which are generated by the supervised TTS system trained on the non-accented speech data LS 960h (M) with manual transcriptions, are included in the fine-tuning, 4.3% and 3.7% relative WER reductions are obtained on the EdAcc dev and test sets, respectively. Since the synthetic non-accented speech data TTS-LS 960h (M) are spoken by the same speakers as the LS 960h (M) data, the relative WER reductions come mainly from the additional acoustic realizations, based on the independent text prompts, that the synthetic non-accented speech data add to the fine-tuning data.
Larger gains are obtained when the synthetic accented speech data TTS-AccD (P) and TTS-AccD (M) are used, even though the amount of data and the number of speakers in the AccD data used to train the TTS systems are much smaller than those of the non-accented speech data LS 960h (M): 58 hours compared to 960 hours, and 144 speakers compared to 2,200 speakers. More specifically, the TTS-AccD (P) data generated by the unsupervised TTS help to achieve 5.5% and 5.4% relative WER reductions on the EdAcc dev and test sets, respectively, while the TTS-AccD (M) data generated by the supervised TTS help to achieve 6.1% and 6.0% relative WER reductions on these sets, respectively.
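The relative reductions quoted above follow directly from the absolute WERs in Table I, e.g. (32.8 − 31.0) / 32.8 ≈ 5.5% for TTS-AccD (P) on the EdAcc dev set. A small helper that reproduces these figures from the table is sketched below.

```python
# Relative WER reduction with respect to our LS 960h (M) baseline (Table I).
def rel_wer_reduction(baseline, system):
    return 100.0 * (baseline - system) / baseline

baseline_dev, baseline_test = 32.8, 35.1            # our baseline, EdAcc dev / test

print(rel_wer_reduction(baseline_dev, 31.0))        # TTS-AccD (P), dev  -> ~5.5
print(rel_wer_reduction(baseline_test, 33.2))       # TTS-AccD (P), test -> ~5.4
print(rel_wer_reduction(baseline_dev, 30.8))        # TTS-AccD (M), dev  -> ~6.1
print(rel_wer_reduction(baseline_test, 33.0))       # TTS-AccD (M), test -> ~6.0
```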
When the accented speech training data AccD and all the synthetic speech data are combined with the non-accented speech data to fine-tune the pre-trained model, further gains are obtained. In the unsupervised scenario, 6.1% and 5.4% relative WER reductions are obtained on the EdAcc dev and test sets, respectively, while the respective relative WER reductions obtained on these sets in the supervised scenario are 7.3% and 6.8%. Moreover, using natural accented speech training data and synthetic accented speech data improves the performance on the EdAcc dev and test sets while not harming, and in some cases slightly improving, the ASR performance on the Librispeech test sets. This confirms that considering the Librispeech training data as non-accented speech data in our experiments is a relevant choice.
[17] G. Karakasidis, N. Robinson, Y. Getman, A. Ogayo, R. Al-Ghezi,
V. C ONCLUSION A. Ayasi, S. Watanabe, Mortensen D. R., and M. Kurimo, “Multilingual
TTS accent impressions for accented ASR,” in Proc. 2023 International
Unsupervised TTS, trained on unsupervised accented speech Conference on Text, Speech, and Dialogue (TSD), pp. 317–327.
training data, was used to generate synthetic accented speech [18] G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev-Hudilainen,
J. Levis, and R. Gutierrez-Osuna, “L2-ARCTIC: a non-native English
data for data augmentation in accented speech recognition. speech corpus,” in Proc. INTERSPEECH 2018, pp. 2783–2787.
Experiments showed that the Wav2vec2.0 models which used [19] I. Demirsahin, O. Kjartansson, A. Gutkin, and C. Rivera, “Open-source
the synthetic accented speech data yielded up to 6.1% relative multi-speaker corpora of the English accents in the British Isles,” in
Proc. 2020 Conference on Language Resources and Evaluation (LREC).
WER reductions compared to a large Wav2vec2.0 baseline. [20] R. Sanabria, N. Bogoychev, N. Markl, A. Carmantini, O. Klejch, and
These gains are close to those obtained in the supervised P. Bell, “The Edinburgh international accents of English corpus: towards
scenario. The results demonstrate that unsupervised accented the democratization of English ASR,” in Proc. 2023 IEEE ICASSP.
[21] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: a
speech data, even when available in limited quantities and framework for self-supervised learning of speech representations,” in
are spoken in different styles by speakers who differ from Proc. 2020 Advances in Neural Information Processing Systems.
those in the evaluation data, can be effectively used to train [22] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An
ASR corpus based on public domain audio books,” in Proc. 2015 IEEE
TTS systems for data augmentation. This approach improves ICASSP, pp. 5206–5210.
accented speech recognition, particularly when the speakers in [23] J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with
the unsupervised accented speech data and those in the eval- adversarial learning for end-to-end text-to-speech,” in 2021 International
Conference on Machine Learning.
uation data have some overlaps on speakers’ first languages. [24] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in
REFERENCES

[1] D. Prabhu, P. Jyothi, S. Ganapathy, and V. Unni, "Accented speech recognition with accent-specific codebooks," in Proc. 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP).
[2] Y.-C. Chen, Z. Yang, C.-F. Yeh, M. Jain, and M. Seltzer, "AIPNet: generative adversarial pre-training of accent-invariant networks for end-to-end speech recognition," in Proc. 2020 IEEE ICASSP, pp. 6979–6983.
[3] N. Das, S. Bodapati, M. Sunkara, S. Srinivasan, and D. Horng Chau, "Best of both worlds: robust accented speech recognition with adversarial transfer learning," in Proc. INTERSPEECH 2021, pp. 1314–1318.
[4] H. Hu et al., "REDAT: accent-invariant representation for end-to-end ASR by domain adversarial training with relabeling," in Proc. 2021 IEEE ICASSP, pp. 6408–6412.
[5] K. Deng, S. Cao, and L. Ma, "Improving accent identification and accented speech recognition under a framework of self-supervised learning," in Proc. INTERSPEECH 2021, pp. 1504–1508.
[6] M. Lucas and Y. Estève, "Improving accented speech recognition with multi-domain training," in Proc. 2023 IEEE ICASSP, pp. 1–5.
[7] A. Jain, M. Upreti, and P. Jyothi, "Improved accented speech recognition using accent embeddings and multi-task learning," in Proc. INTERSPEECH 2018, pp. 2454–2458.
[8] X. Wang, Y. Long, Y. Li, and H. Wei, "Multi-pass training and cross-information fusion for low-resource end-to-end accented speech recognition," in Proc. INTERSPEECH 2023, pp. 2923–2927.
[9] Z. Chen, Y. Zhang, A. Rosenberg, B. Ramabhadran, G. Wang, and P. Moreno, "Injecting text in self-supervised speech pretraining," in Proc. 2021 IEEE ASRU Workshop, pp. 251–258.
[10] X. Zheng, Y. Liu, D. Gunceler, and D. Willett, "Using synthetic audio to improve the recognition of out-of-vocabulary words in end-to-end ASR systems," in Proc. 2021 IEEE ICASSP, pp. 5674–5678.
[11] A. Rosenberg, Y. Zhang, B. Ramabhadran, Y. Jia, P. Moreno, Y. Wu, and Z. Wu, "Speech recognition with augmented synthesized speech," in Proc. 2019 IEEE ASRU Workshop, pp. 996–1002.
[12] S. Ueno, M. Mimura, S. Sakai, and T. Kawahara, "Data augmentation for ASR using TTS via a discrete representation," in Proc. 2021 IEEE ASRU Workshop, pp. 68–75.
[13] G. Zhong, H. Song, R. Wang, L. Sun, D. Liu, J. Pan, X. Fang, J. Du, J. Zhang, and L. Dai, "External text based data augmentation for low-resource speech recognition in the constrained condition of OpenASR21 challenge," in Proc. INTERSPEECH 2022, pp. 4860–4864.
[14] E. Casanova, C. Shulby, A. Korolev, A. C. Junior, A. d. S. Soares, S. Aluísio, and M. A. Ponti, "ASR data augmentation in low-resource settings using cross-lingual multi-speaker TTS and cross-lingual voice conversion," in Proc. INTERSPEECH 2023, pp. 1244–1248.
[15] A. Fazel, W. Yang, Y. Liu, R. Barra-Chicote, Y. Meng, R. Maas, and J. Droppo, "SynthASR: unlocking synthetic data for speech recognition," in Proc. INTERSPEECH 2021, pp. 896–900.
[16] J. Ni, L. Wang, H. Gao, K. Qian, Y. Zhang, S. Chang, and M. Hasegawa-Johnson, "Unsupervised text-to-speech synthesis by unsupervised automatic speech recognition," in Proc. INTERSPEECH 2022, pp. 461–465.
[17] G. Karakasidis, N. Robinson, Y. Getman, A. Ogayo, R. Al-Ghezi, A. Ayasi, S. Watanabe, D. R. Mortensen, and M. Kurimo, "Multilingual TTS accent impressions for accented ASR," in Proc. 2023 International Conference on Text, Speech, and Dialogue (TSD), pp. 317–327.
[18] G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, "L2-ARCTIC: a non-native English speech corpus," in Proc. INTERSPEECH 2018, pp. 2783–2787.
[19] I. Demirsahin, O. Kjartansson, A. Gutkin, and C. Rivera, "Open-source multi-speaker corpora of the English accents in the British Isles," in Proc. 2020 Conference on Language Resources and Evaluation (LREC).
[20] R. Sanabria, N. Bogoychev, N. Markl, A. Carmantini, O. Klejch, and P. Bell, "The Edinburgh international accents of English corpus: towards the democratization of English ASR," in Proc. 2023 IEEE ICASSP.
[21] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "Wav2vec 2.0: a framework for self-supervised learning of speech representations," in Proc. 2020 Advances in Neural Information Processing Systems (NeurIPS).
[22] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: an ASR corpus based on public domain audio books," in Proc. 2015 IEEE ICASSP, pp. 5206–5210.
[23] J. Kim, J. Kong, and J. Son, "Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech," in Proc. 2021 International Conference on Machine Learning (ICML).
[24] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. 2014 International Conference on Learning Representations (ICLR).
[25] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Proc. 2014 Advances in Neural Information Processing Systems (NIPS).
[26] J. Kong, J. Kim, and J. Bae, "HiFi-GAN: generative adversarial networks for efficient and high fidelity speech synthesis," in Proc. 2020 Advances in Neural Information Processing Systems (NeurIPS).
[27] D. Rezende and S. Mohamed, "Variational inference with normalizing flows," in Proc. 2015 International Conference on Machine Learning (ICML).
[28] K. Ito and L. Johnson, "The LJ Speech dataset," https://keithito.com/LJ-Speech-Dataset, 2017.
[29] A. Rousseau, P. Deléglise, and Y. Estève, "TED-LIUM: an automatic speech recognition dedicated corpus," in Proc. 2012 Conference on Language Resources and Evaluation (LREC), pp. 125–129.
[30] J. Yamagishi, C. Veaux, and K. MacDonald, "CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)," https://doi.org/10.7488/ds/2645, 2017.