TTS-Guided Training For Accent Conversion Without Parallel Data
Abstract—Accent Conversion (AC) seeks to change the accent of speech from one (source) to another (target) while preserving the speech content and speaker identity. However, many existing AC approaches rely on source-target parallel speech data during training or on reference speech at run-time. We propose a novel accent conversion framework without the need for either parallel data or reference speech. Specifically, a text-to-speech (TTS) system is first pretrained with target-accented speech data. This TTS model and its hidden representations are expected to be associated only with the target accent. Then, a speech encoder is trained to convert the accent of the speech under the supervision of the pretrained TTS model. In doing so, the source-accented speech and its corresponding transcription are forwarded to the speech encoder and the pretrained TTS, respectively. The output of the speech encoder is optimized to be the same as the text embedding in the TTS system. At run-time, the speech encoder is combined with the pretrained speech decoder to convert the source-accented speech toward the target. In the experiments, we converted English with two source accents (Chinese/Indian) to the target accents (American/British/Canadian). Both objective metrics and subjective listening tests validate that the proposed approach generates speech samples that are close to the target accent with high speech quality.

Index Terms—Accent conversion (AC), text-to-speech (TTS).

Manuscript received 17 February 2023; revised 7 April 2023; accepted 8 April 2023. Date of publication 25 April 2023; date of current version 12 May 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 62271432, in part by the Human-Robot Collaborative AI for Advanced Manufacturing and Engineering under Grant A18A2b0046, and in part by the Agency for Science, Technology and Research, Singapore. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Daniele Giacobello. (Corresponding author: Zhizheng Wu.)

Yi Zhou is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117583 (e-mail: [email protected]).

Zhizheng Wu, Mingyang Zhang, and Haizhou Li are with the Shenzhen Research Institute of Big Data, School of Data Science, Chinese University of Hong Kong, Shenzhen 518172, China (e-mail: [email protected]; [email protected]; [email protected]).

Xiaohai Tian is with Bytedance Inc., Singapore 048583 (e-mail: [email protected]).

Digital Object Identifier 10.1109/LSP.2023.3270079

I. INTRODUCTION

Accent Conversion (AC) is a technique that converts speech with a source accent to a target accent. It is expected to modify only the accent-related speech attributes while preserving the speech content and the speaker identity. AC techniques enable many real-life applications, such as language education [1], [2], customer service, movie dubbing [3], and personalized text-to-speech (TTS) [4].

Among the many aspects of proficiency in speaking a language, e.g., lexical, syntactic, semantic, and phonological, pronunciation is one of the most fundamental due to the neuro-musculatory basis of speech production [1], [5], [6]. An accent represents a distinctive way of pronouncing a language. There have been many studies on finding the mapping between accents. Voice morphing methods perform accent conversion by decomposing and modifying the spectral details of speech [7], [8], [9]. Frame-wise feature mapping methods achieve the same by pairing source and target vectors based on their linguistic similarity [2]. Articulatory synthesizer methods take a different path by replacing the source pronunciation unit with the target counterpart through unit selection [6], [10], [11]. A few recent works attempt to build a pronunciation correction model that transforms the phonetic units from the source accent to the target's [3], [12]. However, these successful AC approaches require parallel speech data between the source and target accents during training, which limits the scope of applications [13], [14].

The use of pretrained models provides an effective solution to eliminating the reliance on parallel speech data [4], [13], [14]. Such AC frameworks generally consist of a pretrained automatic speech recognition (ASR) model followed by a pretrained TTS model. Some use the ASR model to extract phonetic posteriorgrams (PPGs) or bottleneck features to characterize the phonetic pronunciation. In [13], a speaker-specific AC model is trained with the source-accented speech to map the PPGs to the acoustic features. To convert the accent of an utterance, it requires a target-accented speaker to speak the same content. There are also studies that synthesize a reference speech with a target-accented TTS model [14]. Such techniques work only when text transcripts are available [14]. Others directly take the output transcription of an ASR model to synthesize speech with a target-accented TTS model [15]. Besides the effort of building an ASR, the recognition accuracy may be degraded by accent variations [12], and so may the accent conversion quality.

In this paper, we propose a novel AC framework without the need for either parallel or reference speech data. In the proposed framework, a TTS system is first trained with target-accented speech data. Then, a speech encoder is trained to convert the accent of the speech under the supervision of the pretrained TTS model. Specifically, a source-accented utterance and its corresponding transcription are presented to the speech encoder and the TTS system, respectively. The speech encoder is trained to minimize the distance between its output and the hidden text embedding in the TTS system. On the other hand, the TTS system is trained only with speech data of the target accent, so its hidden representations, e.g., the text embedding, are expected to carry the target accent. In this way, we force the speech encoder to generate target-accented text embeddings. During conversion, the speech encoder and the speech decoder work in tandem to generate target-accented speech.
Fig. 1. The system diagram of the proposed AC framework. The TTS system is first pretrained with target-accented speech data. Then the speech encoder is trained with source-accented speech guided by the TTS system. The training objective is to minimize the combination of the contrastive loss $L_C$ between the speech embedding and the text embedding, the text classification loss $L_{TC}$ with the ground-truth phoneme sequences, and the reconstruction loss $L_{TTS}$ between the TTS-predicted and converted acoustic features. Only the parameters in the speech encoder are updated.
II. PROPOSED ACCENT CONVERSION FRAMEWORK

The proposed AC framework utilizes a speech encoder to convert source-accented speech into the target pronunciation with the guidance of a pretrained TTS system.

A. Training Process

Among the various speech synthesis tasks, TTS systems, especially the end-to-end (E2E) models [16], [17], [18], have achieved great success thanks to the power of deep learning and the large-scale corpora contributed by the community [19]. Recent advances in voice conversion (VC) have succeeded in non-parallel training by taking advantage of the latent representation of a TTS system [20], [21], [22], [23]. The hidden text embedding in a well-trained TTS model is a robust representation of the underlying speech content [22]. While these VC systems effectively modify the speaker timbre, they are not able to map the pronunciation to the target accent. This motivates us to consider that if the TTS system is trained with ONLY target-accented speech, the text embedding is expected to be free from other accents, which is highly desirable for AC.

Building on the success of prior studies in VC and TTS, we propose a pronunciation mapping mechanism for AC, in which the training of the accent conversion network is supervised by the pretrained TTS network. The proposed framework first trains an E2E TTS system [17] with target-accented speech data from multiple speakers. The TTS system consists of a text encoder and a speech decoder, as shown in Fig. 1. The text encoder takes audio transcriptions as input and generates a text embedding, while the speech decoder generates acoustic features from the text representation conditioned on speaker embeddings [16], [17], [18], [23]. It is assumed that the text embedding in this TTS system represents a feature space that characterizes the pronunciation of the target accent.

Subsequently, we train a speech encoder to find an effective mapping that converts the pronunciation of source-accented speech into the target space. The speech encoder is an attention-based network that takes acoustic features as input and predicts a speech embedding and a phoneme sequence. The attention in the encoder aligns the acoustic features with the phoneme sequence. The speech embedding is a bottleneck feature that represents the linguistic information just before the phoneme prediction. Hence, the speech embedding has the same length as the phoneme sequence and the text embedding, regardless of the speaking rate of the speakers.

Fig. 1 illustrates the proposed AC framework, where the pretrained TTS is fixed. The phoneme transcriptions of the source-accented speech are passed into the text encoder, and the acoustic features are used as the speech encoder input. The output text embeddings are taken by the speech encoder to learn the target pronunciation from the acoustic features. The model parameters in the speech encoder are updated under the supervision of the corresponding phoneme sequence, the text embedding, and the acoustic features predicted by the pretrained TTS. The total loss function is defined as follows:

$L_{SE} = w_C L_C + w_{TC} L_{TC} + L_{TTS}$,   (1)

where $L_{SE}$ is the total loss of the speech encoder. $L_C$ and $L_{TC}$ denote the contrastive loss and the text classification loss, while $w_C$ and $w_{TC}$ are their respective weights. The contrastive loss is used to guide the speech embedding toward the text embedding [24]. $L_{TTS}$ is the reconstruction loss between the TTS-predicted and the converted acoustic features [17].
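For concreteness, a minimal PyTorch sketch of the loss in (1) is given below. The InfoNCE-style form of the contrastive term, the temperature, and the tensor shapes are our assumptions for illustration; the paper follows the contrastive loss of [24] and a Tacotron2-style reconstruction loss [17].

import torch
import torch.nn.functional as F

def speech_encoder_loss(speech_emb, text_emb, phone_logits, phone_targets,
                        mel_pred, mel_converted, w_c=30.0, w_tc=1.0):
    """Combined speech-encoder loss of (1): L_SE = w_C L_C + w_TC L_TC + L_TTS.

    speech_emb, text_emb: (batch, num_phones, dim) phoneme-level embeddings.
    phone_logits: (batch, num_phones, num_classes) phoneme predictions.
    phone_targets: (batch, num_phones) ground-truth phoneme indices.
    mel_pred, mel_converted: (batch, frames, n_mels) TTS-predicted and converted features.
    The default weights follow the experimental setting (w_C = 30, w_TC = 1).
    """
    # L_C: an InfoNCE-style contrastive loss that pulls each speech embedding toward
    # its paired text embedding and pushes it away from the other positions.
    s = F.normalize(speech_emb.flatten(0, 1), dim=-1)           # (B*T, D)
    t = F.normalize(text_emb.flatten(0, 1), dim=-1)             # (B*T, D)
    logits = s @ t.T / 0.1                                       # cosine similarity / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    l_c = F.cross_entropy(logits, labels)

    # L_TC: text (phoneme) classification loss on the encoder predictions.
    l_tc = F.cross_entropy(phone_logits.flatten(0, 1), phone_targets.flatten())

    # L_TTS: reconstruction loss between TTS-predicted and converted acoustic features.
    l_tts = F.mse_loss(mel_converted, mel_pred)

    return w_c * l_c + w_tc * l_tc + l_tts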
Upon completion of the speech encoder training, the speech embedding should reside in the same feature space as the text embedding that represents the target accent information. The source-accented speech is thereby projected onto the target phoneme representation, and AC is thus achieved.
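To make the TTS-guided supervision concrete, the sketch below outlines one training step, reusing the speech_encoder_loss function sketched above. The module names (speech_encoder, tts.text_encoder, tts.decoder) are hypothetical interfaces rather than the authors' code, and the optimizer is assumed to be built over the speech encoder parameters only, so the pretrained TTS stays fixed.

import torch

def train_step(speech_encoder, tts, optimizer, mel_src, phonemes, phone_targets, spk_emb):
    """One TTS-guided training step; only the speech encoder receives gradient updates."""
    tts.eval()
    for p in tts.parameters():
        p.requires_grad_(False)                      # keep the pretrained TTS frozen

    with torch.no_grad():
        text_emb = tts.text_encoder(phonemes)        # target-accented text embedding
        mel_pred = tts.decoder(text_emb, spk_emb)    # TTS-predicted acoustic features

    speech_emb, phone_logits = speech_encoder(mel_src)
    mel_converted = tts.decoder(speech_emb, spk_emb) # converted acoustic features

    loss = speech_encoder_loss(speech_emb, text_emb, phone_logits, phone_targets,
                               mel_pred, mel_converted)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # updates the speech encoder only
    return loss.item()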
B. Conversion Process

During conversion, speech transcriptions are not available. First, the speech encoder takes acoustic features extracted from the source-accented speech and encodes them into the speech embedding. Then, the speech decoder of the pretrained TTS generates acoustic features from the speech embedding. The converted speech waveform is obtained using a neural vocoder.
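A hedged sketch of this run-time pipeline is shown below; speech_encoder, tts_decoder, and vocoder are placeholder modules for the trained speech encoder, the pretrained TTS speech decoder, and the neural vocoder, and the speaker-embedding argument reflects the decoder conditioning described in Section II-A.

import torch

@torch.no_grad()
def convert_accent(mel_source, speech_encoder, tts_decoder, vocoder, spk_emb):
    """Source-accented mel -> speech embedding -> target-accented mel -> waveform."""
    # 1. Encode the source-accented acoustic features into the speech embedding,
    #    which lies in the same space as the target-accented text embedding.
    speech_emb, _ = speech_encoder(mel_source)
    # 2. The pretrained TTS speech decoder generates target-accented acoustic
    #    features, conditioned on the speaker embedding.
    mel_converted = tts_decoder(speech_emb, spk_emb)
    # 3. A neural vocoder (Parallel WaveGAN in the experiments) synthesizes the waveform.
    return vocoder(mel_converted)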
TABLE I
THE DESCRIPTION OF PARALLEL DATA REQUIREMENTS FOR DIFFERENT AC SYSTEMS. THE TTS SYSTEM IS OMITTED FROM THIS TABLE SINCE IT IS NOT AN ACCENT CONVERSION SYSTEM

TABLE II
THE AVERAGE WER AND PER RESULTS. SOURCE AND TARGET DENOTE THE NATURAL SPEECH WITH THE SOURCE AND TARGET ACCENTS, RESPECTIVELY
III. EXPERIMENTS

In this work, we refer to American, British, and Canadian English as target accents, while Indian and Chinese English are considered as source accents. We converted the English speech of Indian and Chinese speakers to the target accents.

A. Experimental Systems and Setups

In the experiment, we selected 4 speakers with Indian or Chinese accents, ASI (male), TNI (female), TXHC (male), and LXC (female), from the L2-ARCTIC corpus [25]. For each speaker, 1031 utterances were used during training, and the remaining 100 utterances were used for conversion.

The Parallel WaveGAN neural vocoder [26] was trained with the VCTK corpus [27]. All speech signals were resampled at 16 kHz. The 80-dimensional Mel-spectrogram was extracted as the acoustic features with a window size of 50 ms and a frameshift of 12.5 ms.
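The feature extraction can be reproduced roughly as in the sketch below, using librosa; the FFT size and the log compression are our assumptions, since only the sampling rate, window size, and frameshift are specified above.

import librosa
import numpy as np

def extract_mel(wav_path, sr=16000, n_mels=80, win_ms=50.0, hop_ms=12.5, n_fft=1024):
    """80-dim log-Mel-spectrogram at 16 kHz with a 50 ms window and a 12.5 ms frameshift."""
    y, _ = librosa.load(wav_path, sr=sr)                 # load and resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft,
        win_length=int(sr * win_ms / 1000),              # 800 samples
        hop_length=int(sr * hop_ms / 1000),              # 200 samples
        n_mels=n_mels,
    )
    return np.log(np.maximum(mel, 1e-10)).T              # (frames, 80)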
We implemented 5 different systems using different techniques for comparison. Table I gives an overview of the parallel data requirements of all conversion systems. We note that our proposed TTS-AC does not require parallel data during either the training or the conversion process.

(a) parallel-VC: This is a sequence-to-sequence voice conversion (VC) system taking parallel speech data between two speakers with different accents. We selected speakers bdl (male) and slt (female) from the CMU ARCTIC database [28] as the targets; both have the American accent. We had 4 source-to-target conversion speaker pairs: ASI-to-bdl, TNI-to-slt, TXHC-to-bdl, and LXC-to-slt. It converted the 80-dimensional Mel-spectrogram from the source-accented to the target-accented speech. The model followed the same implementation as the speech-embedding-to-Mel-spectrogram synthesizer in [3]; the only difference was the dimension of the input layer. In this system, the converted speech does not preserve the speaker identity.

(b) BNF-AC: This is an AC system that takes PPGs as input and generates acoustic features [13]. Four speaker-dependent models were trained with the source-accented speech of the selected speakers [13]. This system needs reference speech from speakers with the target accents for conversion. So, we used the same speech data from speakers bdl and slt, and there were 4 target-for-source speaker pairs: bdl-for-ASI, slt-for-TNI, bdl-for-TXHC, and slt-for-LXC. The 256-dimensional bottleneck features [3] extracted by a pretrained acoustic model were used as the PPG inputs. The bottleneck details can be found in [29]. The output was the 80-dimensional Mel-spectrogram. The model configurations were the same as those of the speech synthesizer in [3].

(c) BNF-PC-AC: This is an extension of system (b) with an additional pronunciation correction model. It first created target-accented speech in the desired speaker's voice using parallel speech data. Then, a pronunciation correction model was implemented to convert the acoustic features of source-accented speech to the synthesized target-accented speech [3]. The pronunciation correction model used the same network configurations as the VC model in system (a). This system was speaker dependent, so we trained 4 models.

(d) TTS: This is a multi-speaker TTS system, which is the pretrained TTS described in Section II-A. It used speech data of American, British, and Canadian speakers from the VCTK corpus [27]. The selected training data included 63 speakers and approximately 20.5 hours. It adopted the Tacotron2 network architecture [17], and the 256-dimensional speaker embedding [30] was added to the speech decoder [23]. This serves as the upper bound of our proposed TTS-AC system.

(e) TTS-AC: This is the proposed AC system, as described in Section II. The TTS pretraining was the same as in system (d). The speech encoder was speaker specific. It consisted of 2 pyramid BLSTM layers [31] of 256 units in each direction, 1 LSTM layer of 512 units with location-aware attention, and a final softmax layer of 512 units. $w_C$ and $w_{TC}$ in the loss function (1) were set to 30 and 1, respectively. At run-time, it adopted beam search with a window size of 10. A sketch of this encoder backbone is given after this list.
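A minimal PyTorch sketch of the pyramid BLSTM backbone of this speech encoder follows; the location-aware attention decoder and the final softmax layer over phonemes are omitted, and the frame-pair reduction scheme is an assumption in the spirit of [31].

import torch
import torch.nn as nn

class PyramidBLSTMEncoder(nn.Module):
    """Sketch of the BLSTM part of the speech encoder: 2 BLSTM layers with 256 units
    per direction and a pyramid time reduction (concatenating frame pairs) between
    them, in the spirit of [31]."""

    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.blstm1 = nn.LSTM(n_mels, hidden, bidirectional=True, batch_first=True)
        self.blstm2 = nn.LSTM(4 * hidden, hidden, bidirectional=True, batch_first=True)

    @staticmethod
    def _reduce(x):
        # Concatenate every two consecutive frames, halving the sequence length.
        b, t, d = x.shape
        if t % 2:
            x = x[:, :-1]
        return x.reshape(b, t // 2, 2 * d)

    def forward(self, mel):                      # mel: (batch, frames, 80)
        h, _ = self.blstm1(mel)                  # (batch, frames, 512)
        h, _ = self.blstm2(self._reduce(h))      # (batch, frames // 2, 512)
        return h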
B. Objective Evaluations

We report the word error rate (WER) and the intra-accent and inter-accent phoneme error rates (PER) as objective metrics.

1) Word Error Rate (WER): Speech was transcribed using the Google Cloud Speech-to-Text¹ service. English (United States) was chosen, so the selected ASR was trained with a large amount of American English speech data. The results are summarized in Table II. A smaller WER value stands for better performance.
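For reference, WER is the word-level edit distance between the ASR transcript and the ground-truth text, normalized by the number of reference words; a self-contained sketch is shown below (the exact scoring tool and text normalization used here are not specified in the paper).

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)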
We note that the WER values of the source and target speech are 29.83% and 5.79%, respectively. These results show that ASR performance deteriorates under a mismatched accent. We then further compare the speech generated by the different systems. The WER results of systems (b) and (c) are similar (28.47% and 28.93%), which is also comparable with the previous study [3]. System (a), which uses parallel data, achieves a better result of 18.14%. The proposed TTS-AC system obtains a low WER value of 15.75%, which is similar to that of the TTS system. It indicates that the proposed TTS-guided training greatly retains the capability of the pretrained TTS system in

¹ [Online]. Available: https://cloud.google.com/speech-to-text, accessed on 20/10/2022.
Fig. 2. MUSHRA test results for speech quality. source and target refer to the natural speech with the source and target accents, respectively, which are used as references.

Fig. 3. Accentedness test results with 95% confidence intervals. A lower value means that the speech sounds more similar to the target accent, and vice versa. The source and target are natural speech used as references.
REFERENCES

[1] D. Felps, H. Bortfeld, and R. Gutierrez-Osuna, “Foreign accent conversion in computer assisted pronunciation training,” Speech Commun., vol. 51, no. 10, pp. 920–932, 2009.
[2] S. Aryal and R. Gutierrez-Osuna, “Can voice conversion be used to reduce non-native accents?,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2014, pp. 7879–7883.
[3] G. Zhao, S. Ding, and R. Gutierrez-Osuna, “Converting foreign accent speech without a reference,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 2367–2381, 2021.
[4] S. Ding, G. Zhao, and R. Gutierrez-Osuna, “Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning,” Comput. Speech Lang., vol. 72, 2022, Art. no. 101302.
[5] T. Scovel, A Time to Speak: A Psycholinguistic Inquiry Into the Critical Period for Human Speech. Belmont, CA, USA: Wadsworth, 1988.
[6] D. Felps, C. Geng, and R. Gutierrez-Osuna, “Foreign accent conversion through concatenative synthesis in the articulatory domain,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 8, pp. 2301–2312, Oct. 2012.
[7] M. Huckvale and K. Yanagisawa, “Spoken language conversion with accent morphing,” in Proc. 6th ISCA Speech Synth. Workshop, 2007, pp. 64–70.
[8] D. Felps and R. Gutierrez-Osuna, “Developing objective measures of foreign-accent conversion,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 5, pp. 1030–1040, Jul. 2010.
[9] S. Aryal, D. Felps, and R. Gutierrez-Osuna, “Foreign accent conversion through voice morphing,” in Proc. Interspeech, 2013, pp. 3077–3081.
[10] S. Aryal and R. Gutierrez-Osuna, “Reduction of non-native accents through statistical parametric articulatory synthesis,” J. Acoust. Soc. Amer., vol. 137, no. 1, pp. 433–446, 2015.
[11] S. Aryal and R. Gutierrez-Osuna, “Articulatory-based conversion of foreign accents with deep neural networks,” in Proc. Interspeech, 2015.
[12] W. Quamer, A. Das, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, “Zero-shot foreign accent conversion without a native reference,” in Proc. Interspeech, 2022, pp. 4920–4924.
[13] G. Zhao, S. Sonsaat, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, “Accent conversion using phonetic posteriorgrams,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 5314–5318.
[14] W. Li et al., “Improving accent conversion with reference encoder and end-to-end text-to-speech,” 2020, arXiv:2005.09271.
[15] S. Liu et al., “End-to-end accent conversion without using native utterances,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 6289–6293.
[16] W. Ping et al., “Deep Voice 3: 2000-speaker neural text-to-speech,” in Proc. Int. Conf. Learn. Representations, 2018, pp. 214–217.
[17] J. Shen et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 4779–4783.
[18] Y. Ren et al., “FastSpeech 2: Fast and high-quality end-to-end text to speech,” in Proc. Int. Conf. Learn. Representations, 2020.
[19] W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” in Proc. Interspeech, 2020, pp. 4676–4680.
[20] M. Zhang, X. Wang, F. Fang, H. Li, and J. Yamagishi, “Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet,” in Proc. Interspeech, 2019, pp. 1298–1302.
[21] S.-W. Park, D.-Y. Kim, and M.-C. Joe, “Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data,” in Proc. Interspeech, 2020, pp. 4696–4700.
[22] W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, “Pretraining techniques for sequence-to-sequence voice conversion,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 745–755, 2021.
[23] M. Zhang, Y. Zhou, L. Zhao, and H. Li, “Transfer learning from speech synthesis to voice conversion with non-parallel training data,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 1290–1302, 2021.
[24] J.-X. Zhang, Z.-H. Ling, and L.-R. Dai, “Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 540–552, 2020.
[25] G. Zhao et al., “L2-ARCTIC: A non-native English speech corpus,” in Proc. Interspeech, 2018, pp. 2783–2787.
[26] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 6199–6203.
[27] C. Veaux et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” The Centre for Speech Technol. Res., Univ. of Edinburgh, Edinburgh, U.K., 2017.
[28] J. Kominek and A. W. Black, “The CMU ARCTIC speech databases,” in Proc. 5th ISCA Speech Synth. Workshop, 2004, pp. 223–224.
[29] Y. Zhou, X. Tian, and H. Li, “Language agnostic speaker embedding for cross-lingual personalized speech generation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3427–3439, 2021.
[30] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 4879–4883.
[31] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2016, pp. 4960–4964.
[32] Method for the Subjective Assessment of Intermediate Sound Quality, Rec. ITU-R BS.1534-1, International Telecommunication Union, Geneva, Switzerland, 2003.
[33] M. J. Munro and T. M. Derwing, “Foreign accent, comprehensibility, and intelligibility in the speech of second language learners,” Lang. Learn., vol. 45, no. 1, pp. 73–97, 1995.
[34] J. Lorenzo-Trueba et al., “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Interspeech, 2018, pp. 195–202.
[35] Y. Zhao et al., “Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion,” in Proc. Joint Workshop Blizzard Challenge Voice Convers. Challenge, 2020, pp. 80–98.