TTS-Guided Training For Accent Conversion Without Parallel Data
Abstract—Accent Conversion (AC) seeks to change the accent of speech from one (source) to another (target) while preserving the speech content and speaker identity. However, many existing AC approaches rely on source-target parallel speech data during training or on reference speech at run-time. We propose a novel accent conversion framework without the need for either parallel data or reference speech. Specifically, a text-to-speech (TTS) system is first pretrained with target-accented speech data. This TTS model and its hidden representations are expected to be associated only with the target accent. Then, a speech encoder is trained to convert the accent of the speech under the supervision of the pretrained TTS model. In doing so, the source-accented speech and its corresponding transcription are forwarded to the speech encoder and the pretrained TTS, respectively. The output of the speech encoder is optimized to be the same as the text embedding in the TTS system. At run-time, the speech encoder is combined with the pretrained speech decoder to convert the source-accented speech toward the target. In the experiments, we converted English with two source accents (Chinese/Indian) to the target accents (American/British/Canadian). Both objective metrics and subjective listening tests validate that the proposed approach generates speech samples that are close to the target accent with high speech quality.

Index Terms—Accent conversion (AC), text-to-speech (TTS).

Manuscript received 17 February 2023; revised 7 April 2023; accepted 8 April 2023. Date of publication 25 April 2023; date of current version 12 May 2023. This work was supported in part by the National Natural Science Foundation of China under Grant 62271432, in part by the Human-Robot Collaborative AI for Advanced Manufacturing and Engineering under Grant A18A2b0046, and in part by the Agency for Science, Technology and Research, Singapore. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Daniele Giacobello. (Corresponding author: Zhizheng Wu.)

Yi Zhou is with the Department of Electrical and Computer Engineering, National University of Singapore, Singapore 117583 (e-mail: [email protected]).

Zhizheng Wu, Mingyang Zhang, and Haizhou Li are with the Shenzhen Research Institute of Big Data, School of Data Science, Chinese University of Hong Kong, Shenzhen 518172, China (e-mail: [email protected]; [email protected]; [email protected]).

Xiaohai Tian is with Bytedance Inc., Singapore 048583 (e-mail: [email protected]).

Digital Object Identifier 10.1109/LSP.2023.3270079

I. INTRODUCTION

Accent Conversion (AC) is a technique that converts speech with a source accent to a target accent. It is expected to modify only the accent-related speech attributes while preserving the speech content and the speaker identity. AC techniques enable many real-life applications, such as language education [1], [2], customer service, movie dubbing [3], and personalized text-to-speech (TTS) [4].

Among the many aspects of proficiency in speaking a language, e.g., lexical, syntactic, semantic, and phonological, pronunciation is one of the most fundamental due to the neuro-musculatory basis of speech production [1], [5], [6]. An accent represents a distinctive way of pronouncing a language. There have been many studies on finding the mapping between accents. Voice morphing methods perform accent conversion by decomposing and modifying the spectral details of speech [7], [8], [9]. Frame-wise feature mapping methods achieve the same by pairing source and target vectors based on their linguistic similarity [2]. Articulatory synthesizer methods take a different path by replacing the source pronunciation unit with the target counterpart through unit selection [6], [10], [11]. A few recent works attempt to build a pronunciation correction model that transforms the phonetic units from the source accent to the target's [3], [12]. However, these successful AC approaches require parallel speech data between the source and target accents during training, which limits the scope of applications [13], [14].

The use of pretrained models provides an effective solution to eliminating the reliance on parallel speech data [4], [13], [14]. Such AC frameworks generally consist of a pretrained automatic speech recognition (ASR) model followed by a pretrained TTS model. Some use the ASR model to extract phonetic posteriorgrams (PPGs) or bottleneck features to characterize the phonetic pronunciation. In [13], a speaker-specific AC model is trained with the source-accented speech to map the PPGs to the acoustic features. To convert the accent of an utterance, it requires a target-accented speaker to speak the same content. There are also studies that synthesize a reference speech with a target-accented TTS model [14]. Such techniques work only when text transcripts are available [14]. Others directly take the output transcription of an ASR model to synthesize speech with a target-accented TTS model [15]. Besides the effort of building an ASR, the recognition accuracy may be degraded by accent variations [12], and so may the accent conversion quality.

In this paper, we propose a novel AC framework without the need for either parallel or reference speech data. In the proposed framework, a TTS system is first trained with target-accented speech data. Then, a speech encoder is trained to convert the accent of the speech under the supervision of the pretrained TTS model. Specifically, a source-accented utterance and its corresponding transcription are presented to the speech encoder and the TTS system, respectively. The speech encoder is trained to minimize the distance between its output and the hidden text embedding in the TTS system. On the other hand, the TTS system is trained only with speech data of the target accent, so its hidden representations, e.g., the text embedding, are expected to carry the target accent. In this way, we force the speech encoder to generate target-accented text embeddings. During conversion, the speech encoder and the speech decoder work in tandem to generate target-accented speech.
Fig. 1. The system diagram of the proposed AC framework. The TTS system is first pretrained with target-accented speech data. Then the speech encoder is trained with source-accented speech guided by the TTS system. The training objective is to minimize the combination of the contrastive loss $L_C$ between the speech embedding and the text embedding, the text classification loss $L_{TC}$ with the ground-truth phoneme sequences, and the reconstruction loss $L_{TTS}$ between the TTS-predicted and converted acoustic features. Only the parameters in the speech encoder are updated.
II. PROPOSED ACCENT CONVERSION FRAMEWORK

The proposed AC framework utilizes a speech encoder to convert source-accented speech into the target pronunciation with the guidance of a pretrained TTS system.

A. Training Process

Among the various speech synthesis tasks, TTS systems, especially the end-to-end (E2E) models [16], [17], [18], have achieved great success thanks to the power of deep learning and the large-scale corpora contributed by the community [19]. Recent advances in voice conversion (VC) have succeeded in non-parallel training by taking advantage of the latent representation of a TTS system [20], [21], [22], [23]. The hidden text embedding in a well-trained TTS model is a robust representation of the underlying speech content [22]. While these VC systems effectively modify the speaker timbre, they are not able to map the pronunciation to the target accent. This motivates us to consider that if the TTS system is trained with ONLY target-accented speech, the text embedding is expected to be free from other accents, which is highly desirable for AC.

Building on the success of prior studies in VC and TTS, we propose a pronunciation mapping mechanism for AC, in which the training of the accent conversion network is supervised by the pretrained TTS network. The proposed framework first trains an E2E TTS system [17] with target-accented speech data from multiple speakers. The TTS system consists of a text encoder and a speech decoder, as shown in Fig. 1. The text encoder takes audio transcriptions as input and generates a text embedding, while the speech decoder generates acoustic features from the text representation conditioned on speaker embeddings [16], [17], [18], [23]. It is assumed that the text embedding in this TTS system represents a feature space that characterizes the pronunciation of the target accent.

Subsequently, we train a speech encoder to find an effective mapping that converts the pronunciation of source-accented speech into the target space. The speech encoder is an attention-based network that takes acoustic features as input and predicts a speech embedding and a phoneme sequence. The attention in the encoder aligns the acoustic features with the phoneme sequence. The speech embedding is a bottleneck feature that represents the linguistic information just before the phoneme prediction. Hence, the speech embedding has the same length as the phoneme sequence and the text embedding, regardless of the speaking rate of the speakers.

Fig. 1 illustrates the proposed AC framework, where the pretrained TTS is fixed. The phoneme transcriptions of the source-accented speech are passed into the text encoder, and the acoustic features are used as the speech encoder input. The output text embeddings are taken by the speech encoder to learn the target pronunciation from the acoustic features. The model parameters in the speech encoder are updated under the supervision of the corresponding phoneme sequence, the text embedding, and the acoustic features predicted by the pretrained TTS. The total loss function is defined as follows:

$L_{SE} = w_C L_C + w_{TC} L_{TC} + L_{TTS}$,   (1)

where $L_{SE}$ is the total loss of the speech encoder. $L_C$ and $L_{TC}$ denote the contrastive loss and the text classification loss, while $w_C$ and $w_{TC}$ are their respective weights. The contrastive loss is used to guide the speech embedding toward the text embedding [24]. $L_{TTS}$ is the reconstruction loss between the TTS-predicted and the converted acoustic features [17].
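For concreteness, a minimal PyTorch sketch of the loss in (1) is given below. The InfoNCE-style form of the contrastive term, the temperature, and the tensor shapes are our assumptions for illustration; the paper follows the contrastive loss of [24] and a Tacotron2-style reconstruction loss [17].

import torch
import torch.nn.functional as F

def speech_encoder_loss(speech_emb, text_emb, phone_logits, phone_targets,
                        mel_pred, mel_converted, w_c=30.0, w_tc=1.0):
    """Combined speech-encoder loss of (1): L_SE = w_C L_C + w_TC L_TC + L_TTS.

    speech_emb, text_emb: (batch, num_phones, dim) phoneme-level embeddings.
    phone_logits: (batch, num_phones, num_classes) phoneme predictions.
    phone_targets: (batch, num_phones) ground-truth phoneme indices.
    mel_pred, mel_converted: (batch, frames, n_mels) TTS-predicted and converted features.
    The default weights follow the experimental setting (w_C = 30, w_TC = 1).
    """
    # L_C: an InfoNCE-style contrastive loss that pulls each speech embedding toward
    # its paired text embedding and pushes it away from the other positions.
    s = F.normalize(speech_emb.flatten(0, 1), dim=-1)           # (B*T, D)
    t = F.normalize(text_emb.flatten(0, 1), dim=-1)             # (B*T, D)
    logits = s @ t.T / 0.1                                       # cosine similarity / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    l_c = F.cross_entropy(logits, labels)

    # L_TC: text (phoneme) classification loss on the encoder predictions.
    l_tc = F.cross_entropy(phone_logits.flatten(0, 1), phone_targets.flatten())

    # L_TTS: reconstruction loss between TTS-predicted and converted acoustic features.
    l_tts = F.mse_loss(mel_converted, mel_pred)

    return w_c * l_c + w_tc * l_tc + l_tts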
Upon completion of the speech encoder training, the speech embedding should reside in the same feature space as the text embedding that represents the target accent information. The source-accented speech is thereby projected onto the target phoneme representation, and AC is thus achieved.
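To make the TTS-guided supervision concrete, the sketch below outlines one training step, reusing the speech_encoder_loss function sketched above. The module names (speech_encoder, tts.text_encoder, tts.decoder) are hypothetical interfaces rather than the authors' code, and the optimizer is assumed to be built over the speech encoder parameters only, so the pretrained TTS stays fixed.

import torch

def train_step(speech_encoder, tts, optimizer, mel_src, phonemes, phone_targets, spk_emb):
    """One TTS-guided training step; only the speech encoder receives gradient updates."""
    tts.eval()
    for p in tts.parameters():
        p.requires_grad_(False)                      # keep the pretrained TTS frozen

    with torch.no_grad():
        text_emb = tts.text_encoder(phonemes)        # target-accented text embedding
        mel_pred = tts.decoder(text_emb, spk_emb)    # TTS-predicted acoustic features

    speech_emb, phone_logits = speech_encoder(mel_src)
    mel_converted = tts.decoder(speech_emb, spk_emb) # converted acoustic features

    loss = speech_encoder_loss(speech_emb, text_emb, phone_logits, phone_targets,
                               mel_pred, mel_converted)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                 # updates the speech encoder only
    return loss.item()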
B. Conversion Process

During conversion, speech transcriptions are not available. First, the speech encoder takes acoustic features extracted from the source-accented speech and encodes them into the speech embedding. Then, the speech decoder of the pretrained TTS generates acoustic features from the speech embedding. The converted speech waveform is obtained using a neural vocoder.
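A hedged sketch of this run-time pipeline is shown below; speech_encoder, tts_decoder, and vocoder are placeholder modules for the trained speech encoder, the pretrained TTS speech decoder, and the neural vocoder, and the speaker-embedding argument reflects the decoder conditioning described in Section II-A.

import torch

@torch.no_grad()
def convert_accent(mel_source, speech_encoder, tts_decoder, vocoder, spk_emb):
    """Source-accented mel -> speech embedding -> target-accented mel -> waveform."""
    # 1. Encode the source-accented acoustic features into the speech embedding,
    #    which lies in the same space as the target-accented text embedding.
    speech_emb, _ = speech_encoder(mel_source)
    # 2. The pretrained TTS speech decoder generates target-accented acoustic
    #    features, conditioned on the speaker embedding.
    mel_converted = tts_decoder(speech_emb, spk_emb)
    # 3. A neural vocoder (Parallel WaveGAN in the experiments) synthesizes the waveform.
    return vocoder(mel_converted)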
TABLE I
THE DESCRIPTION OF PARALLEL DATA REQUIREMENTS FOR DIFFERENT AC SYSTEMS. THE TTS SYSTEM IS OMITTED FROM THIS TABLE SINCE IT IS NOT AN ACCENT CONVERSION SYSTEM

TABLE II
THE AVERAGE WER AND PER RESULTS. SOURCE AND TARGET DENOTE THE NATURAL SPEECH WITH THE SOURCE AND TARGET ACCENTS, RESPECTIVELY
III. EXPERIMENTS

In this work, we refer to American, British, and Canadian English as target accents, while Indian and Chinese English are considered as source accents. We converted the English speech of Indian and Chinese speakers to the target accents.

A. Experimental Systems and Setups

In the experiment, we selected 4 speakers with Indian or Chinese accents, ASI (male), TNI (female), TXHC (male), and LXC (female), from the L2-ARCTIC corpus [25]. For each speaker, 1031 utterances were used during training, and the remaining 100 utterances were used for conversion.

The Parallel WaveGAN neural vocoder [26] was trained with the VCTK corpus [27]. All speech signals were resampled at 16 kHz. The 80-dimensional Mel-spectrogram was extracted as the acoustic features with a window size of 50 ms and a frameshift of 12.5 ms.
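The feature extraction can be reproduced roughly as in the sketch below, using librosa; the FFT size and the log compression are our assumptions, since only the sampling rate, window size, and frameshift are specified above.

import librosa
import numpy as np

def extract_mel(wav_path, sr=16000, n_mels=80, win_ms=50.0, hop_ms=12.5, n_fft=1024):
    """80-dim log-Mel-spectrogram at 16 kHz with a 50 ms window and a 12.5 ms frameshift."""
    y, _ = librosa.load(wav_path, sr=sr)                 # load and resample to 16 kHz
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=n_fft,
        win_length=int(sr * win_ms / 1000),              # 800 samples
        hop_length=int(sr * hop_ms / 1000),              # 200 samples
        n_mels=n_mels,
    )
    return np.log(np.maximum(mel, 1e-10)).T              # (frames, 80)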
We implemented 5 different systems using different techniques for comparison. Table I gives an overview of the parallel data requirements of all conversion systems. We note that our proposed TTS-AC does not require parallel data during either the training or the conversion process.

(a) parallel-VC: This is a sequence-to-sequence voice conversion (VC) system taking parallel speech data between two speakers with different accents. We selected speakers bdl (male) and slt (female) from the CMU ARCTIC database [28] as the targets; both have the American accent. We had 4 source-to-target conversion speaker pairs: ASI-to-bdl, TNI-to-slt, TXHC-to-bdl, and LXC-to-slt. It converted the 80-dimensional Mel-spectrogram from the source-accented to the target-accented speech. The model followed the same implementation as the speech-embedding-to-Mel-spectrogram synthesizer in [3]; the only difference was the dimension of the input layer. In this system, the converted speech does not preserve the speaker identity.

(b) BNF-AC: This is an AC system that takes PPGs as input and generates acoustic features [13]. Four speaker-dependent models were trained with the source-accented speech of the selected speakers [13]. This system needs reference speech from speakers with the target accents for conversion. So, we used the same speech data from speakers bdl and slt, and there were 4 target-for-source speaker pairs: bdl-for-ASI, slt-for-TNI, bdl-for-TXHC, and slt-for-LXC. The 256-dimensional bottleneck features [3] extracted by a pretrained acoustic model were used as the PPG inputs. The bottleneck details can be found in [29]. The output was the 80-dimensional Mel-spectrogram. The model configurations were the same as those of the speech synthesizer in [3].

(c) BNF-PC-AC: This is an extension of system (b) with an additional pronunciation correction model. It first created target-accented speech in the desired speaker's voice using parallel speech data. Then, a pronunciation correction model was implemented to convert the acoustic features of source-accented speech to the synthesized target-accented speech [3]. The pronunciation correction model used the same network configurations as the VC model in system (a). This system was speaker dependent, so we trained 4 models.

(d) TTS: This is a multi-speaker TTS system, which is the pretrained TTS described in Section II-A. It used speech data of American, British, and Canadian speakers from the VCTK corpus [27]. The selected training data included 63 speakers and approximately 20.5 hours. It adopted the Tacotron2 network architecture [17], and the 256-dimensional speaker embedding [30] was added to the speech decoder [23]. This serves as the upper bound of our proposed TTS-AC system.

(e) TTS-AC: This is the proposed AC system, as described in Section II. The TTS pretraining was the same as in system (d). The speech encoder was speaker specific. It consisted of 2 pyramid BLSTM layers [31] of 256 units in each direction, 1 LSTM layer of 512 units with location-aware attention, and a final softmax layer of 512 units. $w_C$ and $w_{TC}$ in the loss function (1) were set to 30 and 1, respectively. At run-time, it adopted beam search with a window size of 10. A sketch of this encoder backbone is given after this list.
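A minimal PyTorch sketch of the pyramid BLSTM backbone of this speech encoder follows; the location-aware attention decoder and the final softmax layer over phonemes are omitted, and the frame-pair reduction scheme is an assumption in the spirit of [31].

import torch
import torch.nn as nn

class PyramidBLSTMEncoder(nn.Module):
    """Sketch of the BLSTM part of the speech encoder: 2 BLSTM layers with 256 units
    per direction and a pyramid time reduction (concatenating frame pairs) between
    them, in the spirit of [31]."""

    def __init__(self, n_mels=80, hidden=256):
        super().__init__()
        self.blstm1 = nn.LSTM(n_mels, hidden, bidirectional=True, batch_first=True)
        self.blstm2 = nn.LSTM(4 * hidden, hidden, bidirectional=True, batch_first=True)

    @staticmethod
    def _reduce(x):
        # Concatenate every two consecutive frames, halving the sequence length.
        b, t, d = x.shape
        if t % 2:
            x = x[:, :-1]
        return x.reshape(b, t // 2, 2 * d)

    def forward(self, mel):                      # mel: (batch, frames, 80)
        h, _ = self.blstm1(mel)                  # (batch, frames, 512)
        h, _ = self.blstm2(self._reduce(h))      # (batch, frames // 2, 512)
        return h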
B. Objective Evaluations

We report the word error rate (WER) and the intra-accent and inter-accent phoneme error rates (PER) as objective metrics.

1) Word Error Rate (WER): Speech was transcribed using the Google Cloud Speech-to-Text¹ service. English (United States) was chosen, so the selected ASR was trained with a large amount of American English speech data. The results are summarized in Table II. A smaller WER value stands for better performance.
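For reference, WER is the word-level edit distance between the ASR transcript and the ground-truth text, normalized by the number of reference words; a self-contained sketch is shown below (the exact scoring tool and text normalization used here are not specified in the paper).

def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)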
We note that the WER values of the source and target speech are 29.83% and 5.79%, respectively. These results show that ASR performance deteriorates under a mismatched accent. We then further compare the speech generated by the different systems. The WER results of systems (b) and (c) are similar (28.47% and 28.93%), which is also comparable with the previous study [3]. System (a), which uses parallel data, achieves a better result of 18.14%. The proposed TTS-AC system obtains a low WER value of 15.75%, which is similar to that of the TTS system. It indicates that the proposed TTS-guided training greatly retains the capability of the pretrained TTS system in

¹ [Online]. Available: https://cloud.google.com/speech-to-text, accessed on 20/10/2022.
Fig. 2. MUSHRA test results for speech quality. source and target refer to the natural speech with the source and target accents, respectively, which are used as references.

Fig. 3. Accentedness test results with 95% confidence intervals. A lower value means that the speech sounds more similar to the target accent, and vice versa. The source and target are natural speech used as references.
REFERENCES

[1] D. Felps, H. Bortfeld, and R. Gutierrez-Osuna, “Foreign accent conversion in computer assisted pronunciation training,” Speech Commun., vol. 51, no. 10, pp. 920–932, 2009.
[2] S. Aryal and R. Gutierrez-Osuna, “Can voice conversion be used to reduce non-native accents?,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2014, pp. 7879–7883.
[3] G. Zhao, S. Ding, and R. Gutierrez-Osuna, “Converting foreign accent speech without a reference,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 2367–2381, 2021.
[4] S. Ding, G. Zhao, and R. Gutierrez-Osuna, “Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning,” Comput. Speech Lang., vol. 72, 2022, Art. no. 101302.
[5] T. Scovel, A Time to Speak: A Psycholinguistic Inquiry Into the Critical Period for Human Speech. Belmont, CA, USA: Wadsworth, 1988.
[6] D. Felps, C. Geng, and R. Gutierrez-Osuna, “Foreign accent conversion through concatenative synthesis in the articulatory domain,” IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 8, pp. 2301–2312, Oct. 2012.
[7] M. Huckvale and K. Yanagisawa, “Spoken language conversion with accent morphing,” in Proc. 6th ISCA Speech Synth. Workshop, 2007, pp. 64–70.
[8] D. Felps and R. Gutierrez-Osuna, “Developing objective measures of foreign-accent conversion,” IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 5, pp. 1030–1040, Jul. 2010.
[9] S. Aryal, D. Felps, and R. Gutierrez-Osuna, “Foreign accent conversion through voice morphing,” in Proc. Interspeech, 2013, pp. 3077–3081.
[10] S. Aryal and R. Gutierrez-Osuna, “Reduction of non-native accents through statistical parametric articulatory synthesis,” J. Acoust. Soc. Amer., vol. 137, no. 1, pp. 433–446, 2015.
[11] S. Aryal and R. Gutierrez-Osuna, “Articulatory-based conversion of foreign accents with deep neural networks,” in Proc. Interspeech, 2015.
[12] W. Quamer, A. Das, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, “Zero-shot foreign accent conversion without a native reference,” in Proc. Interspeech, 2022, pp. 4920–4924.
[13] G. Zhao, S. Sonsaat, J. Levis, E. Chukharev-Hudilainen, and R. Gutierrez-Osuna, “Accent conversion using phonetic posteriorgrams,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 5314–5318.
[14] W. Li et al., “Improving accent conversion with reference encoder and end-to-end text-to-speech,” 2020, arXiv:2005.09271.
[15] S. Liu et al., “End-to-end accent conversion without using native utterances,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 6289–6293.
[16] W. Ping et al., “Deep Voice 3: 2000-speaker neural text-to-speech,” in Proc. Int. Conf. Learn. Representations, 2018, pp. 214–217.
[17] J. Shen et al., “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 4779–4783.
[18] Y. Ren et al., “FastSpeech 2: Fast and high-quality end-to-end text to speech,” in Proc. Int. Conf. Learn. Representations, 2020.
[19] W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” in Proc. Interspeech, 2020, pp. 4676–4680.
[20] M. Zhang, X. Wang, F. Fang, H. Li, and J. Yamagishi, “Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet,” in Proc. Interspeech, 2019, pp. 1298–1302.
[21] S.-W. Park, D.-Y. Kim, and M.-C. Joe, “Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data,” in Proc. Interspeech, 2020, pp. 4696–4700.
[22] W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, “Pretraining techniques for sequence-to-sequence voice conversion,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 745–755, 2021.
[23] M. Zhang, Y. Zhou, L. Zhao, and H. Li, “Transfer learning from speech synthesis to voice conversion with non-parallel training data,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 1290–1302, 2021.
[24] J.-X. Zhang, Z.-H. Ling, and L.-R. Dai, “Non-parallel sequence-to-sequence voice conversion with disentangled linguistic and speaker representations,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 540–552, 2020.
[25] G. Zhao et al., “L2-ARCTIC: A non-native English speech corpus,” in Proc. Interspeech, 2018, pp. 2783–2787.
[26] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2020, pp. 6199–6203.
[27] C. Veaux et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit,” The Centre for Speech Technol. Res., Univ. of Edinburgh, Edinburgh, U.K., 2017.
[28] J. Kominek and A. W. Black, “The CMU ARCTIC speech databases,” in Proc. 5th ISCA Speech Synth. Workshop, 2004, pp. 223–224.
[29] Y. Zhou, X. Tian, and H. Li, “Language agnostic speaker embedding for cross-lingual personalized speech generation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3427–3439, 2021.
[30] L. Wan, Q. Wang, A. Papir, and I. L. Moreno, “Generalized end-to-end loss for speaker verification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 4879–4883.
[31] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2016, pp. 4960–4964.
[32] Method for the Subjective Assessment of Intermediate Sound Quality, Rec. ITU-R BS.1534-1, International Telecommunication Union, Geneva, Switzerland, 2003.
[33] M. J. Munro and T. M. Derwing, “Foreign accent, comprehensibility, and intelligibility in the speech of second language learners,” Lang. Learn., vol. 45, no. 1, pp. 73–97, 1995.
[34] J. Lorenzo-Trueba et al., “The voice conversion challenge 2018: Promoting development of parallel and nonparallel methods,” in Proc. Interspeech, 2018, pp. 195–202.
[35] Y. Zhao et al., “Voice conversion challenge 2020: Intra-lingual semi-parallel and cross-lingual voice conversion,” in Proc. Joint Workshop Blizzard Challenge Voice Convers. Challenge, 2020, pp. 80–98.