Zero-Shot Text-to-Speech for Text-Based Insertion in Audio Narration

Tang, Chuanxin; Luo, Chong; Zhao, Zhiyuan; Yin, Dacheng; Zhao, Yucheng; Zeng, Wenjun

arXiv:2109.05426 (cs)

[Submitted on 12 Sep 2021]

Abstract: Given a piece of speech and its transcript text, text-based speech editing aims to generate speech that can be seamlessly inserted into the given speech by editing the transcript. Existing methods adopt a two-stage approach: synthesize the input text using a generic text-to-speech (TTS) engine and then transform the voice to the desired voice using voice conversion (VC). A major problem of this framework is that VC is a challenging problem which usually needs a moderate amount of parallel training data to work satisfactorily. In this paper, we propose a one-stage context-aware framework to generate natural and coherent target speech without any training data of the target speaker. In particular, we manage to perform accurate zero-shot duration prediction for the inserted text. The predicted duration is used to regulate both text embedding and speech embedding. Then, based on the aligned cross-modality input, we directly generate the mel-spectrogram of the edited speech with a transformer-based decoder. Subjective listening tests show that despite the lack of training data for the speaker, our method has achieved satisfactory results. It outperforms a recent zero-shot TTS engine by a large margin.
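
The duration-based regulation mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `length_regulate` and `pad_speech_frames`, the zero-valued placeholder for frames that do not exist yet, and the toy one-dimensional vectors are all assumptions made for illustration. The sketch only shows how per-phoneme durations (in frames) could bring a text sequence and a speech sequence to the same length, so that the two can be paired frame by frame.

```python
from typing import List, Sequence

Vector = List[float]


def length_regulate(phoneme_embeddings: Sequence[Vector],
                    durations: Sequence[int]) -> List[Vector]:
    """Repeat each phoneme embedding once per frame of its duration."""
    if len(phoneme_embeddings) != len(durations):
        raise ValueError("one duration per phoneme is required")
    frames: List[Vector] = []
    for emb, dur in zip(phoneme_embeddings, durations):
        if dur < 0:
            raise ValueError("durations must be non-negative")
        frames.extend(list(emb) for _ in range(dur))
    return frames


def pad_speech_frames(speech_frames: Sequence[Vector],
                      durations: Sequence[int],
                      inserted: Sequence[bool],
                      mask_value: float = 0.0) -> List[Vector]:
    """Leave placeholder frames where inserted phonemes have no audio yet.

    Assumption: existing phonemes consume `dur` frames of the original
    speech in order; inserted phonemes get `dur` placeholder frames.
    """
    dim = len(speech_frames[0]) if speech_frames else 0
    out: List[Vector] = []
    cursor = 0
    for dur, is_new in zip(durations, inserted):
        if is_new:
            out.extend([mask_value] * dim for _ in range(dur))
        else:
            out.extend(list(f) for f in speech_frames[cursor:cursor + dur])
            cursor += dur
    return out


# Toy example: three phonemes, the middle one is newly inserted text.
text_emb = [[1.0], [2.0], [3.0]]
durations = [2, 1, 3]            # frames per phoneme (middle one is predicted)
inserted = [False, True, False]
speech = [[0.1], [0.2], [0.3], [0.4], [0.5]]  # frames of the original audio

text_frames = length_regulate(text_emb, durations)
speech_frames = pad_speech_frames(speech, durations, inserted)
aligned = list(zip(text_frames, speech_frames))  # frame-by-frame pairs
```

In this toy setup both sequences end up six frames long, and the single placeholder frame marks the position where a decoder would have to generate new audio.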

Comments: Published in Interspeech'21
Subjects: Sound (cs.SD); Artificial Intelligence (cs.AI); Audio and Speech Processing (eess.AS)
Cite as: arXiv:2109.05426 [cs.SD]
(or arXiv:2109.05426v1 [cs.SD] for this version)
https://doi.org/10.48550/arXiv.2109.05426

Submission history

From: Chong Luo
[v1] Sun, 12 Sep 2021 04:17:53 UTC (273 KB)
