Neural Speech Synthesis with Transformer Network

Li, Naihan; Liu, Shujie; Liu, Yanqing; Zhao, Sheng; Liu, Ming; Zhou, Ming

arXiv:1809.08895 (cs)

[Submitted on 19 Sep 2018 (v1), last revised 30 Jan 2019 (this version, v3)]

Abstract: Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 have been proposed and achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace both the RNN structures and the original attention mechanism in Tacotron2. With multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different time steps are connected directly by the self-attention mechanism, which effectively solves the long-range dependency problem. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, which a WaveNet vocoder then converts into the final audio. Experiments are conducted to test the efficiency and performance of our new network. For efficiency, our Transformer TTS network trains about 4.25 times faster than Tacotron2. For performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforming Tacotron2 by a gap of 0.048) and is very close to human quality (4.39 vs. 4.44 in MOS).
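The abstract's central mechanism, multi-head self-attention, can be illustrated with a minimal sketch. This is a generic scaled dot-product formulation in plain Python, not the authors' implementation: the function names, the toy dimensions, and the random weights are all assumptions made for illustration, and the sketch omits positional encodings, masking, and every TTS-specific component. It only shows why all positions can be computed without a recurrent loop and why any two time steps are connected directly through one attention weight.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Generic multi-head self-attention (illustrative, hypothetical helper).

    x is (seq_len x d_model); each weight matrix is (d_model x d_model).
    Returns the output sequence and the per-head attention weights.
    """
    d_model = len(x[0])
    d_head = d_model // num_heads
    # Projections for all positions at once: no step depends on the previous one.
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)

    head_outputs, head_weights = [], []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        q_h = [row[lo:hi] for row in q]
        k_h = [row[lo:hi] for row in k]
        v_h = [row[lo:hi] for row in v]
        k_t = [list(col) for col in zip(*k_h)]
        # scores[i][j] links position i directly to position j, however far apart.
        scores = matmul(q_h, k_t)
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v_h))
        head_weights.append(weights)

    # Concatenate the heads for each position, then apply the output projection.
    merged = [sum((head[i] for head in head_outputs), []) for i in range(len(x))]
    return matmul(merged, w_o), head_weights


def rand_matrix(rows, cols):
    """Small random matrix; stands in for learned parameters (assumption)."""
    return [[random.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


random.seed(0)
seq_len, d_model, num_heads = 5, 8, 2  # toy sizes, not the paper's settings
x = rand_matrix(seq_len, d_model)
w_q, w_k, w_v, w_o = (rand_matrix(d_model, d_model) for _ in range(4))
out, attn = multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads)
print(len(out), len(out[0]))                      # 5 8
print(len(attn), len(attn[0]), len(attn[0][0]))   # 2 5 5
```

Each row of an attention matrix sums to 1, so every output position is a weighted mix of all input positions. A recurrent network would instead have to pass that information through every intermediate step.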

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:1809.08895 [cs.CL]
	(or arXiv:1809.08895v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.1809.08895

Submission history

From: Naihan Li
[v1] Wed, 19 Sep 2018 07:41:17 UTC (3,333 KB)
[v2] Tue, 13 Nov 2018 08:57:52 UTC (3,333 KB)
[v3] Wed, 30 Jan 2019 12:40:57 UTC (3,333 KB)
