Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, Rif A. Saurous
Proceedings of the 35th International Conference on Machine Learning, PMLR 80:5180-5189, 2018.

Abstract

In this work, we propose “global style tokens” (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable “labels” they generate can be used to control synthesis in novel ways, such as varying speed and speaking style – independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

Cite this Paper


BibTeX
@InProceedings{pmlr-v80-wang18h,
  title     = {Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis},
  author    = {Wang, Yuxuan and Stanton, Daisy and Zhang, Yu and Skerry-Ryan, RJ and Battenberg, Eric and Shor, Joel and Xiao, Ying and Jia, Ye and Ren, Fei and Saurous, Rif A.},
  booktitle = {Proceedings of the 35th International Conference on Machine Learning},
  pages     = {5180--5189},
  year      = {2018},
  editor    = {Dy, Jennifer and Krause, Andreas},
  volume    = {80},
  series    = {Proceedings of Machine Learning Research},
  month     = {10--15 Jul},
  publisher = {PMLR},
  pdf       = {http://proceedings.mlr.press/v80/wang18h/wang18h.pdf},
  url       = {https://proceedings.mlr.press/v80/wang18h.html},
  abstract  = {In this work, we propose “global style tokens” (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable “labels” they generate can be used to control synthesis in novel ways, such as varying speed and speaking style – independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.}
}
Endnote
%0 Conference Paper
%T Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
%A Yuxuan Wang
%A Daisy Stanton
%A Yu Zhang
%A RJ Skerry-Ryan
%A Eric Battenberg
%A Joel Shor
%A Ying Xiao
%A Ye Jia
%A Fei Ren
%A Rif A. Saurous
%B Proceedings of the 35th International Conference on Machine Learning
%C Proceedings of Machine Learning Research
%D 2018
%E Jennifer Dy
%E Andreas Krause
%F pmlr-v80-wang18h
%I PMLR
%P 5180--5189
%U https://proceedings.mlr.press/v80/wang18h.html
%V 80
%X In this work, we propose “global style tokens” (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system. The embeddings are trained with no explicit labels, yet learn to model a large range of acoustic expressiveness. GSTs lead to a rich set of significant results. The soft interpretable “labels” they generate can be used to control synthesis in novel ways, such as varying speed and speaking style – independently of the text content. They can also be used for style transfer, replicating the speaking style of a single audio clip across an entire long-form text corpus. When trained on noisy, unlabeled found data, GSTs learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
APA
Wang, Y., Stanton, D., Zhang, Y., Skerry-Ryan, R., Battenberg, E., Shor, J., Xiao, Y., Jia, Y., Ren, F. & Saurous, R. A. (2018). Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis. Proceedings of the 35th International Conference on Machine Learning, in Proceedings of Machine Learning Research 80:5180-5189. Available from https://proceedings.mlr.press/v80/wang18h.html.

Related Material