Lecture 10 - Text To Speech

This document discusses generative text-to-speech (TTS) synthesis. It begins with an overview of generative acoustic models for parametric TTS, including hidden Markov models and neural networks. It then covers moving beyond parametric TTS through learned features, WaveNet, and end-to-end models. The document outlines typical TTS system flows and databases. It also introduces a probabilistic formulation of TTS as estimating a predictive distribution and introduces approximations for tractable inference.

Generative Model-Based

Text-to-Speech Synthesis
Andrew Senior (DeepMind London), with many thanks to Heiga Zen
February 23rd, 2017, Oxford
Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


Text-to-speech as sequence-to-sequence mapping

Automatic speech recognition (ASR): speech → “OK Google, directions home”

Text-to-speech synthesis (TTS): “Take the first left” → speech
Andrew Senior Generative Model-Based Text-to-Speech Synthesis 1 of 50


Speech production process

[Figure: text (concept) modulates a carrier wave; a sound source (voiced: pulse train at the fundamental frequency; unvoiced: noise), driven by air flow, is shaped by the frequency transfer characteristics of the vocal tract, which carry the speech information.]


Typical flow of TTS system

TEXT
  ↓ Text analysis (NLP frontend; discrete ⇒ discrete):
    sentence segmentation, word segmentation, text normalization,
    part-of-speech tagging, pronunciation, prosody prediction
  ↓ Speech synthesis (speech backend; discrete ⇒ continuous):
    waveform generation
SYNTHESIZED SPEECH


Speech synthesis approaches

• Rule-based, formant synthesis [1]
• Sample-based, concatenative synthesis [2]
• Model-based, generative synthesis

p(speech = [waveform] | text = ”Hello, my name is Heiga Zen.”)


Unit selection concatenative speech synthesis

• Build a database with wide linguistic diversity.


• Forced align and chop up into diphones.
• For a new utterance, choose units matching the diphone sequence.
• Minimize the total cost by greedy search.
• Cost = Σᵢ [U(i) + J(i, i − 1)], where U is the unit (target) cost and J the join cost.
• Splice together adjacent units, matching up the last pitch period.
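The cost minimization above can be sketched as a small dynamic program (the slide says greedy search; a Viterbi-style dynamic program, as used in [2], finds the exact minimum). The `unit_cost` and `join_cost` functions below are toy stand-ins, not real costs:

```python
def select_units(candidates, unit_cost, join_cost):
    """candidates[i]: list of candidate units for diphone position i."""
    # best[u] = minimal cost of any unit sequence ending in unit u
    best = {u: unit_cost(u) for u in candidates[0]}
    backptrs = []
    for position in candidates[1:]:
        new_best, ptr = {}, {}
        for u in position:
            prev = min(best, key=lambda p: best[p] + join_cost(p, u))
            new_best[u] = best[prev] + join_cost(prev, u) + unit_cost(u)
            ptr[u] = prev
        best = new_best
        backptrs.append(ptr)
    # backtrack from the cheapest final unit
    u = min(best, key=best.get)
    total, path = best[u], [u]
    for ptr in reversed(backptrs):
        u = ptr[u]
        path.append(u)
    return path[::-1], total

path, total = select_units(
    [[1, 2], [3, 4], [5]],            # toy candidate units per position
    lambda u: 0.0,                    # zero unit cost
    lambda p, u: abs(p - u),          # join cost = distance between units
)
```

With these toy costs the cheapest path is `[2, 3, 5]` with total cost 3.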



TTS databases

• Want high quality:


− Studio recording
− Controlled, consistent conditions
− No background noise
− Single (professional) speaker
• Typically read speech



TTS databases

• VCTK (Voice Cloning Tool Kit)

  − 109 native speakers of English, 400 sentences each; 96 kHz, 24-bit recordings
  − Intended for adaptation of an average voice.
• Google TTS: tens of hours
• Edinburgh Merlin system
  https://github.com/CSTR-Edinburgh/merlin



TTS performance metrics

• TTS performance is subjective.


• Intelligibility (in noise)
• Naturalness
− Mean Opinion Score (MOS; 5-point scale)
− A/B preference tests
− e.g., Amazon Mechanical Turk: 100 utterances, 5–7 ratings per sample
− Care needed to control for human factors.
• Objective measures
− PESQ
− Robust MOS




Probabilistic formulation of TTS

Random variables

X Speech waveforms (data) Observed


W Transcriptions (data) Observed
w Given text Observed
x Synthesized speech Unobserved

Synthesis

• Estimate the posterior predictive distribution p(x | w, X, W)
• Sample x̄ from the posterior distribution


Probabilistic formulation
Introduce auxiliary variables (representations) + factorize dependencies:

p(x | w, X, W) = ∫∫∫ Σ_{∀l} Σ_{∀L} p(x | o) p(o | l, λ) p(l | w)
                        × p(X | O) p(O | L, λ) p(λ) p(L | W) / p(X) do dO dλ

where

  O, o: acoustic features
  L, l: linguistic features
  λ: model


Approximation (1)

Approximate the sums & integrals by best point estimates (MAP-like) [3]:

p(x | w, X, W) ≈ p(x | ô)

where

{ô, l̂, Ô, L̂, λ̂} = arg max_{o, l, O, L, λ} p(x | o) p(o | l, λ) p(l | w)
                                          × p(X | O) p(O | L, λ) p(λ) p(L | W)


Approximation (2)
Joint → step-by-step maximization [3]

Ô = arg max_O p(X | O)              Extract acoustic features
L̂ = arg max_L p(L | W)              Extract linguistic features
λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)      Learn mapping
l̂ = arg max_l p(l | w)              Predict linguistic features
ô = arg max_o p(o | l̂, λ̂)           Predict acoustic features
x̄ ∼ fₓ(ô) = p(x | ô)                Synthesize waveform

Representations: acoustic, linguistic, mapping



Representation – Linguistic features

Hello, world. Sentence: length, ...

Hello, world. Phrase: intonation, ...

hello world Word: POS, grammar, ...

h-e2 l-ou1 w-er1-l-d Syllable: stress, tone, ...

h e l ou w er l d Phone: voicing, manner, ...

→ Based on knowledge about spoken language


• Lexicon, letter-to-sound rules
• Tokenizer, tagger, parser
• Phonology rules
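As a toy illustration (an assumption of this note, not from the slides), the layered analysis above can be flattened into a numeric linguistic feature vector per phone — one-hot phone identity plus a few context values:

```python
# Toy linguistic feature vector for one phone of "hello world":
# one-hot phone identity + syllable stress + word position features.
PHONES = ["h", "e", "l", "ou", "w", "er", "d"]  # phone set of the example

def linguistic_features(phone, stressed, word_index):
    onehot = [1.0 if p == phone else 0.0 for p in PHONES]
    return onehot + [float(stressed), float(word_index)]

# "ou" in "hello": stressed syllable, first word of the sentence
l_t = linguistic_features("ou", stressed=True, word_index=0)
```

Real systems use hundreds of such questions at every linguistic level (sentence, phrase, word, syllable, phone).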



Representation – Acoustic features

Duration model
• Typically run a parametric synthesizer on frames (e.g. 5ms windows)
• Need to know how many frames each phonetic unit lasts.
• Model this separately e.g. FFNN linguistic features → duration.




Representation – Acoustic features

Piece-wise stationary, source-filter generative model p(x | o)

[Figure: source-filter model — an excitation e(n) (pulse train at the fundamental frequency when voiced, white noise when unvoiced, mixed via aperiodicity) is shaped by a vocal-tract filter h(n) (cepstrum, LPC, ...), giving speech x(n) = h(n) ∗ e(n).]

→ Needs to solve inverse problem


• Estimate parameters from signals
• Use estimated parameters (e.g., cepstrum) as acoustic features
Representation – Mapping

Rule-based, formant synthesis [1]

Ô = arg max_O p(X | O)              Vocoder analysis
L̂ = arg max_L p(L | W)              Text analysis
λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)      Extract rules
l̂ = arg max_l p(l | w)              Text analysis
ô = arg max_o p(o | l̂, λ̂)           Apply rules
x̄ ∼ fₓ(ô) = p(x | ô)                Vocoder synthesis

→ Hand-crafted rules on knowledge-based features


Representation – Mapping

HMM-based [4], statistical parametric synthesis [5]

Ô = arg max_O p(X | O)              Vocoder analysis
L̂ = arg max_L p(L | W)              Text analysis
λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)      Train HMMs
l̂ = arg max_l p(l | w)              Text analysis
ô = arg max_o p(o | l̂, λ̂)           Parameter generation
x̄ ∼ fₓ(ô) = p(x | ô)                Vocoder synthesis

→ Replace rules with an HMM-based generative acoustic model


Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


HMM-based generative acoustic model for TTS

• Context-dependent subword HMMs


• Decision trees to cluster & tie HMM states → interpretable

[Figure: linguistic features l = l₁ … l_N drive a hidden state sequence q₁ q₂ q₃ q₄ … (discrete), which emits acoustic feature frames o₁ o₂ … o_T (continuous).]

p(o | l, λ) = Σ_{∀q} Π_{t=1}^{T} p(oₜ | qₜ, λ) P(q | l, λ)       qₜ: hidden state at time t
            = Σ_{∀q} Π_{t=1}^{T} N(oₜ; μ_{qₜ}, Σ_{qₜ}) P(q | l, λ)
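The sum over all state sequences is computed efficiently by the forward algorithm; a toy version with scalar Gaussian emissions (a sketch of the standard algorithm, not code from the lecture) might look like:

```python
import math

def gaussian(o, mu, var):
    """Scalar Gaussian density N(o; mu, var)."""
    return math.exp(-((o - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def forward_likelihood(obs, pi, A, means, variances):
    """p(o | l, lambda): sum over state sequences via the forward recursion."""
    n_states = len(pi)
    # alpha[j] = p(o_1..o_t, q_t = j)
    alpha = [pi[j] * gaussian(obs[0], means[j], variances[j]) for j in range(n_states)]
    for o_t in obs[1:]:
        alpha = [
            gaussian(o_t, means[j], variances[j])
            * sum(alpha[i] * A[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)

# Degenerate one-state HMM as a sanity check:
p = forward_likelihood([0.0, 0.0], pi=[1.0], A=[[1.0]], means=[0.0], variances=[1.0])
```

In the one-state case the result is just N(0; 0, 1)² = 1/(2π).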


HMM-based generative acoustic model for TTS

• Non-smooth, step-wise statistics


→ Smoothing is essential

• Difficult to use high-dimensional acoustic features (e.g., raw spectra)


→ Use low-dimensional features (e.g., cepstra)

• Data fragmentation
→ Ineffective, local representation

Much research has been done to address these issues.



Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


Alternative acoustic model

HMM: handles variable length & alignment
Decision tree: maps linguistic → acoustic

[Figure: a binary decision tree of yes/no questions on the linguistic features l, whose leaves store statistics of the acoustic features o.]

Regression tree: linguistic features → statistics of acoustic features

Replace the tree with a general-purpose regression model
→ Artificial neural network
FFNN-based acoustic model for TTS [6]

[Figure: a feed-forward network maps the frame-level linguistic feature lₜ through a hidden layer hₜ to the frame-level acoustic feature target oₜ.]

hₜ = g(W_hl lₜ + b_h)
ôₜ = W_oh hₜ + b_o
λ̂ = arg min_λ Σₜ ‖oₜ − ôₜ‖²,   λ = {W_hl, W_oh, b_h, b_o}

ôₜ ≈ E[oₜ | lₜ] → replaces the decision trees & Gaussian distributions
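A minimal pure-Python sketch of this forward pass (dimensions, tanh non-linearity, and random initialization are illustrative assumptions):

```python
import math
import random

random.seed(0)
DIM_L, DIM_H, DIM_O = 10, 8, 4   # linguistic / hidden / acoustic dimensions

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_hl, b_h = rand_matrix(DIM_H, DIM_L), [0.0] * DIM_H
W_oh, b_o = rand_matrix(DIM_O, DIM_H), [0.0] * DIM_O

def matvec(W, v, b):
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def predict_frame(l_t):
    h_t = [math.tanh(a) for a in matvec(W_hl, l_t, b_h)]   # h_t = g(W_hl l_t + b_h)
    return matvec(W_oh, h_t, b_o)                          # o_hat_t = W_oh h_t + b_o

o_hat = predict_frame([1.0] * DIM_L)   # one frame of linguistic features in, acoustic out
```

Training then minimizes Σₜ ‖oₜ − ôₜ‖² over the weights by gradient descent.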
RNN-based acoustic model for TTS [7]

[Figure: as above, with recurrent connections between the hidden layers of successive frames.]

hₜ = g(W_hl lₜ + W_hh hₜ₋₁ + b_h)
ôₜ = W_oh hₜ + b_o
λ̂ = arg min_λ Σₜ ‖oₜ − ôₜ‖²,   λ = {W_hl, W_hh, W_oh, b_h, b_o}

FFNN: ôₜ ≈ E[oₜ | lₜ]      RNN: ôₜ ≈ E[oₜ | l₁, …, lₜ]
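The recurrent version can be sketched the same way (again with illustrative dimensions and initialization); the hidden state now carries the frame history, which is what makes ôₜ depend on l₁, …, lₜ rather than lₜ alone:

```python
import math
import random

random.seed(1)
DIM_L, DIM_H, DIM_O = 6, 4, 3

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_hl = rand_matrix(DIM_H, DIM_L)
W_hh = rand_matrix(DIM_H, DIM_H)    # recurrent weights: the FFNN lacks these
W_oh = rand_matrix(DIM_O, DIM_H)
b_h, b_o = [0.0] * DIM_H, [0.0] * DIM_O

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def run_rnn(l_seq):
    h, outputs = [0.0] * DIM_H, []
    for l_t in l_seq:
        pre = [a + b + c for a, b, c in zip(matvec(W_hl, l_t), matvec(W_hh, h), b_h)]
        h = [math.tanh(a) for a in pre]                    # h_t depends on h_{t-1}
        outputs.append([a + b for a, b in zip(matvec(W_oh, h), b_o)])
    return outputs

o_hats = run_rnn([[1.0] * DIM_L for _ in range(5)])   # 5 frames in, 5 frames out
```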

NN-based generative acoustic model for TTS

• Non-smooth, step-wise statistics


→ RNN predicts smoothly varying acoustic features [7, 8]

• Difficult to use high-dimensional acoustic features (e.g., raw spectra)


→ Layered architecture can handle high-dimensional features [9]

• Data fragmentation
→ Distributed representation [10]

NN-based approach is now mainstream in research & products


• Models: FFNN [6], MDN [11], RNN [7], Highway network [12], GAN
[13]
• Products: e.g., Google [14]



NN-based generative model for TTS

[Figure: the speech production process shown earlier, annotated with the corresponding modelling stages.]

Text → Linguistic → (Articulatory) → Acoustic → Waveform


Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


Knowledge-based features → Learned features

Unsupervised feature learning

[Figure: an auto-encoder over raw FFT spectra x(t) yields learned acoustic features o(t); an embedding of 1-hot word representations w(n) yields learned linguistic features l(n).]

• Speech: auto-encoder on FFT spectra [9, 15] → positive results
• Text: word [16], phone & syllable [17] embeddings → less positive


Relax approximation
Joint acoustic feature extraction & model training

Two-step optimization → joint optimization:

Ô = arg max_O p(X | O)
λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)
        ⇓
{λ̂, Ô} = arg max_{λ, O} p(X | O) p(O | L̂, λ) p(λ)

Joint source-filter & acoustic model optimization
• HMM [18, 19, 20]
• NN [21, 22]


Relax approximation
Joint acoustic feature extraction & model training

Mixed-phase cepstral analysis + LSTM-RNN [22]

[Figure: a source-filter synthesis structure (pulse train excitation, filters G(z) and H_u⁻¹(z)) is unrolled so that derivatives of the waveform error propagate back through the cepstral analysis into the LSTM-RNN that maps linguistic features lₜ to cepstra.]


Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


Relax approximation
Direct mapping from linguistic to waveform

No explicit acoustic features:

{λ̂, Ô} = arg max_{λ, O} p(X | O) p(O | L̂, λ) p(λ)
        ⇓
λ̂ = arg max_λ p(X | L̂, λ) p(λ)

Generative models for raw audio
• LPC [23]
• WaveNet [24]
• SampleRNN [25]


WaveNet: A generative model for raw audio

Autoregressive (AR) modelling of speech signals:

x = {x₀, x₁, …, x_{N−1}} : raw waveform

p(x | λ) = p(x₀, x₁, …, x_{N−1} | λ) = Π_{n=0}^{N−1} p(xₙ | x₀, …, xₙ₋₁, λ)

WaveNet [24]
→ p(xₙ | x₀, …, xₙ₋₁, λ) is modelled by a convolutional NN

Key components
• Causal dilated convolution: captures long-term dependencies
• Gated convolution + residual + skip connections: powerful non-linearity
• Softmax at the output: classification rather than regression
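The AR factorization implies a sample-by-sample generation loop. A generic sketch (the real WaveNet replaces `predict_probs` with the convolutional network; the uniform distribution here is a placeholder):

```python
import random

random.seed(0)
NUM_CLASSES = 256  # e.g. 8-bit quantized amplitude categories

def predict_probs(history):
    """Stand-in for the network: p(x_n | x_0..x_{n-1}) as a categorical."""
    return [1.0 / NUM_CLASSES] * NUM_CLASSES   # placeholder: uniform, ignores history

def sample_waveform(n_samples):
    x = []
    for _ in range(n_samples):
        probs = predict_probs(x)                              # condition on past samples
        x.append(random.choices(range(NUM_CLASSES), weights=probs)[0])
    return x

wave = sample_waveform(100)   # 100 quantized samples, drawn one at a time
```

This sequential dependence is why naive WaveNet sampling is slow: each sample requires a full forward pass conditioned on the previous ones.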


WaveNet – Causal dilated convolution

100 ms at 16 kHz sampling = 1,600 time steps
→ too long to be captured by a normal RNN/LSTM

Dilated convolution: exponentially increases the receptive field size with the number of layers.

[Figure: a stack of causal dilated convolutions (dilations 1, 2, 4, 8) maps inputs xₙ₋₁₆, …, xₙ₋₁ to p(xₙ | xₙ₋₁, …, xₙ₋₁₆).]
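The exponential growth of the receptive field can be checked in a few lines (kernel size 2, dilation doubling per layer, as in the figure):

```python
def receptive_field(num_layers, kernel_size=2):
    """Input samples visible to one output of a causal dilated conv stack."""
    field = 1
    for layer in range(num_layers):
        field += (kernel_size - 1) * 2 ** layer   # dilation doubles each layer
    return field
```

Four layers already see 16 past samples; ten layers see 1,024 — which is why a few stacked dilation blocks cover the 1,600-step context above.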


WaveNet – Non-linearity

[Figure: a stack of residual blocks, each a 2×1 dilated convolution → gated activation → 1×1 convolution with a residual connection; skip connections from all blocks are summed and passed through ReLU → 1×1 conv → ReLU → 1×1 conv → 256-way softmax to give p(xₙ | x₀, …, xₙ₋₁).]


WaveNet – Softmax

[Figure: the analog audio signal is sampled and quantized; each sample's amplitude becomes a category index (1–16 in the illustration), so the output is a categorical distribution — a histogram that can be unimodal, multimodal, skewed, ...]

Prof. D. Jurafsky – “Now TTS is the same problem as language modeling!”
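The 256-way categorical target comes from 8-bit quantization of the waveform; the WaveNet paper [24] uses μ-law companding for this, which can be sketched as:

```python
import math

MU = 255  # mu-law parameter for 256 quantization levels

def mu_law_encode(x):
    """Amplitude x in [-1, 1] -> integer category in [0, 255]."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return min(255, int((y + 1.0) / 2.0 * MU + 0.5))

def mu_law_decode(c):
    """Integer category in [0, 255] -> approximate amplitude in [-1, 1]."""
    y = 2.0 * c / MU - 1.0
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)
```

Companding spends more of the 256 levels on small amplitudes, where the ear is most sensitive, so 8 bits sound far better than linear 8-bit quantization.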


WaveNet – Conditional modelling

[Figure: linguistic features are embedded into a conditioning vector hₙ that is fed, together with past samples, into every residual block, giving p(xₙ | xₙ₋₁, …, xₙ₋₃₂, hₙ, λ).]


WaveNet vs conventional audio generative models

Assumptions in conventional audio generative models


[23, 26, 27, 22]
• Stationary process w/ a fixed-length analysis window
  → estimate the model within a 20–30 ms window with a 5–10 ms shift
• Linear, time-invariant filter within a frame
  → but the relationship between samples can be non-linear
• Gaussian process
  → assumes speech signals are normally distributed

WaveNet
• Sample-by-sample, non-linear, and can take additional inputs
• Arbitrarily shaped signal distributions

State-of-the-art subjective naturalness with WaveNet-based TTS [24]

[Figure: MOS bar chart comparing HMM, LSTM, concatenative, and WaveNet systems.]
Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


Relax approximation
Towards Bayesian end-to-end TTS

Integrated end-to-end:

L̂ = arg max_L p(L | W)
λ̂ = arg max_λ p(X | L̂, λ) p(λ)
        ⇓
λ̂ = arg max_λ p(X | W, λ) p(λ)

Text analysis is integrated into the model


Relax approximation
Towards Bayesian end-to-end TTS

Bayesian end-to-end:

λ̂ = arg max_λ p(X | W, λ) p(λ)
x̄ ∼ fₓ(w, λ̂) = p(x | w, λ̂)
        ⇓
x̄ ∼ fₓ(w, X, W) = p(x | w, X, W)
                 = ∫ p(x | w, λ) p(λ | X, W) dλ
                 ≈ (1/K) Σ_{k=1}^{K} p(x | w, λ̂ₖ)   ← ensemble

Marginalize over model parameters & architecture
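The ensemble approximation in the last line can be sketched directly: average the predictive distributions of K independently trained models (toy categorical distributions stand in for p(x | w, λ̂ₖ) here):

```python
def ensemble_predictive(per_model_probs):
    """per_model_probs: K lists over the same classes, each summing to 1.

    Returns the mixture (1/K) * sum_k p_k, itself a valid distribution.
    """
    k = len(per_model_probs)
    return [sum(col) / k for col in zip(*per_model_probs)]

# Three toy models' predictive distributions over two classes:
p = ensemble_predictive([[0.7, 0.3], [0.5, 0.5], [0.6, 0.4]])
```

Averaging distributions (not samples) keeps multimodality: if the models disagree, the ensemble predictive spreads mass across their modes.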


Generative model-based text-to-speech synthesis

• Bayes formulation + factorization + approximations

• Representation: acoustic features, linguistic features, mapping


− Mapping: Rules → HMM → NN
− Feature: Engineered → Unsupervised, learned

• Fewer approximations
− Joint training, direct waveform modelling
− Moving towards integrated & Bayesian end-to-end TTS

Naturalness: concatenative ≤ generative

Flexibility: concatenative ≪ generative (e.g., multiple speakers)




Beyond “text”-to-speech synthesis

TTS on conversational assistants

• Texts aren’t fully contained


• Need more context
− Location to resolve homographs
− User query to put right emphasis

We need representations that can organize the world's information and make it accessible & useful from TTS generative models



Beyond “generative” TTS

Generative model-based TTS


• Model represents process behind speech production
− Trained to minimize error against human-produced speech
− Learned model → speaker

• Speech is for communication


− Goal: maximize the amount of information to be received

Missing: the “listener”
→ include a “listener” in training / in the model itself?



Thanks!



References I

[1] D. Klatt.
Real-time speech synthesis by rule.
Journal of ASA, 68(S1):S18–S18, 1980.

[2] A. Hunt and A. Black.


Unit selection in a concatenative speech synthesis system using a large speech database.
In Proc. ICASSP, pages 373–376, 1996.
[3] K. Tokuda.
Speech synthesis as a statistical machine learning problem.
https://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf.
Invited talk given at ASRU 2011.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura.
Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis.
IEICE Trans. Inf. Syst., J83-D-II(11):2099–2107, 2000.
(in Japanese).

[5] H. Zen, K. Tokuda, and A. Black.


Statistical parametric speech synthesis.
Speech Commn., 51(11):1039–1064, 2009.

[6] H. Zen, A. Senior, and M. Schuster.


Statistical parametric speech synthesis using deep neural networks.
In Proc. ICASSP, pages 7962–7966, 2013.
[7] Y. Fan, Y. Qian, F.-L. Xie, and F. Soong.
TTS synthesis with bidirectional LSTM based recurrent neural networks.
In Proc. Interspeech, pages 1964–1968, 2014.



References II

[8] H. Zen.
Acoustic modeling for speech synthesis: from HMM to RNN.
http://research.google.com/pubs/pub44630.html.
Invited talk given at ASRU 2015.
[9] S. Takaki and J. Yamagishi.
A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric
speech synthesis.
In Proc. ICASSP, pages 5535–5539, 2016.
[10] G. Hinton, J. McClelland, and D. Rumelhart.
Distributed representation.
In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel distributed processing: Explorations in
the microstructure of cognition. MIT Press, 1986.
[11] H. Zen and A. Senior.
Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis.
In Proc. ICASSP, pages 3872–3876, 2014.
[12] X. Wang, S. Takaki, and J. Yamagishi.
Investigating very deep highway networks for parametric speech synthesis.
In Proc. ISCA SSW9, 2016.
[13] Y. Saito, S. Takamichi, and H. Saruwatari.
Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis.
In Proc. ICASSP, 2017.
[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak.
Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices.
In Proc. Interspeech, 2016.



References III
[15] P. Muthukumar and A. Black.
A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis.
arXiv:1409.8558, 2014.
[16] P. Wang, Y. Qian, F. Soong, L. He, and H. Zhao.
Word embedding for recurrent neural network based TTS synthesis.
In Proc. ICASSP, pages 4879–4883, 2015.
[17] X. Wang, S. Takaki, and J. Yamagishi.
Investigation of using continuous representation of various linguistic units in neural network-based text-to-speech
synthesis.
IEICE Trans. Inf. Syst., E90-D(12):2471–2480, 2016.

[18] T. Toda and K. Tokuda.


Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM.
In Proc. ICASSP, pages 3925–3928, 2008.
[19] Y.-J. Wu and K. Tokuda.
Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis.
In Proc. Interspeech, pages 577–580, 2008.
[20] R. Maia, H. Zen, and M. Gales.
Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters.
In Proc. ISCA SSW7, pages 88–93, 2010.
[21] K. Tokuda and H. Zen.
Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis.
In Proc. ICASSP, pages 4215–4219, 2015.
[22] K. Tokuda and H. Zen.
Directly modeling voiced and unvoiced components in speech waveforms by neural networks.
In Proc. ICASSP, pages 5640–5644, 2016.



References IV

[23] F. Itakura and S. Saito.


A statistical method for estimation of speech spectral density and formant frequencies.
Trans. IEICE, J53A:35–42, 1970.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and
K. Kavukcuoglu.
WaveNet: A generative model for raw audio.
arXiv:1609.03499, 2016.
[25] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio.
SampleRNN: An unconditional end-to-end neural audio generation model.
arXiv:1612.07837, 2016.
[26] S. Imai and C. Furuichi.
Unbiased estimation of log spectrum.
In Proc. EURASIP, pages 203–206, 1988.
[27] H. Kameoka, Y. Ohishi, D. Mochihashi, and J. Le Roux.
Speech analysis with multi-kernel linear prediction.
In Proc. Spring Conference of ASJ, pages 499–502, 2010.
(in Japanese).



[Figure: graphical models for the successive approximations — (1) Bayesian; (2) auxiliary variables + factorization; (3) joint maximization; (4) step-by-step maximization (e.g., statistical parametric TTS); (5) joint acoustic feature extraction + model training; (6) conditional WaveNet-based TTS; (7) integrated end-to-end; (8) Bayesian end-to-end.]
