Lecture 10 - Text To Speech

This document discusses generative text-to-speech (TTS) synthesis. It begins with an overview of generative acoustic models for parametric TTS, including hidden Markov models and neural networks. It then covers moving beyond parametric TTS through learned features, WaveNet, and end-to-end models. The document outlines typical TTS system flows and databases. It also introduces a probabilistic formulation of TTS as estimating a predictive distribution and introduces approximations for tractable inference.

Generative Model-Based

Text-to-Speech Synthesis
Andrew Senior (DeepMind London), with many thanks to Heiga Zen
February 23rd, 2017, Oxford
Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


Text-to-speech as sequence-to-sequence mapping

Automatic speech recognition (ASR): speech → “OK Google, directions home”

Text-to-speech synthesis (TTS): “Take the first left” → speech
Andrew Senior Generative Model-Based Text-to-Speech Synthesis 1 of 50


Speech production process

[Figure: text (concept) modulates a carrier wave; a sound source (voiced: pulse train at the fundamental frequency; unvoiced: noise), driven by air flow, is shaped by the frequency transfer characteristics of the vocal tract, which carry the speech information.]


Typical flow of TTS system

TEXT
  ↓ Text analysis (NLP frontend; discrete ⇒ discrete):
    sentence segmentation, word segmentation, text normalization,
    part-of-speech tagging, pronunciation, prosody prediction
  ↓ Speech synthesis (speech backend; discrete ⇒ continuous):
    waveform generation
SYNTHESIZED SPEECH


Speech synthesis approaches

• Rule-based, formant synthesis [1]
• Sample-based, concatenative synthesis [2]
• Model-based, generative synthesis

p(speech = [waveform] | text = ”Hello, my name is Heiga Zen.”)


Unit selection concatenative speech synthesis

• Build a database with wide linguistic diversity.


• Forced align and chop up into diphones.
• For a new utterance, choose units matching the diphone sequence.
• Minimize the total cost by greedy search.
• Cost = Σᵢ [U(i) + J(i, i − 1)], where U is the unit (target) cost and J the join cost.
• Splice together adjacent units, matching up the last pitch period.
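The cost minimization above can be sketched as a small dynamic program (the slide says greedy search; a Viterbi-style dynamic program, as used in [2], finds the exact minimum). The `unit_cost` and `join_cost` functions below are toy stand-ins, not real costs:

```python
def select_units(candidates, unit_cost, join_cost):
    """candidates[i]: list of candidate units for diphone position i."""
    # best[u] = minimal cost of any unit sequence ending in unit u
    best = {u: unit_cost(u) for u in candidates[0]}
    backptrs = []
    for position in candidates[1:]:
        new_best, ptr = {}, {}
        for u in position:
            prev = min(best, key=lambda p: best[p] + join_cost(p, u))
            new_best[u] = best[prev] + join_cost(prev, u) + unit_cost(u)
            ptr[u] = prev
        best = new_best
        backptrs.append(ptr)
    # backtrack from the cheapest final unit
    u = min(best, key=best.get)
    total, path = best[u], [u]
    for ptr in reversed(backptrs):
        u = ptr[u]
        path.append(u)
    return path[::-1], total

path, total = select_units(
    [[1, 2], [3, 4], [5]],            # toy candidate units per position
    lambda u: 0.0,                    # zero unit cost
    lambda p, u: abs(p - u),          # join cost = distance between units
)
```

With these toy costs the cheapest path is `[2, 3, 5]` with total cost 3.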



TTS databases

• Want high quality:


− Studio recording
− Controlled, consistent conditions
− No background noise
− Single (professional) speaker
• Typically read speech



TTS databases

• VCTK (Voice Cloning Tool Kit)

  − 109 native speakers of English, 400 sentences each; 96 kHz, 24-bit recordings
  − Intended for adaptation of an average voice.
• Google TTS: tens of hours
• Edinburgh Merlin system
  https://github.com/CSTR-Edinburgh/merlin



TTS performance metrics

• TTS performance is subjective.


• Intelligibility (in noise)
• Naturalness
− Mean Opinion Score (MOS; 5-point scale)
− A/B preference tests
− e.g., Amazon Mechanical Turk: 100 utterances, 5–7 ratings per sample
− Care needed to control for human factors.
• Objective measures
− PESQ
− Robust MOS




Probabilistic formulation of TTS

Random variables

X Speech waveforms (data) Observed


W Transcriptions (data) Observed
w Given text Observed
x Synthesized speech Unobserved

Synthesis

• Estimate the posterior predictive distribution p(x | w, X, W)
• Sample x̄ from the posterior distribution


Probabilistic formulation
Introduce auxiliary variables (representations) + factorize dependencies:

p(x | w, X, W) = ∫∫∫ Σ_{∀l} Σ_{∀L} p(x | o) p(o | l, λ) p(l | w)
                        × p(X | O) p(O | L, λ) p(λ) p(L | W) / p(X) do dO dλ

where

  O, o: acoustic features
  L, l: linguistic features
  λ: model


Approximation (1)

Approximate the sums & integrals by best point estimates (MAP-like) [3]:

p(x | w, X, W) ≈ p(x | ô)

where

{ô, l̂, Ô, L̂, λ̂} = arg max_{o, l, O, L, λ} p(x | o) p(o | l, λ) p(l | w)
                                          × p(X | O) p(O | L, λ) p(λ) p(L | W)


Approximation (2)
Joint → step-by-step maximization [3]

Ô = arg max_O p(X | O)              Extract acoustic features
L̂ = arg max_L p(L | W)              Extract linguistic features
λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)      Learn mapping
l̂ = arg max_l p(l | w)              Predict linguistic features
ô = arg max_o p(o | l̂, λ̂)           Predict acoustic features
x̄ ∼ fₓ(ô) = p(x | ô)                Synthesize waveform

Representations: acoustic, linguistic, mapping



Representation – Linguistic features

Hello, world. Sentence: length, ...

Hello, world. Phrase: intonation, ...

hello world Word: POS, grammar, ...

h-e2 l-ou1 w-er1-l-d Syllable: stress, tone, ...

h e l ou w er l d Phone: voicing, manner, ...

→ Based on knowledge about spoken language


• Lexicon, letter-to-sound rules
• Tokenizer, tagger, parser
• Phonology rules
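As a toy illustration (an assumption of this note, not from the slides), the layered analysis above can be flattened into a numeric linguistic feature vector per phone — one-hot phone identity plus a few context values:

```python
# Toy linguistic feature vector for one phone of "hello world":
# one-hot phone identity + syllable stress + word position features.
PHONES = ["h", "e", "l", "ou", "w", "er", "d"]  # phone set of the example

def linguistic_features(phone, stressed, word_index):
    onehot = [1.0 if p == phone else 0.0 for p in PHONES]
    return onehot + [float(stressed), float(word_index)]

# "ou" in "hello": stressed syllable, first word of the sentence
l_t = linguistic_features("ou", stressed=True, word_index=0)
```

Real systems use hundreds of such questions at every linguistic level (sentence, phrase, word, syllable, phone).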



Representation – Acoustic features

Duration model
• Typically run a parametric synthesizer on frames (e.g. 5ms windows)
• Need to know how many frames each phonetic unit lasts.
• Model this separately e.g. FFNN linguistic features → duration.




Representation – Acoustic features

Piece-wise stationary, source-filter generative model p(x | o)

[Figure: source-filter model — an excitation e(n) (pulse train at the fundamental frequency when voiced, white noise when unvoiced, mixed via aperiodicity) is shaped by a vocal-tract filter h(n) (cepstrum, LPC, ...), giving speech x(n) = h(n) ∗ e(n).]

→ Needs to solve inverse problem


• Estimate parameters from signals
• Use estimated parameters (e.g., cepstrum) as acoustic features
Representation – Mapping

Rule-based, formant synthesis [1]

Ô = arg max_O p(X | O)              Vocoder analysis
L̂ = arg max_L p(L | W)              Text analysis
λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)      Extract rules
l̂ = arg max_l p(l | w)              Text analysis
ô = arg max_o p(o | l̂, λ̂)           Apply rules
x̄ ∼ fₓ(ô) = p(x | ô)                Vocoder synthesis

→ Hand-crafted rules on knowledge-based features


Representation – Mapping

HMM-based [4], statistical parametric synthesis [5]

Ô = arg max_O p(X | O)              Vocoder analysis
L̂ = arg max_L p(L | W)              Text analysis
λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)      Train HMMs
l̂ = arg max_l p(l | w)              Text analysis
ô = arg max_o p(o | l̂, λ̂)           Parameter generation
x̄ ∼ fₓ(ô) = p(x | ô)                Vocoder synthesis

→ Replace rules with an HMM-based generative acoustic model


Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


HMM-based generative acoustic model for TTS

• Context-dependent subword HMMs


• Decision trees to cluster & tie HMM states → interpretable

[Figure: linguistic features l = l₁ … l_N drive a hidden state sequence q₁ q₂ q₃ q₄ … (discrete), which emits acoustic feature frames o₁ o₂ … o_T (continuous).]

p(o | l, λ) = Σ_{∀q} Π_{t=1}^{T} p(oₜ | qₜ, λ) P(q | l, λ)       qₜ: hidden state at time t
            = Σ_{∀q} Π_{t=1}^{T} N(oₜ; μ_{qₜ}, Σ_{qₜ}) P(q | l, λ)
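The sum over all state sequences is computed efficiently by the forward algorithm; a toy version with scalar Gaussian emissions (a sketch of the standard algorithm, not code from the lecture) might look like:

```python
import math

def gaussian(o, mu, var):
    """Scalar Gaussian density N(o; mu, var)."""
    return math.exp(-((o - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)

def forward_likelihood(obs, pi, A, means, variances):
    """p(o | l, lambda): sum over state sequences via the forward recursion."""
    n_states = len(pi)
    # alpha[j] = p(o_1..o_t, q_t = j)
    alpha = [pi[j] * gaussian(obs[0], means[j], variances[j]) for j in range(n_states)]
    for o_t in obs[1:]:
        alpha = [
            gaussian(o_t, means[j], variances[j])
            * sum(alpha[i] * A[i][j] for i in range(n_states))
            for j in range(n_states)
        ]
    return sum(alpha)

# Degenerate one-state HMM as a sanity check:
p = forward_likelihood([0.0, 0.0], pi=[1.0], A=[[1.0]], means=[0.0], variances=[1.0])
```

In the one-state case the result is just N(0; 0, 1)² = 1/(2π).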


HMM-based generative acoustic model for TTS

• Non-smooth, step-wise statistics


→ Smoothing is essential

• Difficult to use high-dimensional acoustic features (e.g., raw spectra)


→ Use low-dimensional features (e.g., cepstra)

• Data fragmentation
→ Ineffective, local representation

Much research has been done to address these issues.



Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


Alternative acoustic model

HMM: handles variable length & alignment
Decision tree: maps linguistic → acoustic

[Figure: a binary decision tree of yes/no questions on the linguistic features l, whose leaves store statistics of the acoustic features o.]

Regression tree: linguistic features → statistics of acoustic features

Replace the tree with a general-purpose regression model
→ Artificial neural network
FFNN-based acoustic model for TTS [6]

[Figure: a feed-forward network maps the frame-level linguistic feature lₜ through a hidden layer hₜ to the frame-level acoustic feature target oₜ.]

hₜ = g(W_hl lₜ + b_h)
ôₜ = W_oh hₜ + b_o
λ̂ = arg min_λ Σₜ ‖oₜ − ôₜ‖²,   λ = {W_hl, W_oh, b_h, b_o}

ôₜ ≈ E[oₜ | lₜ] → replaces the decision trees & Gaussian distributions
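A minimal pure-Python sketch of this forward pass (dimensions, tanh non-linearity, and random initialization are illustrative assumptions):

```python
import math
import random

random.seed(0)
DIM_L, DIM_H, DIM_O = 10, 8, 4   # linguistic / hidden / acoustic dimensions

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_hl, b_h = rand_matrix(DIM_H, DIM_L), [0.0] * DIM_H
W_oh, b_o = rand_matrix(DIM_O, DIM_H), [0.0] * DIM_O

def matvec(W, v, b):
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]

def predict_frame(l_t):
    h_t = [math.tanh(a) for a in matvec(W_hl, l_t, b_h)]   # h_t = g(W_hl l_t + b_h)
    return matvec(W_oh, h_t, b_o)                          # o_hat_t = W_oh h_t + b_o

o_hat = predict_frame([1.0] * DIM_L)   # one frame of linguistic features in, acoustic out
```

Training then minimizes Σₜ ‖oₜ − ôₜ‖² over the weights by gradient descent.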
RNN-based acoustic model for TTS [7]

[Figure: as above, with recurrent connections between the hidden layers of successive frames.]

hₜ = g(W_hl lₜ + W_hh hₜ₋₁ + b_h)
ôₜ = W_oh hₜ + b_o
λ̂ = arg min_λ Σₜ ‖oₜ − ôₜ‖²,   λ = {W_hl, W_hh, W_oh, b_h, b_o}

FFNN: ôₜ ≈ E[oₜ | lₜ]      RNN: ôₜ ≈ E[oₜ | l₁, …, lₜ]
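The recurrent version can be sketched the same way (again with illustrative dimensions and initialization); the hidden state now carries the frame history, which is what makes ôₜ depend on l₁, …, lₜ rather than lₜ alone:

```python
import math
import random

random.seed(1)
DIM_L, DIM_H, DIM_O = 6, 4, 3

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

W_hl = rand_matrix(DIM_H, DIM_L)
W_hh = rand_matrix(DIM_H, DIM_H)    # recurrent weights: the FFNN lacks these
W_oh = rand_matrix(DIM_O, DIM_H)
b_h, b_o = [0.0] * DIM_H, [0.0] * DIM_O

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def run_rnn(l_seq):
    h, outputs = [0.0] * DIM_H, []
    for l_t in l_seq:
        pre = [a + b + c for a, b, c in zip(matvec(W_hl, l_t), matvec(W_hh, h), b_h)]
        h = [math.tanh(a) for a in pre]                    # h_t depends on h_{t-1}
        outputs.append([a + b for a, b in zip(matvec(W_oh, h), b_o)])
    return outputs

o_hats = run_rnn([[1.0] * DIM_L for _ in range(5)])   # 5 frames in, 5 frames out
```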

NN-based generative acoustic model for TTS

• Non-smooth, step-wise statistics


→ RNN predicts smoothly varying acoustic features [7, 8]

• Difficult to use high-dimensional acoustic features (e.g., raw spectra)


→ Layered architecture can handle high-dimensional features [9]

• Data fragmentation
→ Distributed representation [10]

NN-based approach is now mainstream in research & products


• Models: FFNN [6], MDN [11], RNN [7], Highway network [12], GAN
[13]
• Products: e.g., Google [14]



NN-based generative model for TTS

[Figure: the speech production process shown earlier, annotated with the corresponding modelling stages.]

Text → Linguistic → (Articulatory) → Acoustic → Waveform


Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


Knowledge-based features → Learned features

Unsupervised feature learning

[Figure: an auto-encoder over raw FFT spectra x(t) yields learned acoustic features o(t); an embedding of 1-hot word representations w(n) yields learned linguistic features l(n).]

• Speech: auto-encoder on FFT spectra [9, 15] → positive results
• Text: word [16], phone & syllable [17] embeddings → less positive


Relax approximation
Joint acoustic feature extraction & model training

Two-step optimization → joint optimization:

Ô = arg max_O p(X | O)
λ̂ = arg max_λ p(Ô | L̂, λ) p(λ)
        ⇓
{λ̂, Ô} = arg max_{λ, O} p(X | O) p(O | L̂, λ) p(λ)

Joint source-filter & acoustic model optimization
• HMM [18, 19, 20]
• NN [21, 22]


Relax approximation
Joint acoustic feature extraction & model training

Mixed-phase cepstral analysis + LSTM-RNN [22]

[Figure: a source-filter synthesis structure (pulse train excitation, filters G(z) and H_u⁻¹(z)) is unrolled so that derivatives of the waveform error propagate back through the cepstral analysis into the LSTM-RNN that maps linguistic features lₜ to cepstra.]


Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


Relax approximation
Direct mapping from linguistic to waveform

No explicit acoustic features:

{λ̂, Ô} = arg max_{λ, O} p(X | O) p(O | L̂, λ) p(λ)
        ⇓
λ̂ = arg max_λ p(X | L̂, λ) p(λ)

Generative models for raw audio
• LPC [23]
• WaveNet [24]
• SampleRNN [25]


WaveNet: A generative model for raw audio

Autoregressive (AR) modelling of speech signals:

x = {x₀, x₁, …, x_{N−1}} : raw waveform

p(x | λ) = p(x₀, x₁, …, x_{N−1} | λ) = Π_{n=0}^{N−1} p(xₙ | x₀, …, xₙ₋₁, λ)

WaveNet [24]
→ p(xₙ | x₀, …, xₙ₋₁, λ) is modelled by a convolutional NN

Key components
• Causal dilated convolution: captures long-term dependencies
• Gated convolution + residual + skip connections: powerful non-linearity
• Softmax at the output: classification rather than regression
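The AR factorization implies a sample-by-sample generation loop. A generic sketch (the real WaveNet replaces `predict_probs` with the convolutional network; the uniform distribution here is a placeholder):

```python
import random

random.seed(0)
NUM_CLASSES = 256  # e.g. 8-bit quantized amplitude categories

def predict_probs(history):
    """Stand-in for the network: p(x_n | x_0..x_{n-1}) as a categorical."""
    return [1.0 / NUM_CLASSES] * NUM_CLASSES   # placeholder: uniform, ignores history

def sample_waveform(n_samples):
    x = []
    for _ in range(n_samples):
        probs = predict_probs(x)                              # condition on past samples
        x.append(random.choices(range(NUM_CLASSES), weights=probs)[0])
    return x

wave = sample_waveform(100)   # 100 quantized samples, drawn one at a time
```

This sequential dependence is why naive WaveNet sampling is slow: each sample requires a full forward pass conditioned on the previous ones.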


WaveNet – Causal dilated convolution

100 ms at 16 kHz sampling = 1,600 time steps
→ too long to be captured by a normal RNN/LSTM

Dilated convolution: exponentially increases the receptive field size with the number of layers.

[Figure: a stack of causal dilated convolutions (dilations 1, 2, 4, 8) maps inputs xₙ₋₁₆, …, xₙ₋₁ to p(xₙ | xₙ₋₁, …, xₙ₋₁₆).]
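The exponential growth of the receptive field can be checked in a few lines (kernel size 2, dilation doubling per layer, as in the figure):

```python
def receptive_field(num_layers, kernel_size=2):
    """Input samples visible to one output of a causal dilated conv stack."""
    field = 1
    for layer in range(num_layers):
        field += (kernel_size - 1) * 2 ** layer   # dilation doubles each layer
    return field
```

Four layers already see 16 past samples; ten layers see 1,024 — which is why a few stacked dilation blocks cover the 1,600-step context above.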


WaveNet – Non-linearity

[Figure: a stack of residual blocks, each a 2×1 dilated convolution → gated activation → 1×1 convolution with a residual connection; skip connections from all blocks are summed and passed through ReLU → 1×1 conv → ReLU → 1×1 conv → 256-way softmax to give p(xₙ | x₀, …, xₙ₋₁).]


WaveNet – Softmax

[Figure: the analog audio signal is sampled and quantized; each sample's amplitude becomes a category index (1–16 in the illustration), so the output is a categorical distribution — a histogram that can be unimodal, multimodal, skewed, ...]

Prof. D. Jurafsky – “Now TTS is the same problem as language modeling!”
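The 256-way categorical target comes from 8-bit quantization of the waveform; the WaveNet paper [24] uses μ-law companding for this, which can be sketched as:

```python
import math

MU = 255  # mu-law parameter for 256 quantization levels

def mu_law_encode(x):
    """Amplitude x in [-1, 1] -> integer category in [0, 255]."""
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    return min(255, int((y + 1.0) / 2.0 * MU + 0.5))

def mu_law_decode(c):
    """Integer category in [0, 255] -> approximate amplitude in [-1, 1]."""
    y = 2.0 * c / MU - 1.0
    return math.copysign(((1.0 + MU) ** abs(y) - 1.0) / MU, y)
```

Companding spends more of the 256 levels on small amplitudes, where the ear is most sensitive, so 8 bits sound far better than linear 8-bit quantization.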


WaveNet – Conditional modelling

[Figure: linguistic features are embedded into a conditioning vector hₙ that is fed, together with past samples, into every residual block, giving p(xₙ | xₙ₋₁, …, xₙ₋₃₂, hₙ, λ).]


WaveNet vs conventional audio generative models

Assumptions in conventional audio generative models


[23, 26, 27, 22]
• Stationary process w/ a fixed-length analysis window
  → estimate the model within a 20–30 ms window with a 5–10 ms shift
• Linear, time-invariant filter within a frame
  → but the relationship between samples can be non-linear
• Gaussian process
  → assumes speech signals are normally distributed

WaveNet
• Sample-by-sample, non-linear, and can take additional inputs
• Arbitrarily shaped signal distributions

State-of-the-art subjective naturalness with WaveNet-based TTS [24]

[Figure: MOS bar chart comparing HMM, LSTM, concatenative, and WaveNet systems.]
Outline

Generative TTS

Generative acoustic models for parametric TTS


Hidden Markov models (HMMs)
Neural networks

Beyond parametric TTS


Learned features
WaveNet
End-to-end

Conclusion & future topics


Relax approximation
Towards Bayesian end-to-end TTS

Integrated end-to-end:

L̂ = arg max_L p(L | W)
λ̂ = arg max_λ p(X | L̂, λ) p(λ)
        ⇓
λ̂ = arg max_λ p(X | W, λ) p(λ)

Text analysis is integrated into the model


Relax approximation
Towards Bayesian end-to-end TTS

Bayesian end-to-end:

λ̂ = arg max_λ p(X | W, λ) p(λ)
x̄ ∼ fₓ(w, λ̂) = p(x | w, λ̂)
        ⇓
x̄ ∼ fₓ(w, X, W) = p(x | w, X, W)
                 = ∫ p(x | w, λ) p(λ | X, W) dλ
                 ≈ (1/K) Σ_{k=1}^{K} p(x | w, λ̂ₖ)   ← ensemble

Marginalize over model parameters & architecture
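The ensemble approximation in the last line can be sketched directly: average the predictive distributions of K independently trained models (toy categorical distributions stand in for p(x | w, λ̂ₖ) here):

```python
def ensemble_predictive(per_model_probs):
    """per_model_probs: K lists over the same classes, each summing to 1.

    Returns the mixture (1/K) * sum_k p_k, itself a valid distribution.
    """
    k = len(per_model_probs)
    return [sum(col) / k for col in zip(*per_model_probs)]

# Three toy models' predictive distributions over two classes:
p = ensemble_predictive([[0.7, 0.3], [0.5, 0.5], [0.6, 0.4]])
```

Averaging distributions (not samples) keeps multimodality: if the models disagree, the ensemble predictive spreads mass across their modes.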


Generative model-based text-to-speech synthesis

• Bayes formulation + factorization + approximations

• Representation: acoustic features, linguistic features, mapping


− Mapping: Rules → HMM → NN
− Feature: Engineered → Unsupervised, learned

• Fewer approximations
− Joint training, direct waveform modelling
− Moving towards integrated & Bayesian end-to-end TTS

Naturalness: concatenative ≤ generative

Flexibility: concatenative ≪ generative (e.g., multiple speakers)




Beyond “text”-to-speech synthesis

TTS on conversational assistants

• Texts aren’t fully contained


• Need more context
− Location to resolve homographs
− User query to put right emphasis

We need representations that can organize the world's information and make it accessible & useful from TTS generative models



Beyond “generative” TTS

Generative model-based TTS


• Model represents process behind speech production
− Trained to minimize error against human-produced speech
− Learned model → speaker

• Speech is for communication


− Goal: maximize the amount of information to be received

Missing: the “listener”
→ include a “listener” in training / in the model itself?



Thanks!



References I

[1] D. Klatt.
Real-time speech synthesis by rule.
Journal of ASA, 68(S1):S18–S18, 1980.

[2] A. Hunt and A. Black.


Unit selection in a concatenative speech synthesis system using a large speech database.
In Proc. ICASSP, pages 373–376, 1996.
[3] K. Tokuda.
Speech synthesis as a statistical machine learning problem.
https://www.sp.nitech.ac.jp/~tokuda/tokuda_asru2011_for_pdf.pdf.
Invited talk given at ASRU 2011.
[4] T. Yoshimura, K. Tokuda, T. Masuko, T. Kobayashi, and T. Kitamura.
Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis.
IEICE Trans. Inf. Syst., J83-D-II(11):2099–2107, 2000.
(in Japanese).

[5] H. Zen, K. Tokuda, and A. Black.


Statistical parametric speech synthesis.
Speech Commn., 51(11):1039–1064, 2009.

[6] H. Zen, A. Senior, and M. Schuster.


Statistical parametric speech synthesis using deep neural networks.
In Proc. ICASSP, pages 7962–7966, 2013.
[7] Y. Fan, Y. Qian, F.-L. Xie, and F. Soong.
TTS synthesis with bidirectional LSTM based recurrent neural networks.
In Proc. Interspeech, pages 1964–1968, 2014.



References II

[8] H. Zen.
Acoustic modeling for speech synthesis: from HMM to RNN.
http://research.google.com/pubs/pub44630.html.
Invited talk given at ASRU 2015.
[9] S. Takaki and J. Yamagishi.
A deep auto-encoder based low-dimensional feature extraction from FFT spectral envelopes for statistical parametric
speech synthesis.
In Proc. ICASSP, pages 5535–5539, 2016.
[10] G. Hinton, J. McClelland, and D. Rumelhart.
Distributed representation.
In D. Rumelhart, J. McClelland, and the PDP Research Group, editors, Parallel distributed processing: Explorations in
the microstructure of cognition. MIT Press, 1986.
[11] H. Zen and A. Senior.
Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis.
In Proc. ICASSP, pages 3872–3876, 2014.
[12] X. Wang, S. Takaki, and J. Yamagishi.
Investigating very deep highway networks for parametric speech synthesis.
In Proc. ISCA SSW9, 2016.
[13] Y. Saito, S. Takamichi, and H. Saruwatari.
Training algorithm to deceive anti-spoofing verification for DNN-based speech synthesis.
In Proc. ICASSP, 2017.
[14] H. Zen, Y. Agiomyrgiannakis, N. Egberts, F. Henderson, and P. Szczepaniak.
Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices.
In Proc. Interspeech, 2016.



References III
[15] P. Muthukumar and A. Black.
A deep learning approach to data-driven parameterizations for statistical parametric speech synthesis.
arXiv:1409.8558, 2014.
[16] P. Wang, Y. Qian, F. Soong, L. He, and H. Zhao.
Word embedding for recurrent neural network based TTS synthesis.
In Proc. ICASSP, pages 4879–4883, 2015.
[17] X. Wang, S. Takaki, and J. Yamagishi.
Investigation of using continuous representation of various linguistic units in neural network-based text-to-speech
synthesis.
IEICE Trans. Inf. Syst., E90-D(12):2471–2480, 2016.

[18] T. Toda and K. Tokuda.


Statistical approach to vocal tract transfer function estimation based on factor analyzed trajectory HMM.
In Proc. ICASSP, pages 3925–3928, 2008.
[19] Y.-J. Wu and K. Tokuda.
Minimum generation error training with direct log spectral distortion on LSPs for HMM-based speech synthesis.
In Proc. Interspeech, pages 577–580, 2008.
[20] R. Maia, H. Zen, and M. Gales.
Statistical parametric speech synthesis with joint estimation of acoustic and excitation model parameters.
In Proc. ISCA SSW7, pages 88–93, 2010.
[21] K. Tokuda and H. Zen.
Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis.
In Proc. ICASSP, pages 4215–4219, 2015.
[22] K. Tokuda and H. Zen.
Directly modeling voiced and unvoiced components in speech waveforms by neural networks.
In Proc. ICASSP, pages 5640–5644, 2016.



References IV

[23] F. Itakura and S. Saito.


A statistical method for estimation of speech spectral density and formant frequencies.
Trans. IEICE, J53A:35–42, 1970.
[24] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and
K. Kavukcuoglu.
WaveNet: A generative model for raw audio.
arXiv:1609.03499, 2016.
[25] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio.
SampleRNN: An unconditional end-to-end neural audio generation model.
arXiv:1612.07837, 2016.
[26] S. Imai and C. Furuichi.
Unbiased estimation of log spectrum.
In Proc. EURASIP, pages 203–206, 1988.
[27] H. Kameoka, Y. Ohishi, D. Mochihashi, and J. Le Roux.
Speech analysis with multi-kernel linear prediction.
In Proc. Spring Conference of ASJ, pages 499–502, 2010.
(in Japanese).



[Figure: graphical models for the successive approximations — (1) Bayesian; (2) auxiliary variables + factorization; (3) joint maximization; (4) step-by-step maximization (e.g., statistical parametric TTS); (5) joint acoustic feature extraction + model training; (6) conditional WaveNet-based TTS; (7) integrated end-to-end; (8) Bayesian end-to-end.]
