Bayesian Speech and Language Processing
With this comprehensive guide you will learn how to apply Bayesian machine learning
techniques systematically to solve various problems in speech and language processing.
The authors address the difficulties of straightforward applications and provide detailed
examples and case studies to demonstrate how you can successfully use practical
Bayesian inference methods to improve the performance of information systems.
Shinji Watanabe received his Ph.D. from Waseda University in 2006. He has been a
research scientist at NTT Communication Science Laboratories, a visiting scholar at
Georgia Institute of Technology and a senior principal member at Mitsubishi Electric
Research Laboratories (MERL), as well as an associate editor of the IEEE Transactions
on Audio, Speech, and Language Processing and an elected member of the
IEEE Speech and Language Processing Technical Committee. He has published more
than 100 papers in journals and conferences, and received several awards including the
Best Paper Award from IEICE in 2003.
Jen-Tzung Chien is with the Department of Electrical and Computer Engineering and the
Department of Computer Science at the National Chiao Tung University, Taiwan, where
he is now the University Chair Professor. He received the Distinguished Research Award
from the Ministry of Science and Technology, Taiwan, and the Best Paper Award of the
2011 IEEE Automatic Speech Recognition and Understanding Workshop. He currently serves
as an elected member of the IEEE Machine Learning for Signal Processing
Technical Committee.
Contents
Preface
Notation and abbreviations
1 Introduction
1.1 Machine learning and speech and language processing
1.2 Bayesian approach
1.3 History of Bayesian speech and language processing
1.4 Applications
1.5 Organization of this book
2 Bayesian approach
2.1 Bayesian probabilities
2.1.1 Sum and product rules
2.1.2 Prior and posterior distributions
2.1.3 Exponential family distributions
2.1.4 Conjugate distributions
2.1.5 Conditional independence
2.2 Graphical model representation
2.2.1 Directed graph
2.2.2 Conditional independence in graphical model
2.2.3 Observation, latent variable, non-probabilistic variable
2.2.4 Generative process
2.2.5 Undirected graph
2.2.6 Inference on graphs
2.3 Difference between ML and Bayes
2.3.1 Use of prior knowledge
2.3.2 Model selection
2.3.3 Marginalization
2.4 Summary
References
Index
Preface
Acknowledgments
First we want to thank all of our colleagues and research friends, especially members of
NTT Communication Science Laboratories, Mitsubishi Electric Research Laboratories
(MERL), National Cheng Kung University, IBM T. J. Watson Research Center, and
National Chiao Tung University (NCTU). Some of the studies in this book were actu-
ally conducted when the authors were working in these institutes. We also would like to
thank many people for reading a draft and giving us valuable comments which greatly
improved this book, including Tawara Naohiro, Yotaro Kubo, Seong-Jun Hahm, Yu
Tsao, and all of the students from the Machine Learning Laboratory at NCTU. We are
very grateful for support from Anthony Vetro, John R. Hershey, and Jonathan Le Roux
at MERL, and Sin-Horng Chen, Hsueh-Ming Hang, Yu-Chee Tseng, and Li-Chun Wang
at NCTU. The great efforts of the editors of Cambridge University Press, Phil Meyler,
Sarah Marsh, and Heather Brolly, are also appreciated. Finally, we would like to thank
our families for their support throughout our research lives.
Shinji Watanabe
Jen-Tzung Chien
General notation
This book uses the following general mathematical notation throughout, to avoid any
confusion:
B = {true, false} : Set of Boolean values
Z+ = {1, 2, · · · } : Set of positive integers
R : Set of real numbers
R>0 : Set of positive real numbers
R^D : Set of D-dimensional real vectors
Σ* : Set of all possible strings composed of letters
∅ : Empty set
a : Scalar variable
a (boldface) : Vector variable
a = [a1, · · · , aN]^⊤ : A vector is treated as a column vector with elements a1, · · · , aN
A (boldface capital) : Matrix variable, e.g., A = [a b; c d] in the 2 × 2 case
I_D : D × D identity matrix
|A| : Determinant of matrix A
tr[A] : Trace of matrix A
A, 𝒜 : Set or sequential variable
A = {a1, · · · , aN} = {an}_{n=1}^{N} : Set (or sequence) of N elements a1, · · · , aN
A = {an} : Abbreviated form of the above, used when the range of the index n is trivial
f(x) or f_x : Function of x
p(x) or q(x) : Probability distribution (or density) of x
F[f] : Functional of f. Note that a functional uses the square brackets [·] while a function uses the brackets (·)
E_{p(x)}[f(x)] : Expectation of f(x) with respect to the distribution p(x)
E[f(x)] : Another form of the expectation of f(x), where the subscript with the probability distribution and/or the conditional variable is omitted when it is trivial
δ(a, a′) : Kronecker delta, i.e., δ(a, a′) = 1 if a = a′ and 0 otherwise
δ(x − x′) : Dirac delta function
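As a brief illustration of the expectation and delta notation above (our own example, not part of the book's notation list), the following standard identities hold:

\mathbb{E}_{p(x)}[f(x)] = \int f(x)\, p(x)\, dx,
\qquad
\sum_{a'} \delta(a, a')\, f(a') = f(a),
\qquad
\int \delta(x - x')\, f(x')\, dx' = f(x).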
We also list the notation specific to speech and language processing. This book tries to
maintain consistency by reusing the same notation wherever possible, while also following
the notation commonly used in each application area. As a consequence, since many
variables need to be introduced, some characters denote different variables in different contexts.
Common notation
Set of model parameters
M : Model variable, including types of models, structure, hyperparameters, etc.
Set of hyperparameters
Q(·|·) : Auxiliary function (e.g., of the EM algorithm)
H : Hessian matrix
Acoustic modeling
T ∈ Z+ : Number of speech frames
t ∈ {1, · · · , T} : Speech frame (time) index
o_t ∈ R^D : D-dimensional feature (observation) vector at frame t
O = {o_t | t = 1, · · · , T} : Sequence of observation vectors
J ∈ Z+ : Number of unique HMM states in an HMM
s_t ∈ {1, · · · , J} : HMM state at frame t
S = {s_t | t = 1, · · · , T} : HMM state sequence
K ∈ Z+ : Number of unique mixture components in a GMM
v_t ∈ {1, · · · , K} : GMM mixture component at frame t
V = {v_t | t = 1, · · · , T} : Mixture component sequence
α_t(j) ∈ [0, 1] : Forward probability of observing {o_1, · · · , o_t} and being in state j at time t
β_t(j) ∈ [0, 1] : Backward probability of the remaining observations {o_{t+1}, · · · , o_T} given state j at time t
δ_t(j) ∈ [0, 1] : The highest probability along a single path at time t, which accounts for the previous observations {o_1, · · · , o_t} and ends in state j at time t
ξ_t(i, j) ∈ [0, 1] : Posterior probability of being in state i at time t and in state j at time t + 1
γ_t(j, k) ∈ [0, 1] : Posterior (occupancy) probability of state j and mixture component k at time t
π_j ∈ [0, 1] : Initial state probability of state j
a_ij ∈ [0, 1] : State transition probability from state i to state j
ω_jk ∈ [0, 1] : Mixture weight of component k in state j
μ_jk ∈ R^D : Mean vector of the Gaussian for component k in state j
Σ_jk ∈ R^{D×D} : Covariance matrix of the Gaussian for component k in state j
R_jk ∈ R^{D×D} : Precision matrix of the Gaussian for component k in state j
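As a quick illustration of how these acoustic-modeling quantities fit together (the standard HMM/GMM forward and Viterbi recursions, stated here for reference rather than quoted from the book's table), one can write

\alpha_t(j) = \Big[\sum_{i=1}^{J} \alpha_{t-1}(i)\, a_{ij}\Big] \sum_{k=1}^{K} \omega_{jk}\, \mathcal{N}(o_t \mid \mu_{jk}, \Sigma_{jk}),
\qquad
\delta_t(j) = \Big[\max_{i} \delta_{t-1}(i)\, a_{ij}\Big] \sum_{k=1}^{K} \omega_{jk}\, \mathcal{N}(o_t \mid \mu_{jk}, \Sigma_{jk}),

with initialization \alpha_1(j) = \delta_1(j) = \pi_j \sum_{k} \omega_{jk}\, \mathcal{N}(o_1 \mid \mu_{jk}, \Sigma_{jk}). The posteriors \gamma_t(j, k) and \xi_t(i, j) then follow from combining the forward probabilities \alpha with the backward probabilities \beta.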
Language modeling
w ∈ Σ* : Category (e.g., a word in most cases, sometimes a phoneme). The element is represented by a string in Σ* (e.g., "I" and "apple" for words and /a/ and /k/ for phonemes), or by a natural number in Z+ when the elements of the categories are numbered
V ⊂ Σ* : Vocabulary (dictionary), i.e., a set of distinct words, which is a subset of Σ*
|V| : Vocabulary size
v ∈ {1, · · · , |V|} : Vocabulary (word type) index
w^(v) ∈ V : The vth word in the vocabulary, so that {w^(v) | v = 1, · · · , |V|} = V
J ∈ Z+ : Number of categories in a chunk (e.g., number of words in a sentence, or number of phonemes or HMM states in a speech segment)
i ∈ {1, · · · , J} : Position index within a chunk
w_i ∈ V : Word (category) at position i
W = {w_i | i = 1, · · · , J} : Word (category) sequence, e.g., a sentence
w_{i−n+1}^{i} = {w_{i−n+1}, · · · , w_i} : Subsequence of n consecutive words ending at position i
p(w_i | w_{i−n+1}^{i−1}) ∈ [0, 1] : n-gram probability of word w_i given its history w_{i−n+1}^{i−1}
c(w_{i−n+1}^{i−1}) ∈ Z+ : Count of the word sequence w_{i−n+1}^{i−1} in the training data
λ_{w_{i−n+1}^{i−1}} : Coefficient (e.g., interpolation or discount weight) associated with the history w_{i−n+1}^{i−1} in n-gram smoothing
M ∈ Z+ : Number of documents
m ∈ {1, · · · , M} : Document index
d_m : The mth document
c(w^(v), d_m) ∈ Z+ : Count of word w^(v) in document d_m
K ∈ Z+ : Number of unique latent topics
z_i ∈ {1, · · · , K} : Latent topic assigned to position i
Z = {z_i | i = 1, · · · , J} : Latent topic sequence
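For concreteness, a sketch of how these language-modeling quantities are typically combined (the standard maximum-likelihood and interpolated n-gram estimates, given as an illustration; the book's specific smoothing schemes may differ):

p_{\mathrm{ML}}(w_i \mid w_{i-n+1}^{i-1}) = \frac{c(w_{i-n+1}^{i})}{c(w_{i-n+1}^{i-1})},
\qquad
p(w_i \mid w_{i-n+1}^{i-1}) = \lambda_{w_{i-n+1}^{i-1}}\, p_{\mathrm{ML}}(w_i \mid w_{i-n+1}^{i-1}) + \big(1 - \lambda_{w_{i-n+1}^{i-1}}\big)\, p(w_i \mid w_{i-n+2}^{i-1}),

where the lower-order model p(w_i \mid w_{i-n+2}^{i-1}) is defined recursively in the same way.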
Abbreviations