Lecture 2 (Part I)
CS 753
Instructor: Preethi Jyothi
Recall: Statistical ASR

Let W denote a word sequence and O the sequence of acoustic feature vectors. An ASR decoder solves the following problem:

    W* = arg max_W Pr(W | O) = arg max_W Pr(O | W) P(W)

The acoustic model Pr(O | W) scores O via the sub-word units Q corresponding to the word sequence W, and the language model P(W) provides a prior probability for W.

[Figure: word-level HMMs for "down", "left" and "right", each with states 0-4, transition probabilities a01, a12, a23, a34 (plus self-loops a11, a22, a33) and observation distributions b1(·), b2(·), b3(·); each HMM scores the acoustic feature sequence O1 O2 O3 O4 ... OT to give Pr(O | "down"), Pr(O | "left") and Pr(O | "right").]

Acoustic model: The most commonly used acoustic models in ASR systems today are Hidden Markov Models (HMMs). Please refer to Rabiner (1989) for a comprehensive tutorial of HMMs and their applicability to ASR in the 1980's (with ideas that are largely applicable to systems today). HMMs are used to build probabilistic models for linear sequence labeling problems. Since speech is represented in the form of a sequence of acoustic vectors O, it lends itself to be naturally modeled using HMMs.

Figure 2.1: Standard topology used to represent a phone HMM.

The HMM is defined by specifying transition probabilities (a_ij) and observation (or emission) probability distributions (b_j(O_t)), along with the number of hidden states in the HMM. An HMM makes a transition from state i to state j with probability a_ij. On reaching state j, the observation vector at that state is generated according to the emission distribution b_j.
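To make the definition concrete, here is a minimal Python sketch of such a left-to-right HMM as a container of transition probabilities a_ij and emission distributions b_j. This is an illustrative sketch, not code from the lecture: the class name, state labels and all numbers are made up, and a real acoustic model would use continuous densities for b_j (see the Gaussian models later in this lecture) rather than the toy discrete table used here.

from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class PhoneHMM:
    """Toy HMM container: hidden states, transition probs a_ij, emission probs b_j(o).
    (Hypothetical sketch; the numbers below are illustrative only.)"""
    states: List[int]                      # e.g. 0..4, with 0 and 4 as non-emitting entry/exit states
    trans: Dict[Tuple[int, int], float]    # a_ij = P(next state j | current state i)
    emit: Dict[Tuple[int, str], float]     # b_j(o) = P(observation o | state j), toy discrete version

    def a(self, i: int, j: int) -> float:
        return self.trans.get((i, j), 0.0)

    def b(self, j: int, o: str) -> float:
        return self.emit.get((j, o), 0.0)

# Left-to-right topology as in Figure 2.1, with self-loops on the emitting states 1-3:
hmm_down = PhoneHMM(
    states=[0, 1, 2, 3, 4],
    trans={(0, 1): 1.0,
           (1, 1): 0.6, (1, 2): 0.4,
           (2, 2): 0.5, (2, 3): 0.5,
           (3, 3): 0.7, (3, 4): 0.3},
    emit={(1, "o_a"): 0.5, (1, "o_b"): 0.5,
          (2, "o_a"): 0.2, (2, "o_b"): 0.8,
          (3, "o_a"): 0.9, (3, "o_b"): 0.1},
)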
What are Hidden Markov Models (HMMs)?
The three factors that are multiplied in extending the previous paths to compute the forward probability at time t are:

α_{t-1}(i)   the previous forward path probability from the previous time step
a_ij         the transition probability from previous state q_i to current state q_j
b_j(o_t)     the state observation likelihood of the observation symbol o_t given the current state j

[Figure: forward trellis for the two-state hot/cold (H, C) ice-cream HMM over the observation sequence 3 1 3; e.g. α_1(1) = .02, α_1(2) = .32 and α_2(1) = .32*.2 + .02*.25 = .069.]
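As a quick sanity check of the worked number above, here is the same arithmetic in Python, assuming the two-state hot/cold ice-cream HMM from the trellis (π_H = .8, π_C = .2, P(3|H) = .4, P(3|C) = .1, P(C|H) = .4, P(C|C) = .5, P(1|C) = .5); this is a sketch of the slide's arithmetic, not lecture code:

# t = 1, observation o1 = 3 ice creams
alpha1_H = 0.8 * 0.4          # pi_H * P(3|H) = .32
alpha1_C = 0.2 * 0.1          # pi_C * P(3|C) = .02

# t = 2, observation o2 = 1: forward probability of being in state C
alpha2_C = alpha1_H * 0.4 * 0.5 + alpha1_C * 0.5 * 0.5
print(alpha2_C)               # .32*.2 + .02*.25 ≈ 0.069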
Visualizing the forward recursion
    α_t(j) = Σ_i α_{t-1}(i) a_ij b_j(o_t)

[Figure A.7: the forward trellis; each cell α_t(j) is computed by summing, over all states q_i at time t-1, the previous forward probability α_{t-1}(i) times the transition probability a_ij times the observation likelihood b_j(o_t).]
Forward Algorithm
The forward algorithm, where forward[s,t] represents α_t(s):

1. Initialization:
   α_1(j) = π_j b_j(o_1)                                1 ≤ j ≤ N

2. Recursion:
   α_t(j) = Σ_{i=1}^{N} α_{t-1}(i) a_ij b_j(o_t)        1 ≤ j ≤ N, 1 < t ≤ T

3. Termination:
   P(O | λ) = Σ_{i=1}^{N} α_T(i)
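Below is a compact Python sketch of these three steps (my own implementation, not code from the lecture); it assumes `pi` is a length-N list of initial probabilities π_j, `A` an N×N transition matrix with A[i][j] = a_ij, and `B(j, o)` a function returning the emission likelihood b_j(o):

def forward(obs, pi, A, B):
    """Compute P(O | lambda) with the forward algorithm.
    obs : list of observations o_1..o_T
    pi  : list of N initial probabilities pi_j
    A   : N x N nested list, A[i][j] = a_ij
    B   : function B(j, o) returning the emission likelihood b_j(o)
    """
    N, T = len(pi), len(obs)
    # 1. Initialization: alpha_1(j) = pi_j * b_j(o_1)
    alpha = [pi[j] * B(j, obs[0]) for j in range(N)]
    # 2. Recursion: alpha_t(j) = sum_i alpha_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, T):
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B(j, obs[t])
                 for j in range(N)]
    # 3. Termination: P(O | lambda) = sum_i alpha_T(i)
    return sum(alpha)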
Three problems for HMMs

Now that we have seen the structure of an HMM, we turn to algorithms for computing things with them. An influential tutorial by Rabiner (1989), based on tutorials by Jack Ferguson in the 1960s, introduced the idea that hidden Markov models should be characterized by three fundamental problems: computing the likelihood P(O | λ) of an observation sequence (the forward algorithm), decoding the most likely hidden state sequence (the Viterbi algorithm), and learning the HMM parameters from data.

Decoding: The Viterbi Algorithm
Viterbi Path Probability

Note that we represent the most probable path by taking the maximum over all possible previous state sequences, max_{q_1,...,q_{t-1}}. Like other dynamic programming algorithms, Viterbi fills each cell recursively. Given that we had already computed the Viterbi probability of being in every state at time t-1, we compute the Viterbi probability by taking the most probable of the extensions of the paths that lead to the current cell. For a given state q_j at time t, the value v_t(j) is computed as

    v_t(j) = max_{i=1}^{N} v_{t-1}(i) a_ij b_j(o_t)

The three factors that are multiplied in extending the previous paths to compute the Viterbi probability at time t are:

v_{t-1}(i)   the previous Viterbi path probability from the previous time step
a_ij         the transition probability from previous state q_i to current state q_j
b_j(o_t)     the state observation likelihood of the observation symbol o_t given the current state j

[Figure: Viterbi trellis for the two-state (C, H) ice-cream HMM over the observation sequence 3 1 3; e.g. v_1(1) = .02, v_1(2) = .32, v_2(1) = max(.32*.20, .02*.25) = .064 and v_2(2) = max(.32*.12, .02*.10) = .038.]
Viterbi recursion

Finally, we can give a formal definition of the Viterbi recursion as follows:

1. Initialization:
   v_1(j) = π_j b_j(o_1)                                 1 ≤ j ≤ N
   bt_1(j) = 0                                           1 ≤ j ≤ N

2. Recursion:
   v_t(j)  = max_{i=1}^{N} v_{t-1}(i) a_ij b_j(o_t)      1 ≤ j ≤ N, 1 < t ≤ T
   bt_t(j) = argmax_{i=1}^{N} v_{t-1}(i) a_ij b_j(o_t)   1 ≤ j ≤ N, 1 < t ≤ T

3. Termination:
   The best score:          P* = max_{i=1}^{N} v_T(i)
   The start of backtrace:  q_T* = argmax_{i=1}^{N} v_T(i)
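A Python sketch of this recursion, including the backpointers bt_t(j) and the final backtrace (my own implementation under the same `pi`, `A`, `B` conventions as the forward sketch above; not lecture code):

def viterbi(obs, pi, A, B):
    """Return (best_path, best_score) for the observation sequence obs.
    pi[j] = initial prob, A[i][j] = a_ij, B(j, o) = emission likelihood b_j(o)."""
    N, T = len(pi), len(obs)
    # 1. Initialization: v_1(j) = pi_j * b_j(o_1); backpointer bt_1(j) = 0
    v = [pi[j] * B(j, obs[0]) for j in range(N)]
    backptrs = []
    # 2. Recursion: v_t(j) = max_i v_{t-1}(i) * a_ij * b_j(o_t)
    for t in range(1, T):
        new_v, bt = [], []
        for j in range(N):
            best_i = max(range(N), key=lambda i: v[i] * A[i][j])
            bt.append(best_i)
            new_v.append(v[best_i] * A[best_i][j] * B(j, obs[t]))
        v = new_v
        backptrs.append(bt)
    # 3. Termination: best score P* and start of backtrace q_T*
    q_T = max(range(N), key=lambda i: v[i])
    best_score = v[q_T]
    # Backtrace: follow the backpointers from q_T* back to t = 1
    path = [q_T]
    for bt in reversed(backptrs):
        path.append(bt[path[-1]])
    path.reverse()
    return path, best_score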
Viterbi backtrace

We compute this best state sequence by keeping track of the path of hidden states that led to each state, as suggested in Fig. A.10, and then at the end backtracing the best path to the beginning (the Viterbi backtrace).

[Figure A.10: The Viterbi backtrace on the two-state (C, H) ice-cream trellis over the observation sequence 3 1 3. As we extend each path to a new state to account for the next observation, we keep a backpointer to the best path that led us to this state.]
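Tying this to the trellis above, the `viterbi` sketch from the previous slide can be run on the ice-cream example, assuming state 0 = C, state 1 = H and the probabilities that appear in the figures (π = (.2, .8), P(C|C) = P(H|C) = .5, P(C|H) = .4, P(H|H) = .6, P(1|C) = .5, P(3|C) = .1, P(1|H) = .2, P(3|H) = .4):

states = ["C", "H"]                       # state 0 = cold, state 1 = hot
pi = [0.2, 0.8]
A = [[0.5, 0.5],                          # from C: P(C|C), P(H|C)
     [0.4, 0.6]]                          # from H: P(C|H), P(H|H)
emit = {("C", 1): 0.5, ("C", 3): 0.1,     # P(# ice creams | state)
        ("H", 1): 0.2, ("H", 3): 0.4}
B = lambda j, o: emit[(states[j], o)]

path, score = viterbi([3, 1, 3], pi, A, B)
print([states[j] for j in path], score)   # expected: ['H', 'C', 'H'] with score ≈ 0.0128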
Gaussian Observation Model
• So far, we considered HMMs with discrete outputs
• In acoustic models, HMMs output real valued vectors
• Hence, observation probabilities are defined using probability density functions
• A widely used model: Gaussian distribution
    N(x | μ, σ²) = (1 / √(2πσ²)) exp( -(x - μ)² / (2σ²) )

• HMM emission/observation probabilities b_j(x) = N(x | μ_j, σ_j²), where μ_j is the mean associated with state j and σ_j² is its variance

[Plot: univariate Gaussian densities φ_{μ,σ²}(x) on x ∈ [-5, 5] for (μ = 0, σ² = 0.2), (μ = 0, σ² = 1.0), (μ = 0, σ² = 5.0) and (μ = -2, σ² = 0.5).]
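As a small illustration of such a per-state Gaussian emission (a sketch with made-up `mu`/`var` values, not lecture code):

import math

def gaussian_pdf(x, mu, var):
    """Univariate Gaussian density N(x | mu, var)."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# b_j(x) = N(x | mu_j, var_j): one mean/variance per HMM state j (toy numbers)
mu = [0.0, -2.0, 1.5]
var = [1.0, 0.5, 5.0]

def b(j, x):
    return gaussian_pdf(x, mu[j], var[j])

print(b(0, 0.0))   # densest at the state-0 mean, ~0.3989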
Gaussian Mixture Model
• A single Gaussian observation model assumes that
the observed acoustic feature vectors are unimodal
• More generally, we use a “mixture of Gaussians” to
model multiple modes in the data
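With a mixture, the observation density for state j becomes a weighted sum of Gaussian components, b_j(x) = Σ_m c_{jm} N(x | μ_{jm}, σ_{jm}²), with mixture weights c_{jm} ≥ 0 summing to 1. The sketch below (not lecture code, with made-up parameters) evaluates such a bimodal density:

import math

def gaussian_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def gmm_pdf(x, weights, means, variances):
    """Mixture of Gaussians: sum_m c_m * N(x | mu_m, var_m)."""
    return sum(c * gaussian_pdf(x, m, v)
               for c, m, v in zip(weights, means, variances))

# A bimodal emission density for one HMM state (illustrative parameters):
weights   = [0.4, 0.6]
means     = [-1.0, 2.0]
variances = [0.5, 1.0]
print(gmm_pdf(0.0, weights, means, variances))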