Fundamentals of Speech Recognition
Suggested Project
The Hidden Markov Model

1. Project Introduction
For this project, it is proposed that you design and implement a hidden Markov model
(HMM) that optimally matches the behavior of a set of training sequences that will be
provided to you as part of this project. The goal will be to use the standard set of
forward-backward estimation algorithms to optimally determine the best (maximum
likelihood) HMM that matches a given set of training data. You are also asked to use the
Viterbi algorithm to estimate the model parameters, and to compare and contrast the
results of the two methods. You may also want to investigate the effects of using
subsets of the training data on the estimated models. Finally, if time permits, you are
asked to design your own sequence generator and determine the effects of changing the
training sequence characteristics on the estimated models.
Model 1:
number of states, Q=5
type of density of observations: discrete
number of observation possibilities, K=5
type of model: ergodic
state transition matrix density: random
state observation matrix density: random
state prior density: random
Model 2:
number of states, Q=5
type of density of observations: discrete
number of observation possibilities, K=5
type of model: ergodic
state transition matrix density: skewed
state observation matrix density: skewed
state prior density: skewed
Model 3:
number of states, Q=5
type of density of observations: discrete
number of observation possibilities, K=5
type of model: left-right
state transition matrix density: constrained
Model 4:
number of states, Q=5
type of density of observations: discrete
number of observation possibilities, K=5
type of model: left-right
state transition matrix density: constrained
state observation matrix density: skewed
state prior density: constrained
A plot showing the generic structure of both the ergodic model (Figure 1) and the
left-right model (Figure 2) is given below. (Note that only a subset of the state
transitions is shown in each figure, since showing all possible state-to-state transition
paths would make the figures very cluttered.)
Figure 1: 5-state ergodic model (only some of the actual state transitions shown in the
figure)
For this project there are four training sets (available from the course website); the
first two are labeled:

1. hmm_observations_ergodic_random.mat
2. hmm_observations_ergodic_skewed.mat

The first two mat files contain training sequences for the ergodic model, with either
random entries for the state prior density, the state transition matrix, and the state
observation matrix (the first mat file), or skewed entries for all three (the second mat
file). The second two mat files contain training sequences for the left-right model,
with either random entries for the state observation matrix (the third mat file) or
skewed entries for the state observation matrix (the fourth mat file). The model state
prior density and state transition matrix are both highly skewed for both the third and
fourth mat files. Again it is noted that for left-right models, the durations of the
training sequences are variable and are specified in the duration array. For the
ergodic models, the duration of each training sequence is fixed at T=100 observations.
The goal of the forward-backward algorithm is to determine the values of the state prior
density, the state transition matrix, and the state observation density that maximize
the likelihood of the training data. The procedure is as follows:

1. Obtain initial estimates of the state prior density, the state transition matrix, and
the state observation matrix from either random guesses, or from skewed initial
conditions. The state observation matrix is of the form:

$$
b = \begin{bmatrix}
b_1(1) & b_1(2) & \cdots & b_1(K) \\
\vdots & \vdots & \ddots & \vdots \\
b_Q(1) & b_Q(2) & \cdots & b_Q(K)
\end{bmatrix}
$$

where $b_i(k)$ is the probability of observing the $k$-th symbol in state $i$.
2. Use as the training set one of the sets of observations read in earlier. We denote the
complete training set of nex sequences of observations, with $T_n$ being the duration of
the n-th observation sequence, as:

$$
O = \{O_t^n\}, \quad t = 1, 2, \ldots, T_n, \quad n = 1, 2, \ldots, nex
$$
Initialization Step:

$$
\beta_{T_n}(i, n) = 1, \quad i = 1, 2, \ldots, Q \quad (\text{unscaled } \beta)
$$

Iteration Step:

for $t = T_n - 1, T_n - 2, \ldots, 1$:

$$
\beta_t(i, n) = \sum_{j=1}^{Q} a_{ij}\, \hat{\beta}_{t+1}(j, n)\, b_j(O_{t+1}^n),
\quad i = 1, 2, \ldots, Q
$$
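As a concrete illustration, here is a minimal sketch of the scaled forward and backward passes for a single observation sequence, written in Python/NumPy. It assumes `pi`, `A`, and `B` are NumPy arrays holding the state prior density, state transition matrix, and state observation matrix, and that observation symbols are 0-indexed integers; the function name is illustrative, not part of the project specification.

```python
import numpy as np

def forward_backward(pi, A, B, obs):
    """One scaled forward-backward pass over a single observation
    sequence `obs` (integer symbols 0..K-1).  Returns the scaled
    alpha and beta arrays plus the scaling coefficients c(t)."""
    T, Q = len(obs), len(pi)
    alpha = np.zeros((T, Q))
    beta = np.zeros((T, Q))
    c = np.zeros(T)

    # forward pass with scaling: each alpha row is normalized to sum
    # to one, and the normalizers c(t) are saved for the likelihood
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]

    # backward pass, scaled by the same coefficients
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]

    return alpha, beta, c
```

With this scaling convention, the product of the c(t) equals the sequence likelihood, so the negative log likelihood is simply `-np.log(c).sum()`, matching the scoring formula used later in this handout.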
5. Re-estimate the state prior density, the state transition matrix and the state
observation matrix, using the relations:

1. re-estimation of state prior density:

$$
\pi_i = \frac{1}{nex} \sum_{n=1}^{nex} \gamma_1(i, n), \quad i = 1, 2, \ldots, Q
$$

2. re-estimation of state transition matrix:

$$
a_{ij} = \frac{\displaystyle\sum_{n=1}^{nex} \sum_{t=1}^{T_n - 1} \xi_t(i, j, n)}
{\displaystyle\sum_{n=1}^{nex} \sum_{t=1}^{T_n - 1} \gamma_t(i, n)},
\quad i = 1, 2, \ldots, Q, \; j = 1, 2, \ldots, Q
$$

3. re-estimation of state observation matrix:

$$
b_j(k) = \frac{\displaystyle\sum_{n=1}^{nex} \sum_{\substack{t=1 \\ O_t^n = k}}^{T_n}
\gamma_t(j, n)}{\displaystyle\sum_{n=1}^{nex} \sum_{t=1}^{T_n} \gamma_t(j, n)},
\quad j = 1, 2, \ldots, Q, \; k = 1, 2, \ldots, K
$$

The training sequence likelihood score is computed from the scaling coefficients as:

$$
L_{training} = -\sum_{n=1}^{nex} \sum_{t=1}^{T_n} \log\left(c(t, n)\right)
$$
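A single Baum-Welch re-estimation pass implementing the relations above can be sketched as follows. This is a self-contained illustration assuming NumPy arrays for `pi`, `A`, `B` and a list of 0-indexed integer observation sequences; it computes the scaled forward-backward quantities inline and returns the updated densities plus the negative log likelihood score.

```python
import numpy as np

def baum_welch_step(pi, A, B, sequences):
    """One forward-backward re-estimation pass over a list of integer
    observation sequences (symbols 0..K-1).  Accumulates the gamma and
    xi statistics across sequences and applies the re-estimation
    formulas for the prior, transition, and observation densities."""
    Q, K = B.shape
    pi_num = np.zeros(Q)
    a_num = np.zeros((Q, Q)); a_den = np.zeros(Q)
    b_num = np.zeros((Q, K)); b_den = np.zeros(Q)
    neg_log_like = 0.0
    for obs in sequences:
        T = len(obs)
        alpha = np.zeros((T, Q)); beta = np.zeros((T, Q)); c = np.zeros(T)
        # scaled forward pass
        alpha[0] = pi * B[:, obs[0]]
        c[0] = alpha[0].sum(); alpha[0] /= c[0]
        for t in range(1, T):
            alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
            c[t] = alpha[t].sum(); alpha[t] /= c[t]
        # scaled backward pass
        beta[T - 1] = 1.0
        for t in range(T - 2, -1, -1):
            beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
        gamma = alpha * beta          # with this scaling, rows sum to 1
        neg_log_like -= np.log(c).sum()
        # accumulate the numerators and denominators of the formulas
        pi_num += gamma[0]
        for t in range(T - 1):
            a_num += alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1] / c[t + 1]
            a_den += gamma[t]
        for t in range(T):
            b_num[:, obs[t]] += gamma[t]
            b_den += gamma[t]
    return (pi_num / len(sequences),
            a_num / a_den[:, None],
            b_num / b_den[:, None],
            neg_log_like)
```

In practice this step is iterated until the likelihood score stops improving; the EM guarantee means the negative log likelihood never increases from one pass to the next.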
An interesting feature that can be used when the actual generator sequence is known is to
do a single forward-backward pass using the actual generator state prior density, the
actual state transition matrix, and the actual state observation matrix, thereby giving the
training sequence likelihood for the actual generator. Interestingly, the best estimated
model can have a training sequence likelihood that is actually higher than that of the
generator sequence—especially when the actual sequences are generated from randomly
generated state priors, state transition matrices, and state observation matrices. However,
most of the time, the training sequence likelihoods are worse than that of the generator
sequence, at least until the re-estimation procedure has converged to a good optimum
rather than getting stuck at a poor local one.
The Viterbi algorithm, carried out in the log domain, is as follows:

Initialization ($t = 1$):

$$
\delta_1(i, n) = \log \pi_i + \log b_i(O_1^n), \quad i = 1, 2, \ldots, Q
$$

$$
\psi_1(i, n) = 0, \quad i = 1, 2, \ldots, Q
$$

Recursion:

for $t = 2, 3, \ldots, T_n$:

$$
\delta_t(j, n) = \max_{1 \le i \le Q} \left[ \delta_{t-1}(i, n) + \log a_{ij} \right]
+ \log b_j(O_t^n), \quad j = 1, 2, \ldots, Q
$$

$$
\psi_t(j, n) = \arg\max_{1 \le i \le Q} \left[ \delta_{t-1}(i, n) + \log a_{ij} \right],
\quad j = 1, 2, \ldots, Q
$$

Termination:

$$
P^*(n) = \max_{1 \le i \le Q} \delta_{T_n}(i, n), \qquad
q_{T_n}^*(n) = \arg\max_{1 \le i \le Q} \delta_{T_n}(i, n)
$$

Backtracking:

$$
q_t^*(n) = \psi_{t+1}\left(q_{t+1}^*(n), n\right), \quad t = T_n - 1, T_n - 2, \ldots, 1
$$
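A minimal log-domain Viterbi sketch in Python/NumPy follows, assuming the same `pi`, `A`, `B` array conventions as before; the small probability floor is an illustrative way to avoid taking logs of structurally zero entries (one of the computational issues mentioned later in this handout).

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Log-domain Viterbi alignment for one integer observation
    sequence.  Returns the best log score and the best state path."""
    T, Q = len(obs), len(pi)
    # floor avoids log(0) for zero-valued transition/observation entries
    log_pi, log_A, log_B = (np.log(np.maximum(x, 1e-300))
                            for x in (pi, A, B))
    delta = np.zeros((T, Q))
    psi = np.zeros((T, Q), dtype=int)
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # rows: from, cols: to
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    # backtrack the best state path
    path = np.zeros(T, dtype=int)
    path[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1][path[t + 1]]
    return delta[-1].max(), path
```

Working in the log domain turns the products of the probability-domain recursion into sums, which avoids the numerical underflow that otherwise occurs for long sequences.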
Once the Viterbi alignment of the training sequences with the current model has been
obtained, we have essentially uncovered the hidden parts of the model, so we have a
unique assignment of each observation to a distinct model state. Hence the re-estimation
of the state prior density, the state transition matrix, and the state observation matrix is
considerably simpler than for the forward-backward method, and can be stated simply as
follows:
$$
\pi_i = \frac{1}{nex} \sum_{n=1}^{nex} c[q_1^*(n) = i], \quad i = 1, 2, \ldots, Q
$$

where $c[q_1^*(n) = i] = 1$ when $q_1^*(n) = i$ and it is zero otherwise;

$$
a_{ij} = \frac{\displaystyle\sum_{n=1}^{nex} \sum_{t=1}^{T_n - 1}
c[q_t^*(n) = i,\, q_{t+1}^*(n) = j]}
{\displaystyle\sum_{n=1}^{nex} \sum_{t=1}^{T_n - 1} c[q_t^*(n) = i]},
\quad i = 1, 2, \ldots, Q, \; j = 1, 2, \ldots, Q
$$

where $c[q_t^*(n) = i,\, q_{t+1}^*(n) = j] = 1$ when $q_t^*(n) = i$ and
$q_{t+1}^*(n) = j$, and it is zero otherwise;

$$
b_j(k) = \frac{\displaystyle\sum_{n=1}^{nex} \sum_{t=1}^{T_n}
c[q_t^*(n) = j,\, O_t^n = k]}
{\displaystyle\sum_{n=1}^{nex} \sum_{t=1}^{T_n} c[q_t^*(n) = j]},
\quad j = 1, 2, \ldots, Q, \; k = 1, 2, \ldots, K
$$
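The count-based Viterbi re-estimation above can be sketched as follows, assuming each training sequence has already been aligned with a best state path (e.g., by the Viterbi algorithm); the function and argument names are illustrative.

```python
import numpy as np

def viterbi_reestimate(paths, sequences, Q, K):
    """Count-based re-estimation from Viterbi state alignments:
    paths[n] is the best state path for observation sequence
    sequences[n] (integer symbols 0..K-1)."""
    nex = len(paths)
    pi = np.zeros(Q)
    a_num = np.zeros((Q, Q)); a_den = np.zeros(Q)
    b_num = np.zeros((Q, K)); b_den = np.zeros(Q)
    for path, obs in zip(paths, sequences):
        pi[path[0]] += 1.0                       # initial-state counts
        for t in range(len(obs) - 1):            # transition counts
            a_num[path[t], path[t + 1]] += 1.0
            a_den[path[t]] += 1.0
        for t in range(len(obs)):                # emission counts
            b_num[path[t], obs[t]] += 1.0
            b_den[path[t]] += 1.0
    # guard against states never visited in any alignment
    a_den[a_den == 0] = 1.0
    b_den[b_den == 0] = 1.0
    return pi / nex, a_num / a_den[:, None], b_num / b_den[:, None]
```

Because each observation is hard-assigned to a single state, the update reduces to normalized counting, which is why the Viterbi re-estimation is so much cheaper than the full forward-backward update.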
The purpose of this project is to teach you how to realize both a forward-backward and a
Viterbi procedure for estimating parameters of Hidden Markov Models. To that end you
are asked to read in each of the four training sets, and to program HMM re-estimation
algorithms using both a forward-backward approach and a Viterbi approach, and to
compare and contrast the two procedures in terms of speed, accuracy, sensitivity to initial
estimates, computational issues (taking logs of zero-valued quantities), and any other
issue that arises during the course of this project.
Once you read in each of the training sets, familiarize yourself with the data sequence
which is of the form data(nex,T) where nex is 50 and T is 100. This data sequence is the
observed set of outputs of the model, but, of course, there are no observed states as the
states are hidden. Actually the array states(nex,T) has the generator state data but this
cannot be used in any manner in your simulations. You should also remind yourself that
the duration of each training sequence (n=1,2,…,nex) is specified in the array
duration(1:nex). (For the ergodic models that we will be using, the duration of each
sequence is T=100. For the left-right models the duration is variable and is specified in
duration(1:nex).)
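For reading the training sets in Python, something like the following helper could be used; it assumes the mat files hold arrays named `data`, `states`, and `duration` as described above (the exact layout inside each file may differ, so treat this as a sketch).

```python
import numpy as np

def load_training_set(mat_dict):
    """Split the data(nex, T) array into per-sequence lists using the
    duration array.  `mat_dict` is the dict returned by
    scipy.io.loadmat; the field names are taken from the handout."""
    data = np.asarray(mat_dict["data"], dtype=int)
    duration = np.asarray(mat_dict["duration"]).ravel().astype(int)
    # mat_dict["states"] holds the generator states, but those must
    # NOT be used during training -- the states are hidden
    return [data[n, :duration[n]] for n in range(data.shape[0])]
```

Typical usage would be `sequences = load_training_set(scipy.io.loadmat("hmm_observations_ergodic_random.mat"))`, after which each element of `sequences` is one observation sequence of its own duration.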
Once the training data has been read in, the first step you should do is score the training
set on the generator model by running one full iteration of the forward-backward routine
(which you have to write), and by noting the sum of the negative logs of the scaling
sequences. This likelihood score represents a target value for your actual re-estimation
routine, a target that you will sometimes surpass slightly (why is this the case?) and
will mostly miss due to local optima or bad initial starts.
The next step is to create full forward-backward estimation and HMM model re-
estimation routines, as well as a Viterbi estimation and HMM model re-estimation
routine. Once these routines have been debugged, you can begin playing with the four
training sets and determining how well you can estimate the model parameters. It is an
interesting exercise to use the actual HMM densities (rather than random estimates) to
determine how the total log likelihood score varies as you re-estimate the densities
starting from the “ideal conditions”; however this exercise is just to show how good a
solution can be obtained if you can get past local minima of the re-estimation routines.
In summary, you are asked to build two sets of HMM re-estimation routines, one based
on the forward-backward algorithm, one based on the Viterbi state sequence, and to
determine how well your algorithms work on two types of models, namely an ergodic
model with 5 states and 5 possible observations in each state, and a left-right model with
5 states and 5 possible observations in each state. You should consider using several
random starts to determine the best solution for each model and each training set. You
are also given two versions of each model, one with randomly chosen values for the state
prior density, the state transition matrix and the state observation matrix, and one with
skewed values for the state prior density and the state transition matrix. Hence you
should experiment with each of these training sets to understand which training methods
work best, and why. Finally, you should investigate the effects of using highly
constrained models (such as the left-right model) on the state prior density estimation and
the state transition matrix estimation.
1. for cases when the converged likelihood is comparable to the generator likelihood,
how do the model densities match those of the generator?
2. for cases when the converged likelihood is comparable to the generator likelihood,
how do the model states match those of the generator?
3. for cases when the converged likelihood is smaller than the generator likelihood (i.e.,
convergence at a local optimum rather than the global optimum), what is happening with
the states and how do the resulting models compare to the generator model?
4. how does the speed of convergence compare for the forward-backward re-estimation
and the Viterbi re-estimation methods?
5. how do the computational requirements for the forward-backward and Viterbi methods
compare?
6. how do the likelihood scores of the generator models compare when using forward-
backward scoring with those when using Viterbi scoring? What accounts for these
differences, and how do they compare for random and skewed training sequences?
7. what do you think would be the effect of fewer training sequences (you can try this out
by using fewer than the nex=50 sequences supplied)? What do you think would be the
effect of more training sequences?
If you are successful in making the re-estimation methods work properly and efficiently,
you might want to build your own HMM sequence generator and experiment with
creating a range of model observations and see how well your re-estimation routines
work on this new data. It is especially interesting to investigate the effect of increased
numbers of observation sequences to determine how many are needed to make the
convergence more rapid and less sensitive to the initial estimates of model parameters.
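The optional sequence generator suggested above can be sketched as a short sampling routine; this assumes a discrete HMM with the same `pi`, `A`, `B` array conventions used throughout, and fixed-duration sequences as in the ergodic training sets.

```python
import numpy as np

def generate_sequences(pi, A, B, nex, T, rng=None):
    """Sample nex observation sequences of length T from a discrete
    HMM.  Returns the observations and the (normally hidden) states,
    mirroring the data/states arrays in the supplied training sets."""
    rng = rng or np.random.default_rng()
    Q, K = B.shape
    obs = np.zeros((nex, T), dtype=int)
    states = np.zeros((nex, T), dtype=int)
    for n in range(nex):
        q = rng.choice(Q, p=pi)                  # draw initial state
        for t in range(T):
            states[n, t] = q
            obs[n, t] = rng.choice(K, p=B[q])    # emit a symbol
            q = rng.choice(Q, p=A[q])            # transition
    return obs, states
```

With such a generator you can vary nex and T freely, which makes it easy to study how the number and length of training sequences affect convergence speed and sensitivity to the initial parameter estimates.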