Isolated-Word Speech Recognition Using Hidden Markov Models

Håkon Sandsmark

December 18, 2010
1 Introduction
Speech recognition is a challenging problem on which much work has been done over the last decades. Some of the most successful results have been obtained with hidden Markov models, as described by Rabiner in 1989 [1].
A well-working generic speech recognizer would enable more efficient communication for everybody, but especially for children, illiterate people, and people with disabilities. A speech recognizer could also be a subsystem in a speech-to-speech translator.
The speech recognition system implemented during this project trains one hidden
Markov model for each word that it should be able to recognize. The models are trained
with labeled training data, and the classification is performed by passing the features to
each model and then selecting the best match.
Figure 1: Flow chart of the system: speech goes through feature extraction, the features are passed to one hidden Markov model per word (apple, banana, kiwi, lime, orange, peach, pineapple), and the best match is selected by the classification step.
2 Background theory
2.1 Hidden Markov models
Basic knowledge of hidden Markov models is assumed, but the two most important
algorithms used in this project will be described.
The observable output from a hidden state is assumed to be generated by a multivariate Gaussian distribution, so there is one mean vector and one covariance matrix per state. We will also assume that the state transition probabilities are independent of time, such that the hidden Markov chain is homogeneous.
We will now define the notation for describing a hidden Markov model as used in this project. There is a total number of N states. An element a_{ss'} of the transition probability matrix A denotes the transition probability from state s to state s', and the probability for the chain to start in state s is π_s. The mean vector and covariance matrix for the multivariate Gaussian distribution modeling the observable output from state s are µ_s and Σ_s, respectively. For an observation o, b_s(o) denotes the probability density of the multivariate Gaussian distribution of state s evaluated at o. We will sometimes denote the collection of parameters describing the hidden Markov model as λ = {A, π, µ, Σ}.
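As a concrete illustration, b_s(o) can be evaluated directly from µ_s and Σ_s. A minimal Matlab sketch (the function name is our own; the published implementation may organize this differently):

    function p = emission_density(o, mu, Sigma)
    % Evaluate b_s(o): the multivariate Gaussian density with mean
    % vector mu and covariance matrix Sigma at the observation o
    % (column vectors of the same dimension).
    d = numel(o);
    dev = o - mu;
    p = exp(-0.5 * (dev' / Sigma) * dev) / sqrt((2*pi)^d * det(Sigma));
    end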
Classification requires the likelihood f(o_1, …, o_T; λ), which can be obtained by summing f(o_1, …, o_T, S_T = s_T; λ) over all s_T. The recursive structure is revealed as we reduce the problem from needing f(o_1, …, o_T, s_T; λ) for all s_T to needing f(o_1, …, o_{T−1}, s_{T−1}; λ) for all s_{T−1}. Let us introduce the forward variable to ease the notation.
α_1(s) ≡ f(o_1, S_1 = s; λ) = b_s(o_1) π_s        (6, 7)

α_t(s) ≡ f(o_1, …, o_t, S_t = s; λ) = b_s(o_t) Σ_{s'} a_{s's} α_{t−1}(s')        (8, 9)
Implemented naïvely top-down (backwards in time), this recursion would take exponential time, since the same forward variables would be recomputed over and over. The naïve algorithm is, however, easily converted to an efficient variant using dynamic programming, where we calculate the forward variables bottom-up (forwards in time): we simply calculate α_t(s) for all states s, first for t = 1 and then all the way up to T. This way, all the forward variables from the previous time step are readily available when needed.
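A minimal Matlab sketch of this bottom-up computation, assuming the emission densities have been precomputed into a T×N matrix B with B(t, s) = b_s(o_t); the scaling of α that is needed in practice to avoid numerical underflow is omitted:

    function alpha = forward(A, prior, B)
    % Forward algorithm by dynamic programming:
    % alpha(t,s) = f(o_1, ..., o_t, S_t = s; lambda).
    % A(s,s') is the transition matrix, prior(s) the initial
    % distribution pi, and B(t,s) = b_s(o_t) the emission densities.
    [T, N] = size(B);
    alpha = zeros(T, N);
    alpha(1, :) = prior(:)' .* B(1, :);                 % equation (7)
    for t = 2:T
        alpha(t, :) = (alpha(t-1, :) * A) .* B(t, :);   % equation (9)
    end
    end

The likelihood of the whole observation sequence is then obtained as sum(alpha(T, :)).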
The parameters λ are estimated with the Baum-Welch algorithm, an instance of expectation maximization (EM) in which the M-step updates the parameters from expected state-occupancy and state-transition counts. The E-step thus consists of calculating these expectations for a fixed λ. Let V_s^(t) denote the event of a transition from state s at time step t, and V_{s,s'}^(t) the event of a transition from s to s' at time step t. Then we calculate the expectations by using indicator functions and the linearity of expectation.
π_s = E{1[V_s^(1)]} = P(V_s^(1))        (15)

a_{ss'} = E{Σ_t 1[V_{s,s'}^(t)]} / E{Σ_t 1[V_s^(t)]} = Σ_t P(V_{s,s'}^(t)) / Σ_t P(V_s^(t))        (16)

µ_s = E{Σ_t 1[V_s^(t)] o_t} / E{Σ_t 1[V_s^(t)]} = Σ_t P(V_s^(t)) o_t / Σ_t P(V_s^(t))        (17)

Σ_s = E{Σ_t 1[V_s^(t)] (o_t o_t^T − µ_s µ_s^T)} / E{Σ_t 1[V_s^(t)]} = Σ_t P(V_s^(t)) o_t o_t^T / Σ_t P(V_s^(t)) − µ_s µ_s^T        (18)
Note that the superscript T denotes the transpose and has nothing to do with the number of time steps T. To be able to calculate these probabilities, we first introduce the backward variable, which is very similar to the forward variable defined previously.
β_T(s) ≡ 1        (19)

β_t(s) ≡ f(o_{t+1}, …, o_T | S_t = s; λ) = Σ_{s'} a_{ss'} b_{s'}(o_{t+1}) β_{t+1}(s')        (20, 21)
The backward variable has its name because it is first calculated for the last time step
and then backwards in time when implemented with dynamic programming (essentially
the reverse procedure of the one described in detail for the forward variable).
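A matching Matlab sketch of the backward recursion, with the same conventions (and the same caveat about scaling) as the forward function above:

    function beta = backward(A, B)
    % Backward algorithm by dynamic programming:
    % beta(t,s) = f(o_{t+1}, ..., o_T | S_t = s; lambda).
    [T, N] = size(B);
    beta = zeros(T, N);
    beta(T, :) = 1;                                         % equation (19)
    for t = T-1:-1:1
        beta(t, :) = (A * (B(t+1, :) .* beta(t+1, :))')';   % equation (21)
    end
    end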
We then rename these probabilities to the symbols used by Rabiner, γ_t(s) ≡ P(V_s^(t)) and ξ_t(s, s') ≡ P(V_{s,s'}^(t)), and express them by the forward and backward variables as

γ_t(s) = α_t(s) β_t(s) / f(o_1, …, o_T; λ)

ξ_t(s, s') = α_t(s) a_{ss'} b_{s'}(o_{t+1}) β_{t+1}(s') / f(o_1, …, o_T; λ)

The re-estimation formulas then become:
π_s = γ_1(s)        (28)

a_{ss'} = Σ_t ξ_t(s, s') / Σ_t γ_t(s)        (29)

µ_s = Σ_t γ_t(s) o_t / Σ_t γ_t(s)        (30)

Σ_s = Σ_t γ_t(s) o_t o_t^T / Σ_t γ_t(s) − µ_s µ_s^T        (31)
To summarize, the E-step boils down to computing γ_t(s) and ξ_t(s, s') for all s, s' and t while the parameters λ are held fixed, and the M-step then updates λ using the quantities computed in the E-step. This is iterated until the likelihood stops improving appreciably or a fixed number of iterations has been performed.
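Put together, one EM iteration could look as follows in Matlab, given alpha and beta from the two recursions above and the emission matrix B. This is a sketch under the same no-scaling caveat; obs is the T×d matrix of observed feature vectors, and the implicit expansion of recent Matlab versions is assumed:

    % E-step: state and transition posteriors.
    likelihood = sum(alpha(T, :));                 % f(o_1, ..., o_T; lambda)
    gamma = alpha .* beta / likelihood;            % gamma(t,s) = P(V_s^(t))
    xi = zeros(N, N, T-1);                         % xi(s,s',t) = P(V_{s,s'}^(t))
    for t = 1:T-1
        xi(:, :, t) = A .* (alpha(t, :)' * (B(t+1, :) .* beta(t+1, :))) / likelihood;
    end

    % M-step: re-estimation according to equations (28)-(31).
    prior = gamma(1, :)';                                    % equation (28)
    A = sum(xi, 3) ./ sum(gamma(1:T-1, :), 1)';              % equation (29)
    d = size(obs, 2);
    mu = zeros(d, N);
    Sigma = zeros(d, d, N);
    for s = 1:N
        w = gamma(:, s) / sum(gamma(:, s));                  % normalized weights
        mu(:, s) = obs' * w;                                 % equation (30)
        dev = obs' - mu(:, s);
        Sigma(:, :, s) = (dev .* w') * dev';                 % equation (31), centered form
    end

The covariance update uses the centered form Σ_t γ_t(s)(o_t − µ_s)(o_t − µ_s)^T / Σ_t γ_t(s), which is algebraically equivalent to equation (31).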
3 System design
3.1 Feature extraction
The source speech is sampled at 8000 Hz and quantized with 16 bits. The signal is split into short frames of 80 samples, corresponding to 10 ms of speech, and neighboring frames overlap by 20 samples on each side. The idea is that the speech is close to stationary during such a short period of time because of the relatively limited flexibility of the throat. We will pick our features from the frequency domain, but before getting there by taking the fast Fourier transform, we multiply each frame by a Hamming window to reduce the spectral leakage caused by the framing of the signal.
Figure 2: (a) Speech signal and Hamming window in the time domain. (b) Single-sided magnitude spectrum of the same speech signal multiplied by the Hamming window.
The D largest local maxima of the single-sided magnitude spectrum are picked as features for each frame; D is an important parameter of the system that will be discussed later.
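A Matlab sketch of the extraction for a single frame; the function name is our own, and findpeaks (from the Signal Processing Toolbox) is one way to locate the local maxima, not necessarily the one used in the published implementation:

    function f = frame_features(frame, fs, D)
    % Frequencies (in Hz) of the D largest local maxima in the
    % single-sided magnitude spectrum of one Hamming-windowed frame.
    n = numel(frame);
    spectrum = abs(fft(frame(:) .* hamming(n)));
    half = spectrum(1:floor(n/2) + 1);           % single-sided spectrum
    [~, locs] = findpeaks(half, 'SortStr', 'descend', 'NPeaks', D);
    f = (locs - 1) * fs / n;                     % FFT bin index -> Hz
    end

With fs = 8000 and 80-sample frames, the frequency resolution is 8000/80 = 100 Hz per bin.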
Figure 3: (a) Speech signal and Hamming window in the time domain. (b) Single-sided magnitude spectrum of the same speech signal multiplied by the Hamming window.
3.2 Training
The training is a combination of supervised and unsupervised techniques. We train one hidden Markov model per word on speech signals that have already been labeled with the correct word. One important choice is the number of states in each model; the goal is that each state should represent a phoneme of the word. The clustering of the Gaussians is, however, unsupervised and will depend on the initial values used for the Baum-Welch algorithm.

For this project, random values, normalized so that they form valid probability distributions, were used to initialize A and π. For Σ_s, the diagonal covariance matrix of the training data was used for all states, and for each state a randomly chosen training data point was used as µ_s. The training examples for each word are concatenated, and Baum-Welch is run for 15 iterations.
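In Matlab, this initialization might look as follows (a sketch; the variable names are our own, and implicit expansion is assumed again):

    % obs is the T-by-d matrix of concatenated training frames for one word.
    [T, d] = size(obs);
    A = rand(N, N);      A = A ./ sum(A, 2);      % random rows that sum to one
    prior = rand(N, 1);  prior = prior / sum(prior);
    Sigma = repmat(diag(var(obs)), [1, 1, N]);    % diagonal data covariance for every state
    mu = obs(randi(T, [N, 1]), :)';               % one random training frame per state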
3.3 Classification
Let λ_i denote the parameter set for word i. When presented with an observation sequence o_1, …, o_T, the word

î = argmax_i f(o_1, …, o_T; λ_i)

is selected, and we recognize that f(o_1, …, o_T; λ_i) is exactly what the forward algorithm computes.
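With the forward function from section 2, the classifier reduces to a few lines. In this sketch, models is a struct array holding the trained parameters of each word, and emission_matrix is a hypothetical helper that evaluates B(t, s) = b_s(o_t) for all frames and states:

    function best = classify(obs, models)
    % Select the word whose model maximizes f(o_1, ..., o_T; lambda_i).
    scores = zeros(numel(models), 1);
    for i = 1:numel(models)
        m = models(i);
        B = emission_matrix(obs, m.mu, m.Sigma);  % B(t,s) = b_s(o_t)
        alpha = forward(m.A, m.prior, B);
        scores(i) = sum(alpha(end, :));           % the likelihood for word i
    end
    [~, best] = max(scores);
    end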
[Four scatter plots, titled ‘Training apple’, ‘Training lime’, ‘Training peach’ and ‘Training orange’, with F1 [Hz] on the horizontal axes and F2 [Hz] on the vertical axes.]
Figure 4: Fitted Gaussians after ten iterations of the Baum-Welch algorithm. We have six states with one Gaussian each. The two most dominant frequencies (features) are shown. Each green plus represents a frame from a training speech signal. The stars are the means of the Gaussians, and the ellipses indicate their 75% confidence regions. Notice the higher frequencies present in the words containing unvoiced phonemes (‘peach’ and ‘orange’) compared to the words that do not (‘apple’ and ‘lime’).
4 Results

The system was evaluated with five-fold cross-validation on the 105 recorded utterances. The two parameters expected to affect the misclassification rate the most were the number of hidden states, N, and the number of frequencies extracted from each frame, D. The cross-validation was therefore run with different values for these parameters, and the results are shown in Table 1.

N \ D      2       3       4       5       6       7       8
  2      21.9%   8.6%
  3      21.0%  15.2%   9.5%  12.4%   1.9%  14.3%   5.7%
  4      16.2%  11.4%   8.6%   5.7%   3.8%   6.7%   4.8%
  5      13.3%   8.6%   9.5%   4.8%   2.9%   5.7%   4.8%
  6      12.4%  10.5%   3.8%   5.7%   7.6%   6.7%  10.5%
  7      15.2%  12.4%   6.7%  10.5%   7.6%   2.9%   8.6%
  8      12.4%   5.7%

Table 1: Misclassification rates for five-fold cross-validation with different values for the number of hidden states, N, and the number of frequencies extracted from each frame, D. Each five-fold cross-validation procedure takes about 7 minutes with the 105 utterances on a 2 GHz Intel Core 2 Duo (serial execution).
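The cross-validation loop itself is straightforward. A sketch of one run for a given (N, D) pair, where feats is a cell array holding one feature matrix per utterance, and train_word_models stands for the training routine described in section 3.2 (the names are our own):

    n = numel(labels);                           % 105 labeled utterances
    fold = mod(randperm(n), 5) + 1;              % random assignment to five folds
    errors = 0;
    for k = 1:5
        models = train_word_models(feats(fold ~= k), labels(fold ~= k));
        for j = find(fold == k)
            errors = errors + (classify(feats{j}, models) ~= labels(j));
        end
    end
    misclassification_rate = errors / n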
5 Discussion
The results are quite good considering the simple approach taken, especially in the feature extraction phase. More advanced features such as Mel-frequency cepstral coefficients were considered, but we decided on simple spectral peak frequencies due to the low misclassification rates achieved. It should be noted that this system would not perform well if trained and tested with different speakers. This is because of the differing frequency characteristics of different voices, especially for speakers of different genders.
We also experimented with increasing the number of training iterations for the Baum-Welch algorithm, including setting a threshold on the likelihood difference between steps. That, however, proved to have little benefit in practice; neither the execution time nor the misclassification rate showed any noticeable improvement over simply fixing the number of iterations to 15. The reason the execution time did not improve significantly is that most of it is spent in feature extraction, not in training.
It is also interesting to note that when N is too small, there are many ‘apple’s
misclassified as ‘pineapple’s, and vice versa, due to the loss of temporal information.
Another important parameter is the number of samples in each frame. If the frame is too small, it becomes hard to pick out meaningful features, and if it is too large, temporal information is lost. However, due to time constraints, we did not test anything other than 80 samples for this project.
The concatenation of the training examples trains a probability of transitioning from the ‘last’ state to the ‘initial’ state that is not needed for classification. Rabiner [1] gives a modified Baum-Welch algorithm for multiple training examples such that concatenation is not necessary, but it was not implemented during this project since the concatenation seemed to work well.
6 Conclusion
The Matlab implementation along with the data set is published as open source and
can be found at http://code.google.com/p/hmm-speech-recognition/.
7 References
[1] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in
speech recognition,” Proceedings of the IEEE, vol. 77, pp. 257–286, Feb 1989.
[2] C. M. Bishop, Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 2006; corrected 2nd printing, October 2007.
[3] X. Huang, A. Acero, and H.-W. Hon, Spoken Language Processing: A Guide to
Theory, Algorithm and System Development. Prentice Hall PTR, May 2001.