Automatic Speech Recognition 2
JIA Pei
Email: [email protected]
First Edition
Abstract
ASR has been researched and applied as a human-computer interface (HCI) for over thirty
years [6]. Normally, in order to apply speech recognition, an entire system
needs to be developed from scratch. Fortunately, several well-known open source
speech recognition systems are available on the Internet, such as HTK developed
by Cambridge University [38, 39] (note that HTK has license restrictions),
Julius maintained by Kyoto University [16], the ISIP speech recognition
environment by Mississippi State University [7, 23], and CMU Sphinx [18, 30].
Modern general-purpose speech recognition systems, including all four of the
above open source ASR systems, are generally based on Mel Frequency Cepstral
Coefficients (MFCCs) for audio signal representation and Hidden Markov
Models (HMMs) for modeling the stochastic speech process.
The HMM was first described in a series of statistical papers by Baum [3]
and other authors in the 1960s. Speech recognition research based
on HMMs started in the mid-1970s; a representative ASR system of that time is
Dragon [1, 2]. After that, variants of the HMM were explored, such as discrete
HMMs [19], semicontinuous HMMs [13–15] and continuous HMMs [24]. In
1989, Rabiner reviewed HMM theory and summarized several problems in
speech recognition that could be modeled by HMMs [25]. All of the above HMM-related
research is summarized in the online tutorial “Ten years of HMMs” [5].
In fact, from the viewpoint of pattern recognition and machine learning,
the HMM, as well as the Kalman filter, another famous model from control theory,
can be viewed as an example of a Dynamic Bayesian Network (DBN) [10]. Murphy
reviewed this point in his PhD thesis [21] and stated that HMMs have limited
“expressive power”, whereas DBNs generalize HMMs by allowing the state space
to be represented in factored form rather than as a single discrete random
variable. Therefore, DBNs, also known as directed graphical models [4], should
be a more general tool for speech recognition. Indeed, as early as 1995,
Zweig had already applied DBNs to speech recognition in his PhD thesis [40].
The overall architecture of the CMU Sphinx4 speech recognition system is
shown in figure 1.
The following section elaborates how an entire ASR system is generally
constructed.
Figure 1: Sphinx4 ASR Framework [30]
• The frequency range of sound that humans can hear is between 20 and
20,000 Hz. According to the Nyquist theorem, in order to analyze any
audible sound (a 1D signal), a sample rate of at least 2 × 20,000 Hz is
necessary. The most common sample rate for speech/music is Sr = 44,100 Hz,
which comfortably satisfies the Nyquist criterion. Such a high sample rate
is not required for speech stored in audio files; in fact, in the above
speech recognition OSSs, Sr is usually chosen as 8,000 Hz, 16,000 Hz or
32,000 Hz.
• According to [37], “the voiced speech of a typical adult male will have a fun-
damental frequency of from 85 to 155 Hz, and that of a typical adult
female from 165 to 255 Hz”. Therefore, although we can hear sounds
below 85 Hz, we cannot produce them with our voice. In order to
analyze the lowest-frequency harmonic component of a voice that can be
produced by a human, the window size should be at least Sr / 85 samples long.
• The window size should be narrow enough that the speech articulators
do not significantly change in that window.
• In order to ease the later Fourier transform, it is better to make the window
size a power of two, i.e. $2^N$, $N = 1, 2, \cdots$
• Suitable window sizes to discretize the speech signal are therefore 128, 256,
512, 1,024 and 2,048 samples.
Given this speech discretization, it is easy to understand why adjacent
frames (windows) partially overlap. A Hamming window applied to the speech
signal attenuates the signal at both window edges, so the signal near the
edges is not well analyzed within that frame. Therefore, we recenter the
sample window in the next frame so that it partially overlaps the current
one, allowing the entire speech signal to be analyzed without loss.
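A minimal sketch of this framing and windowing step is given below, assuming a frame length of 256 samples with 50% overlap; both values are illustrative examples, since the text only requires a power-of-two length of at least Sr / 85.

import numpy as np

def frame_signal(signal, frame_len=256, hop=128):
    """Split a 1-D signal into overlapping frames and apply a Hamming window.

    frame_len and hop (= frame_len // 2, i.e. 50% overlap) are example values.
    Assumes len(signal) >= frame_len.
    """
    window = np.hamming(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.empty((n_frames, frame_len))
    for i in range(n_frames):
        start = i * hop
        frames[i] = signal[start:start + frame_len] * window
    return frames

With Sr = 16,000 Hz, a frame length of 256 samples corresponds to a 16 ms window and satisfies 256 ≥ 16,000 / 85 ≈ 188.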
0.1.2 Frontend
Several canonical feature extraction methods have been proposed as the speech
recognition frontend, including MFCC, LPC, RASTA-PLP [11, 12]
and TECC [8]. Numerous variants of these audio processing methods have
appeared, e.g. removing preemphasis, adding liftering, or weighting the filter
banks in ways other than the triangular scheme. Here, we only explain MFCC in
detail.
Preprocessing
We first denote the raw wave data as $s_n$, $n = 0, 1, \cdots$. The audio signal
preprocessing is summarized in several steps, among them the pre-emphasis
sketched below and the framing and Hamming windowing described above.
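As a minimal sketch of the pre-emphasis step, assuming the conventional first-order filter $s'_n = s_n - \alpha\, s_{n-1}$; the coefficient $\alpha = 0.97$ is a common default and an assumption here, not a value specified in this document.

import numpy as np

def preemphasize(s, alpha=0.97):
    """Pre-emphasis filter s'_n = s_n - alpha * s_{n-1}.

    alpha = 0.97 is a conventional value (assumption, not from the text)."""
    s = np.asarray(s, dtype=float)
    return np.append(s[0], s[1:] - alpha * s[:-1])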
MFCC
After preprocessing, MFCCs are calculated in the following substeps.
Figure 2: Hamming Window Function [32]
where
$$
b_u(f_k) =
\begin{cases}
0 & \text{if } f_k < p_{u-1} \text{ or } f_k > p_{u+1} \\[4pt]
\dfrac{f_k - p_{u-1}}{p_u - p_{u-1}} & \text{if } p_{u-1} \le f_k < p_u \\[4pt]
\dfrac{p_{u+1} - f_k}{p_{u+1} - p_u} & \text{if } p_u \le f_k \le p_{u+1}
\end{cases}
\qquad u = 1, 2, \cdots, M \qquad (7)
$$
If the concerned frequency range is limited to between 0 and 8,000 Hz, and 19 fil-
ters are selected to compose the filter bank, namely $p_0 = 0$,
$p_{M+1} = 8{,}000$ and $M = 19$, the frequency edges of the triangular filter
banks are easily calculated as the vector p = (0.0, 94.0, 200.6,
321.6, 458.7, 614.3, 790.8, 991.0, 1218.1, 1475.6, 1767.8, 2099.2, 2475.1,
2901.4, 3385.0, 3933.6, 4555.8, 5261.5, 6062.0, 6970.0, 8000.0), as shown
in figure 5.

Figure 3: (a) Original Signal
It is reported in [20] that “for this speech/music classification problem,
the results are (statistically) significantly better if Mel-based cepstral
features rather than linear-based cepstral features are used.” Here, in
our application, we simply take this conclusion as given.
3. Take the logarithm of the M mel bins obtained above:
$$B'_u = \ln(B_u), \qquad u = 1, 2, \cdots, M \qquad (8)$$
4. Take the Discrete Cosine Transform (DCT) [17] of the list of mel bins,
as if it were a signal:
$$C(u) = \sum_{m=1}^{M} B'_m \cos\!\left(\frac{\pi (2m-1)\,u}{2M}\right), \qquad u = 1, 2, \cdots, M \qquad (9)$$
The MFCCs are the amplitudes of the resulting spectrum C(u), and they are
used as the extracted features for both training and recognition.
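The substeps above can be condensed into a rough sketch: the triangular weights follow Eq. (7), the logarithm Eq. (8), and the DCT Eq. (9). The function names, the small constant added before taking the logarithm, and the use of NumPy are illustrative assumptions; the edge vector p is assumed to be given in Hz, as in the example above.

import numpy as np

def triangular_weights(p, freqs):
    """Weights b_u(f_k) of Eq. (7) for M = len(p) - 2 triangular filters.

    p     -- edge frequencies p_0 .. p_{M+1} in Hz (e.g. the vector above)
    freqs -- DFT bin centre frequencies f_k in Hz
    """
    M = len(p) - 2
    B = np.zeros((M, len(freqs)))
    for u in range(1, M + 1):
        lo, mid, hi = p[u - 1], p[u], p[u + 1]
        rising = (freqs >= lo) & (freqs < mid)
        falling = (freqs >= mid) & (freqs <= hi)
        B[u - 1, rising] = (freqs[rising] - lo) / (mid - lo)
        B[u - 1, falling] = (hi - freqs[falling]) / (hi - mid)
    return B

def mfcc_frame(frame, sr, p):
    """MFCCs of one windowed frame, following Eqs. (7)-(9)."""
    spectrum = np.abs(np.fft.rfft(frame))               # |DFT| of the frame
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)     # bin frequencies f_k
    mel_bins = triangular_weights(p, freqs) @ spectrum  # B_u, via Eq. (7)
    log_bins = np.log(mel_bins + 1e-10)                 # B'_u, Eq. (8); epsilon avoids log(0)
    M = len(log_bins)
    m = np.arange(1, M + 1)
    # C(u) = sum_m B'_m cos(pi (2m - 1) u / (2M)),  Eq. (9)
    return np.array([np.sum(log_bins * np.cos(np.pi * (2 * m - 1) * u / (2 * M)))
                     for u in range(1, M + 1)])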
Figure 4: Mel-Hz Plot with Evenly Distributed Mel Slots
Figure 6: MFCC ((a) Absolute Values of DFT, (c) Mel, (e) DCT)
2. The frequency-to-mel-scale mapping is not one to one. Generally, MFCC is an
energy compaction method, which packs the energy of many frequency bins
into a small number of mel bins. During voice reconstruction, a
pseudo-inverse is therefore calculated so that the energy of the small
number of mel bins can be roughly redistributed over the larger number of
frequency bins.
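A minimal sketch of this redistribution step, assuming the same triangular filter-bank matrix as in the MFCC sketch above; the Moore-Penrose pseudo-inverse is one straightforward way to realize it, and the clipping of negative values is an added assumption.

import numpy as np

def mel_to_spectrum(mel_bins, B):
    """Roughly redistribute M mel-bin energies back to the frequency bins
    using the Moore-Penrose pseudo-inverse of the filter-bank matrix B
    (shape M x n_freq_bins). Negative values are clipped, since magnitudes
    cannot be negative."""
    approx = np.linalg.pinv(B) @ mel_bins
    return np.clip(approx, 0.0, None)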
Finally, the real and imaginary parts generated from $W_k$ are amplitude-
modulated (AM) by the sequence $\|S_k\|$.
Voice Reconstruction
Figure 8 shows the entire process of recovering a single audio frame
from the DCT parameters only.
As we can see, although (c) in figure 8 is quite similar to (b) in figure 6,
there are big differences between (d) in figure 8 and (d) in figure 3. That
means that, for a single audio frame, the signal cannot be reconstructed well.
However, comparing (g) or (h) in figure 8 with (a) in figure 3, we may conclude
that, viewed over the long term, the signal can be reconstructed reasonably
well, leaving aside the amplitude infidelity.

Figure 7: Used Filter Bank Weights
Two methods were finally used to join the recovered audio frames into a
whole signal. The first (refer to (g) in figure 8) ignores the overlapped
part of each audio frame; the second (refer to (h) in figure 8) averages
the overlapped parts of each pair of neighbouring frames. Experiments show
that the first method gives better results.
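The two joining strategies might be sketched as follows, assuming the frames were produced with a hop of half the frame length as in the framing sketch above; the function names and the exact handling of the last frame are illustrative choices.

import numpy as np

def join_drop_overlap(frames, hop):
    """Method 1: keep only the first `hop` samples of each frame, i.e.
    ignore the overlapped part (reported above to give better results)."""
    return np.concatenate([f[:hop] for f in frames])

def join_average_overlap(frames, hop):
    """Method 2: average the overlapped halves of neighbouring frames."""
    frame_len = len(frames[0])
    out = np.zeros(hop * (len(frames) - 1) + frame_len)
    count = np.zeros_like(out)
    for i, f in enumerate(frames):
        out[i * hop:i * hop + frame_len] += f
        count[i * hop:i * hop + frame_len] += 1
    return out / np.maximum(count, 1)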
Figure 8: (a) Inverse DCT to Obtain Mel Log, (e) Inverse Hamming to Obtain Preemphasis Data
0.1.5 Language Model
Language models contain a list of words and their probabilities of occurrence in
a given sequence. Unlike acoustic models, which provide phoneme-level speech
structure, language models provide word-level language structure. Language
models typically fall into two categories: graph-driven models [28, 29] and N-
Gram models [27, 31]. Graph-driven models can be viewed as a first-order Markov
model, in which a word's probability depends only on the previous word,
while N-Gram models can be viewed as an (n − 1)-th order Markov model, in which
a word's probability is estimated from the previous n − 1 words.
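As a small illustration of the N-Gram idea, a bigram (N = 2) model estimated by simple relative frequencies is sketched below; any practical language model would additionally apply smoothing, which is omitted here, and the sentence markers are illustrative conventions.

from collections import defaultdict

def train_bigram(sentences):
    """Estimate P(w_i | w_{i-1}) by relative frequency from a list of
    tokenised sentences (no smoothing; purely illustrative)."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            counts[prev][word] += 1
    return {prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for prev, nxt in counts.items()}

# Example: P("recognition" | "speech") from a toy two-sentence corpus
model = train_bigram([["speech", "recognition", "works"],
                      ["speech", "recognition", "is", "hard"]])
print(model["speech"]["recognition"])   # 1.0 in this toy corpus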
0.1.6 Decoder
Recognizing all the words of a sentence in context is obviously a much more
complicated task than recognizing isolated words. If we simply consider
recognizing isolated words, then the decoder can be seen as a search
engine, which essentially looks for the single word in the dictionary whose
pronunciation is most similar to what the user has uttered.
Speech recognition thus ultimately becomes a graph search problem: finding
the most likely phoneme sequence. Generally speaking, there are many
categories of search algorithms (refer to [36]): brute-force or
exhaustive search, heuristic search including A* search, breadth-first search,
depth-first search, etc. Since our acoustic model is based on HMMs, the Viterbi
search algorithm, which is specific to HMMs, will be adopted in our experiments.
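A compact sketch of the Viterbi algorithm over a discrete HMM is given below; the log-domain formulation and the parameter layout (initial distribution, transition matrix, emission matrix) are generic textbook choices rather than the internals of any of the toolkits mentioned above.

import numpy as np

def viterbi(obs, log_pi, log_A, log_E):
    """Most likely state sequence for an observation sequence `obs`.

    log_pi -- log initial state probabilities, shape (S,)
    log_A  -- log transition probabilities, shape (S, S)
    log_E  -- log emission probabilities, shape (S, n_symbols)
    """
    S, T = len(log_pi), len(obs)
    delta = np.full((T, S), -np.inf)    # best log-score ending in each state
    back = np.zeros((T, S), dtype=int)  # backpointers to the best predecessor
    delta[0] = log_pi + log_E[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A            # (S, S): prev -> cur
        back[t] = np.argmax(scores, axis=0)               # best predecessor per state
        delta[t] = scores[back[t], np.arange(S)] + log_E[:, obs[t]]
    path = [int(np.argmax(delta[-1]))]
    for t in range(T - 1, 0, -1):                         # trace backpointers
        path.append(int(back[t, path[-1]]))
    return path[::-1]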
• WSJ – Wall Street Journal [9], which refers to a large speech data set
(or “corpus”) suitable for training the acoustic model. WSJ is
read by many adult male and female speakers of American English.
The dataset is available online from the LDC (Linguistic Data Consortium);
summation of Gaussians. A single Gaussian is defined by a mean and
a variance, or, in the case of a multidimensional Gaussian, by a mean
vector and a covariance matrix, or, under some simplifying assump-
tions, a variance vector. Here, a mixture of 8 Gaussian distributions is
used in this acoustic model;
• 16k – the training speech files are sampled at the rate of 16kHz;
• 40mel – 40 Mel-scale filters are used, suited to the 16 kHz audio. To convert
f hertz into m mels, $m = 1127.01048 \log_e(1 + f/700)$; and the inverse,
$f = 700\,(e^{m/1127.01048} - 1)$ (see the conversion sketch after this list);
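The two conversion formulas above can be transcribed directly; the filter_edges helper below is an illustrative way to obtain evenly spaced mel slots and map them back to Hz, which should roughly reproduce the edge vector p listed earlier (the exact constants used by a given toolkit may differ slightly).

import numpy as np

def hz_to_mel(f):
    """m = 1127.01048 * ln(1 + f / 700)."""
    return 1127.01048 * np.log(1.0 + f / 700.0)

def mel_to_hz(m):
    """f = 700 * (exp(m / 1127.01048) - 1)."""
    return 700.0 * (np.exp(m / 1127.01048) - 1.0)

def filter_edges(f_min, f_max, n_filters):
    """n_filters + 2 edge frequencies, evenly spaced on the mel scale."""
    mels = np.linspace(hz_to_mel(f_min), hz_to_mel(f_max), n_filters + 2)
    return mel_to_hz(mels)

# e.g. filter_edges(0, 8000, 19) roughly reproduces the vector p given earlier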
Bibliography
[10] Z. Ghahramani. Learning Dynamic Bayesian Networks. In Adaptive Processing
of Sequences and Data Structures, Lecture Notes in Artificial Intelligence,
pages 168–197. Springer-Verlag, Berlin, 1998.
[17] Syed Ali Khayam. The discrete cosine transform (dct): Theory and
application. Technical report, Department of Electrical & Computer
Engineering, Michigan State University, March 10th 2003.
[21] K. Murphy. Dynamic Bayesian Networks: Representation, Inference
and Learning. PhD thesis, UC Berkeley, Computer Science Division,
July 2002.
[23] Joe Picone and the Staff at ISIP. Fundamentals of Speech Recognition:
A Tutorial Based on a Public Domain C++ Toolkit. Institute for Signal
and Information Processing (ISIP), Mississippi State University, August
15 2002.
[28] J.-P. Vert and M. Kanehisa. Graph-driven features extraction from mi-
croarray data using diffusion kernels and kernel CCA, December 9–14, 2002.
[31] Wikipedia. N-gram — wikipedia, the free encyclopedia, 2007. [Online;
accessed 28-December-2007].
[38] S. Young. The HTK hidden Markov model toolkit: Design and philosophy.
Technical Report TR.153, Department of Engineering, Cambridge University,
1994.