Tutorial On Speech Recognition
Alex Acero
Microsoft Research
Outline
ASR Overview:
Signal Processing
Hidden Markov Models Fundamentals
HMMs in Speech Recognition
HMM Training
Language Models
[Block diagram: Signal Processing feeds Pattern Matching, which consults a Template Dictionary]
Signal Processing
Signal Acquisition
Speech is captured by a microphone.
The analog signal is converted to a digital signal by sampling it 16,000 times per second.
Each sample is quantized to 16 bits, a number between -32768 and 32767.
[Diagram: A/D converter turning the waveform into sample values such as ..., 14, 1617, -3456, ...]
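As a minimal sketch of this step (not from the slides), the Python below reads a 16 kHz, 16-bit mono WAV file and prints the first few quantized samples; the file name speech.wav is a placeholder:

    import wave
    import struct

    # Open a recording; assume 16 kHz, 16-bit, mono (file name is hypothetical).
    with wave.open("speech.wav", "rb") as wav:
        assert wav.getframerate() == 16000   # 16,000 samples per second
        assert wav.getsampwidth() == 2       # 16 bits: values in [-32768, 32767]
        assert wav.getnchannels() == 1
        n = wav.getnframes()
        raw = wav.readframes(n)

    # Unpack the little-endian 16-bit integers produced by the A/D converter.
    samples = struct.unpack("<%dh" % n, raw)
    print(samples[:10])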
Spectral Analysis
Take a window of 20-30 ms (320 to 480 samples).
An FFT (Fast Fourier Transform) is used to compute the energy in several frequency bands, typically 20 to 40.
The cepstrum is then computed, yielding 12 or so coefficients.
A frame, a snapshot of the input's spectrum, is represented by a vector of 12 cepstrum coefficients.
To describe how the spectrum changes over time, one frame is computed every 10 ms or so, i.e. 100 frames per second.
A word that lasts 1 second is therefore described by 1200 numbers (100 frames x 12 coefficients), as in the sketch below.
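As a rough sketch of this front end (one possible implementation, not the slides' exact recipe), the NumPy function below turns one 20-30 ms frame into 12 cepstrum coefficients; it uses a crude linearly spaced filter bank where a real system would use mel-spaced bands:

    import numpy as np

    def frame_to_cepstrum(frame, n_bands=24, n_ceps=12):
        # frame: 320-480 samples (20-30 ms at 16 kHz).
        windowed = frame * np.hamming(len(frame))
        power = np.abs(np.fft.rfft(windowed)) ** 2          # power spectrum

        # Pool FFT bins into n_bands energies (typically 20 to 40 bands).
        bands = np.array([b.sum() for b in np.array_split(power, n_bands)])

        # Cepstrum: DCT of the log band energies, keeping n_ceps coefficients.
        log_e = np.log(bands + 1e-10)
        m = np.arange(n_bands)
        return np.array([np.sum(log_e * np.cos(np.pi * k * (m + 0.5) / n_bands))
                         for k in range(1, n_ceps + 1)])

Computing one such vector every 10 ms yields the 100 frames per second described above.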
HMM Fundamentals
An HMM is defined by:
Number of states
Transition probabilities a_ij
Output probabilities b_i(x)
[Figure: left-to-right HMM topology with self-loops a_11 ... a_55, forward transitions a_12, a_23, a_34, a_45, and skip transitions a_13, a_24, a_35]
The output distribution of state i is a mixture of K Gaussians,

b_i(x) = \sum_{k=0}^{K-1} c_{ik} \, N(x, \mu_{ik}, \Sigma_{ik})

where, for a p-dimensional frame x and diagonal covariance \Sigma,

N(x, \mu, \Sigma) = \frac{1}{(2\pi)^{p/2} \prod_{m=0}^{p-1} \sigma_m} \exp\left\{ -\sum_{m=0}^{p-1} \frac{(x_m - \mu_m)^2}{2 \sigma_m^2} \right\}
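As a small illustration (the parameter arrays are placeholders, not trained values), evaluating b_i(x) for one state with diagonal covariances:

    import numpy as np

    def output_prob(x, weights, means, variances):
        # x: (p,) frame; weights: (K,); means, variances: (K, p).
        # b_i(x) = sum_k c_ik * N(x, mu_ik, Sigma_ik), diagonal Sigma.
        log_norm = -0.5 * np.sum(np.log(2 * np.pi * variances), axis=1)
        log_expo = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
        return np.sum(weights * np.exp(log_norm + log_expo))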
Probability of a path
The probability of a path through the model is the product of the transition probabilities a_ij and the output probabilities b_j(x_t) along it.
[Figure: trellis of HMM states (1-5, vertical axis) against input frames (horizontal axis), with arcs labeled by the transition probabilities a_ij]
Probability of a model
Defined as the probability of the best path.
Problem: we need to evaluate all possible paths, and these grow exponentially with the number of states and input frames.
Solution: the Viterbi algorithm has complexity that grows only linearly with the number of states and input frames, as sketched below.
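A compact sketch of Viterbi over the trellis, in log space to avoid underflow (the model inputs are stand-ins for trained parameters, and any start state is allowed for simplicity):

    import numpy as np

    def viterbi(log_a, log_b):
        # log_a: (N, N) log transition probs; log_b: (T, N) log output probs.
        T, N = log_b.shape
        delta = np.full((T, N), -np.inf)
        back = np.zeros((T, N), dtype=int)
        delta[0] = log_b[0]                          # any start state
        for t in range(1, T):
            scores = delta[t - 1][:, None] + log_a   # (prev, next)
            back[t] = np.argmax(scores, axis=0)
            delta[t] = scores[back[t], np.arange(N)] + log_b[t]
        path = [int(np.argmax(delta[-1]))]           # trace the best path back
        for t in range(T - 1, 0, -1):
            path.append(int(back[t][path[-1]]))
        return path[::-1], float(delta[-1].max())

Each frame only updates N states from their predecessors, which is what keeps the cost from growing exponentially.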
Large vocabulary
Problem: words are built from context-dependent (triphone) phone models, and the number of distinct triphones to train grows very large; e.g. the word "this" (DH IH S) uses DH(SIL,IH), IH(DH,S), S(IH,SIL), where X(L,R) denotes phone X with left context L and right context R.
Continuous speech
[Figure: decoding network for continuous speech, chaining phone models with optional SIL (silence) models between words, e.g. SIL DH IH S SIL, IH Z SIL, EH SIL, ...]
Baum-Welch Algorithm
The forward probability

\alpha_t(i) = P(x_1, x_2, \ldots, x_t, s_t = i \mid \lambda)

is computed with the recursion

\alpha_{t+1}(j) = \left[ \sum_{i=1}^{N} \alpha_t(i) \, a_{ij} \right] b_j(x_{t+1})

which yields the exact solution

P(X \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)
[Figure: forward recursion; states s_1, s_2, ..., s_N at time t feed state s_j at time t+1 through transitions a_{1j}, a_{2j}, ..., a_{Nj}]
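A direct NumPy transcription of this recursion (assuming, for illustration, a uniform initial state distribution; real systems also rescale alpha per frame to avoid underflow):

    import numpy as np

    def forward(a, b):
        # a: (N, N) transition probs a_ij; b: (T, N) output probs b_j(x_t).
        T, N = b.shape
        alpha = np.zeros((T, N))
        alpha[0] = b[0] / N                    # uniform initial distribution
        for t in range(T - 1):
            alpha[t + 1] = (alpha[t] @ a) * b[t + 1]
        return alpha[-1].sum()                 # P(X | model)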
HMM Training
Given the frames x_{ij}, i = 0, \ldots, 4, aligned to state j, its output distribution is estimated as

\mu_j = \frac{1}{5} \sum_{i=0}^{4} x_{ij} \qquad \sigma_j^2 = \frac{1}{5} \sum_{i=0}^{4} (x_{ij} - \mu_j)^2
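A sketch of these per-state estimates (in a real trainer the frame-to-state alignment would itself come from Viterbi or Baum-Welch):

    import numpy as np

    def estimate_state(frames):
        # frames: (n, p) cepstral vectors aligned to one state.
        mu = frames.mean(axis=0)
        var = ((frames - mu) ** 2).mean(axis=0)   # diagonal variance
        return mu, var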
Rule of thumb: each state has to appear in the training database at least 10 times.
Speaker-independent systems have lots of data (models are well trained), but must cover high variability across speakers.
Speaker-dependent systems do not have as much data (models are not as well trained), but the data is more consistent.
Bigram
P(W_1) P(W_2 | W_1) P(W_3 | W_2) \cdots P(W_N | W_{N-1})
Trigram
P(W_1) P(W_2 | W_1) P(W_3 | W_1 W_2) \cdots P(W_N | W_{N-2} W_{N-1})
Trigrams
P(THIS IS A TEST) =
P(THIS) P(IS|THIS) P(A|THIS IS) P(TEST|IS A)
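A toy illustration of this factorization (the probability table is invented for the example):

    # Hypothetical trigram probabilities P(w | u, v); values are made up.
    trigram = {
        (None, None, "THIS"): 0.1,
        (None, "THIS", "IS"): 0.5,
        ("THIS", "IS", "A"): 0.3,
        ("IS", "A", "TEST"): 0.2,
    }

    def sentence_prob(words):
        p = 1.0
        u, v = None, None
        for w in words:
            p *= trigram[(u, v, w)]
            u, v = v, w
        return p

    print(sentence_prob(["THIS", "IS", "A", "TEST"]))   # 0.1 * 0.5 * 0.3 * 0.2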
N-Gram Performance
Error-rate multiplier relative to a trigram model:

    Uniform    x5.0
    Unigrams   x3.0
    Bigrams    x1.5
    Trigrams   x1.0
Perplexity
Perplexity measures the average branching of a text when presented to a language model:

PP = 2^{LP} = P(W_1, W_2, \ldots, W_n)^{-1/n}

An empirical observation is E \approx k \sqrt{P}, where E is the error rate and P the perplexity.
A language model with low perplexity is good for ASR.
There is a tradeoff between perplexity and coverage.
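A minimal sketch of the perplexity computation (the per-word probabilities are placeholders a language model would supply):

    import math

    def perplexity(word_probs):
        # word_probs: P(W_i | history) for each of the n words of the text.
        n = len(word_probs)
        lp = -sum(math.log2(p) for p in word_probs) / n   # bits per word
        return 2 ** lp                                    # PP = 2^LP

    print(perplexity([0.1, 0.5, 0.3, 0.2]))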
[Figure: error-rate multiplier versus task constraint; a highly constrained task such as Telephone Digits sits near x1.0, while less constrained conditions (Unconstrained Task, Noisy Environment, Large Vocabulary, Speaker Variance) approach x5.0]
Tools
Low level (offers more control, but requires more work): HTK (Entropic).
High level (offers less control, but is easier to use): MS SAPI,
http://www.research.microsoft.com/research/srg/
Interface NLP/ASR
Summary
Current ASR has many limitations.
Nonetheless, it is possible to build useful systems with ASR.
The application developer needs to design the dialogue carefully.
It is hard to integrate NLP and speech; research is needed here.