Tutorial On Speech Recognition: Alex Acero Microsoft Research

HMMs are commonly used for speech recognition, modeling phonemes as states. Training involves segmenting a speech database with HMMs and updating model parameters. Language models help constrain recognition by providing word probabilities. While error rates are still high for open domains, constrained systems can achieve low error rates through limited vocabularies, grammars, and contexts. Real applications require robustness to noise, non-vocabulary words, and barge-in capabilities. Tight integration of NLP and ASR remains a challenge.

Tutorial on Speech Recognition

Alex Acero
Microsoft Research

ACL Workshops Tutorial


Interactive Spoken Dialog Systems:
Bringing Speech and NLP Together in Real Applications

Outline

ASR Overview:
Signal Processing.
Hidden Markov Models Fundamentals.
HMMs in Speech Recognition.
HMM Training.
Language Models.
Tools and Commercial Systems.
Summary.

Speech Recognition Basics

[Block diagram: the input signal goes through Signal Processing, then Pattern Matching against a Template Dictionary, producing the recognized word (e.g. "One").]
Signal Processing

Signal Acquisition
Speech is captured by a microphone.
The analog signal is converted to a digital signal by sampling it 16,000 times per second.
Each value is quantized to 16 bits, a number between -32768 and 32767.

[Diagram: A/D converter producing a stream of sample values, ..., 14, 1617, -3456, ...]
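As a sketch of the quantization step described above (the function name and the clipping convention are illustrative, not from the tutorial):

```python
def quantize_16bit(sample: float) -> int:
    """Map an analog amplitude in [-1.0, 1.0) to a signed 16-bit integer,
    clipping anything outside the representable range [-32768, 32767]."""
    value = int(sample * 32768)
    return max(-32768, min(32767, value))

# One second of speech at 16 kHz yields 16000 such values, e.g.:
samples = [quantize_16bit(s) for s in (0.0004, 0.0493, -0.1055)]
print(samples)  # [13, 1615, -3457]
```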

Spectral Analysis
Take a window of 20-30 ms (320 to 480 samples).
An FFT (Fast Fourier Transform) is used to compute the energy in several frequency bands, typically 20 to 40.
The cepstrum is computed, giving 12 or so coefficients.

Spectral Analysis
A frame, a snapshot of the input's spectrum, is represented by a vector of 12 cepstrum coefficients.
To describe how the spectrum changes with time, one frame is computed every 10 ms or so, i.e. 100 frames per second.
A word that lasts 1 second is thus described by 1200 numbers.
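The frame arithmetic above can be checked with a short sketch (the constants follow the slide's typical values; the 25 ms window is one choice within the 20-30 ms range):

```python
SAMPLE_RATE = 16000      # samples per second
FRAME_SHIFT_MS = 10      # one frame every 10 ms
WINDOW_MS = 25           # 20-30 ms analysis window
NUM_CEPSTRA = 12         # cepstrum coefficients per frame

def frame_counts(duration_s: float):
    """Number of frames, total coefficients, and window length (in samples)
    for an utterance of the given duration."""
    frames = int(duration_s * 1000 / FRAME_SHIFT_MS)   # 100 frames per second
    window_samples = SAMPLE_RATE * WINDOW_MS // 1000   # 400 samples at 25 ms
    return frames, frames * NUM_CEPSTRA, window_samples

frames, coeffs, win = frame_counts(1.0)
print(frames, coeffs, win)  # 100 frames, 1200 numbers, 400-sample window
```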

HMM Fundamentals

Isolated digit recognition

10 templates: one template M_i per digit.
Compare the input x with all templates.
Select the most similar template j:

    j = argmin_i f(x, M_i)

where f is the comparison function.
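A minimal sketch of this template-matching rule (the toy one-dimensional templates and the distance function are illustrative):

```python
def recognize(x, templates, distance):
    """Return the index j of the template most similar to input x,
    i.e. j = argmin_i f(x, M_i) for comparison function f."""
    return min(range(len(templates)), key=lambda i: distance(x, templates[i]))

# Toy example: scalar "templates" for three digits, absolute-difference distance.
templates = [0.1, 0.5, 0.9]
dist = lambda x, m: abs(x - m)
print(recognize(0.45, templates, dist))  # 1 (closest to template 0.5)
```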

Hidden Markov Models (HMM)

An HMM is defined by:

Number of states.
Transition probabilities a_ij.
Output probabilities b_i(x), modeled as a mixture of K Gaussians:

    b_i(x) = sum_{k=0}^{K-1} c_ik N(x; mu_ik, sigma_ik)

where each diagonal-covariance Gaussian over p-dimensional frames is

    N(x; mu, sigma) = (2 pi)^(-p/2) (prod_{m=0}^{p-1} sigma_m)^(-1)
                      exp{ - sum_{m=0}^{p-1} (x_m - mu_m)^2 / (2 sigma_m^2) }

[Diagram: a five-state left-to-right HMM with self-loops a_11..a_55, forward transitions a_12, a_23, a_34, a_45, and skip transitions a_13, a_24, a_35.]

Probability of a path

[Trellis diagram: states 1-5 on the vertical axis, input frames 1-18 on the horizontal axis; the highlighted path moves through states 1, 2, 4 and 5.]

The probability of a path is the product of the output and transition probabilities along it, e.g.:

P = b_1(x_1) a_11 b_1(x_2) a_12 b_2(x_3) a_22 b_2(x_4) a_22 b_2(x_5) a_24 b_4(x_6)
    a_44 b_4(x_7) a_44 b_4(x_8) a_44 b_4(x_9) a_44 b_4(x_10) a_44 b_4(x_11) a_44 b_4(x_12)
    a_45 b_5(x_13) a_55 b_5(x_14) a_55 b_5(x_15) a_55 b_5(x_16) a_55 b_5(x_17) a_55 b_5(x_18)

Probability of a model
It is the probability of the best path.
Problem: we would need to evaluate all possible paths, and their number grows exponentially with the number of states and input frames.
Solution: the Viterbi algorithm has complexity that grows only linearly with the number of states and input frames.
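A minimal Viterbi sketch in log probabilities, assuming paths start in state 0 (the function and its argument layout are illustrative, not from the tutorial):

```python
import math

def viterbi(obs_logprobs, trans_logprobs):
    """Best-path log probability through an HMM whose paths start in state 0.

    obs_logprobs[t][i]   : log b_i(x_t), output log probability of frame t in state i
    trans_logprobs[i][j] : log a_ij, transition log probability from state i to j
    This dense version is O(T * N^2); for the sparse left-to-right models above,
    each state has only a few predecessors, so the cost grows roughly linearly
    with the number of frames and states.
    """
    n = len(trans_logprobs)
    score = [obs_logprobs[0][0]] + [-math.inf] * (n - 1)
    for t in range(1, len(obs_logprobs)):
        score = [max(score[i] + trans_logprobs[i][j] for i in range(n))
                 + obs_logprobs[t][j]
                 for j in range(n)]
    return max(score)
```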

Large vocabulary

Problem: a huge number of models (60,000 word models for a 60K-word vocabulary):
Models are hard to train.
New words cannot easily be added.
Solution: use phoneme models.

Context Dependent Models

To improve accuracy, phone models are made context-dependent.
Example: THIS = DH IH S

DH(SIL,IH)  IH(DH,S)  S(IH,SIL)

The 50^3 = 125,000 possible triphone models are clustered into about 8,000 generalized triphones.

Continuous speech

Example: THIS IS A TEST

With silences between words:
SIL DH(SIL,IH) IH(DH,S) S(IH,SIL) SIL IH(SIL,Z) Z(IH,SIL) SIL A(SIL,SIL) SIL T(SIL,EH) EH(T,S) S(EH,T) T(S,SIL) SIL

Without silences (cross-word triphones):
SIL DH(SIL,IH) IH(DH,S) S(IH,IH) IH(S,Z) Z(IH,A) A(Z,T) T(A,EH) EH(T,S) S(EH,T) T(S,SIL) SIL

[Diagram: the corresponding chain of phone HMMs, with optional SIL models between words.]
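The triphone naming used above can be generated mechanically; a sketch, following the slide's convention of padding with SIL at the utterance boundaries:

```python
def to_triphones(phones):
    """Expand a phone sequence into context-dependent triphone names,
    conditioning each phone on its left and right neighbor."""
    padded = ["SIL"] + phones + ["SIL"]
    return [f"{p}({l},{r})" for l, p, r in zip(padded, padded[1:], padded[2:])]

print(to_triphones(["DH", "IH", "S"]))
# ['DH(SIL,IH)', 'IH(DH,S)', 'S(IH,SIL)']
```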

Baum-Welch Algorithm

The forward probability alpha_t(i) = P(x_1, x_2, ..., x_t, s_t = i | lambda) is computed recursively:

    alpha_{t+1}(j) = [ sum_{i=1}^{N} alpha_t(i) a_ij ] b_j(x_{t+1})

    P(X | lambda) = sum_{i=1}^{N} alpha_T(i)

This is an exact solution.

[Diagram: states s_1 ... s_N at time t feeding state s_j at time t+1 through transitions a_1j ... a_Nj.]
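The recursion can be sketched directly (assuming an explicit initial-state distribution, which the slide leaves implicit):

```python
def forward(obs_probs, trans, init):
    """Forward recursion: alpha_t(i) = P(x_1..x_t, s_t = i | model).

    obs_probs[t][i] : b_i(x_t), init[i] : P(s_1 = i), trans[i][j] : a_ij.
    Returns P(X | model) = sum over i of alpha_T(i).
    """
    n = len(trans)
    alpha = [init[i] * obs_probs[0][i] for i in range(n)]
    for t in range(1, len(obs_probs)):
        alpha = [sum(alpha[i] * trans[i][j] for i in range(n)) * obs_probs[t][j]
                 for j in range(n)]
    return sum(alpha)
```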

Speeding Search: Pruning

Viterbi algorithm: replace the sum with a max.
Speed-accuracy tradeoff: computation can be reduced by eliminating paths that are not promising, at the expense of a chance of eliminating the best path.
Typically, computation can be reduced by more than a factor of 10 without significantly affecting the error rate.

HMM Training

HMM Training

A lot of speech is needed to train the models. Training is done with an iterative algorithm:

1. Take an initial model (e.g. uniform probabilities).
2. Segment the database (run the Viterbi algorithm).
3. Update the models:

       mu_j = (1/N) sum_{i=0}^{N-1} x_ij

       sigma_j^2 = (1/N) sum_{i=0}^{N-1} (x_ij - mu_j)^2

4. If converged, stop; otherwise go to 2.
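The update step can be sketched for a single one-dimensional state (assuming the usual normalization by the number N of frames the segmentation assigned to the state):

```python
def reestimate(frames_for_state):
    """Update step of iterative (Viterbi-style) training: given the frames
    assigned to a state, recompute its mean and variance.

    mu_j      = (1/N) * sum_i x_ij
    sigma_j^2 = (1/N) * sum_i (x_ij - mu_j)^2
    """
    n = len(frames_for_state)
    mu = sum(frames_for_state) / n
    var = sum((x - mu) ** 2 for x in frames_for_state) / n
    return mu, var

print(reestimate([1.0, 2.0, 3.0]))  # mean 2.0, variance 2/3
```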

HMM Training
Rule of thumb: Each state has to appear in
the training database at least 10 times.
Speaker-Independent systems have lots of
data (models well trained), but contain high
variability.
Speaker-Dependent systems do not have as
much data (models not so well trained) but
they are more consistent.

Language Models in ASR

Bayes Rule in ASR

    P(W | A) = P(A | W) P(W) / P(A)

    W* = argmax_W P(W | A) = argmax_W P(A | W) P(W)

where A is the acoustics and W the sequence of words. P(W) is the language model.
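In practice the argmax is computed with log probabilities; since P(A) is the same for every hypothesis, it can be dropped. A sketch with made-up scores (the hypothesis strings and numbers are illustrative):

```python
def combined_score(log_p_a_given_w, log_p_w):
    """Score a word sequence W by log P(A|W) + log P(W); P(A) is constant
    over hypotheses, so it drops out of the argmax."""
    return log_p_a_given_w + log_p_w

# Hypothetical (acoustic, language) log scores for two competing transcriptions:
hyps = {"recognize speech": (-100.0, -5.0),
        "wreck a nice beach": (-98.0, -12.0)}
best = max(hyps, key=lambda w: combined_score(*hyps[w]))
print(best)  # recognize speech (-105.0 beats -110.0)
```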

Statistical Language Model

    P(W) = P(W_1 W_2 W_3 ... W_N)
         = P(W_1) P(W_2|W_1) P(W_3|W_1 W_2) ... P(W_N|W_1 ... W_{N-1})

Common approximations:

    P(W_1) P(W_2) P(W_3) ... P(W_N)                                 (Unigram)
    P(W_1) P(W_2|W_1) P(W_3|W_2) ... P(W_N|W_{N-1})                 (Bigram)
    P(W_1) P(W_2|W_1) P(W_3|W_1 W_2) ... P(W_N|W_{N-2} W_{N-1})    (Trigram)

Trigrams
P(THIS IS A TEST) =
P(THIS) P(IS|THIS) P(A|THIS IS) P(TEST|IS A)

These statistical language models predict the probability of the current word given the past history.
While simplistic, they contain a lot of information, e.g.:

P(IS|THIS) >> P(IS|HAVE)
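Such probabilities are estimated from counts in a training corpus; a minimal (unsmoothed) bigram sketch, with an illustrative toy corpus:

```python
def bigram_prob(corpus, w_prev, w):
    """Maximum-likelihood estimate P(w | w_prev) = C(w_prev, w) / C(w_prev),
    counted over a list of tokenized sentences. Real systems smooth these counts."""
    pair = sum(sum(1 for a, b in zip(s, s[1:]) if (a, b) == (w_prev, w))
               for s in corpus)
    prev = sum(s[:-1].count(w_prev) for s in corpus)
    return pair / prev if prev else 0.0

corpus = [["THIS", "IS", "A", "TEST"], ["THIS", "IS", "IT"]]
print(bigram_prob(corpus, "THIS", "IS"))  # 1.0
print(bigram_prob(corpus, "IS", "IT"))    # 0.5
```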

N-Gram Performance

[Chart: error-rate multiplier by language model, relative to trigrams (x1.0): bigrams roughly x1.5, unigrams roughly x3.0, and a uniform model roughly x5.0.]

Perplexity
Perplexity measures the average branching of a text when presented to a language model:

    PP = 2^LP = P(W_1, W_2, ..., W_n)^(-1/n)

where LP is the average negative log (base 2) probability per word.
An empirical observation is E ~ k sqrt(P), where E is the error rate and P the perplexity.
A language model with low perplexity is good for ASR.
There is a tradeoff between perplexity and coverage.
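The second form of the definition can be computed directly; a sketch (the example numbers are illustrative):

```python
def perplexity(probability, n_words):
    """PP = P(W_1..W_n) ** (-1/n): the geometric mean of the
    inverse per-word probability, i.e. the average branching factor."""
    return probability ** (-1.0 / n_words)

# If a model assigns each of 10 words probability 1/64, the perplexity is 64:
print(perplexity((1 / 64) ** 10, 10))
```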

Context Free Grammar

Advantages:
Low error rate (because of low perplexity).
Compact.
Easy for application developers.

Disadvantages:
Poor coverage (out-of-language sentences).

ASR in Real Applications

ASR Problem Space

[Chart: error-rate multiplier vs. task constraint, from x1.0 for highly constrained tasks up to x5.0-x10.0 for less constrained ones. Difficulty grows with unconstrained tasks, noisy environments, large vocabularies, and speaker variance.]

ASR Accuracy in the Lab

Command & Control (300 words + CFG): 1% word error rate (speaker-independent)
Discrete Dictation (60K words): 3% word error rate (speaker-dependent)
Continuous Dictation (60K words): 7% word error rate (speaker-dependent)
Telephone Digits: 0.3% word error rate (speaker-independent)

Commercial ASR Systems

Requirements:
Windows 95 or Windows NT.
Pentium processor.
16 MB RAM (command-and-control), 24 MB (discrete dictation), 32 MB (continuous dictation).

For dialogue systems we are mainly interested in C&C (command & control).

Tools
Low level (offers more control, but requires more work): HTK (Entropic).
High level (offers less control, but is easier to use):
Microsoft Speech SDK
Novell's SRAPI
Sun's JSAPI

MS SAPI

http://www.research.microsoft.com/research/srg/

Supports ASR and TTS.
An application can use many engines.
ASR engines supporting SAPI include:
AT&T, Dragon, IBM, Kurzweil, Lernout & Hauspie, Microsoft, OMRON, Teles AG, VCS, VPC.
TTS engines supporting SAPI include:
Acuvoice, AT&T, Centigram, DEC, Elan Informatique, Eloquent, First Byte, Lernout & Hauspie, Microsoft, Telia Promotor, Telefonica.

Command & Control (SAPI)

The application passes the active vocabulary to the ASR engine.
The engine finds pronunciations for the active vocabulary (from a dictionary or LTS rules).
The application passes a CFG grammar.
The engine compiles the CFG grammar and listens for input, returning the top N hypotheses and their probabilities.

C&C Types of Errors

Pronunciation errors make some words unrecognizable.
The user says something outside the vocabulary/grammar.
The user speaks while the machine is talking.
Background noise fires the recognizer (false alarm); rejection needs improvement.
Spontaneous speech (uhms and ahms).

ASR in Real Applications

New word addition.
Rejection.
Barge-in.
Robustness to noise.
ASR is just a part of an application; the user interface is critical.

Interface NLP/ASR

Loose coupling: NLP selects the correct input from a list of top-N candidate sentences:
Easy to implement, but not optimal.
Tight coupling: NLP provides the language probabilities for the search:
Optimal, but hard to implement.

Can an NLP system reduce perplexity?

Summary
Current ASR has many limitations.
Nonetheless, it is possible to build useful systems with ASR.
The application developer needs to carefully design the dialogue.
It is hard to integrate NLP and speech; research is needed here.
