Speech Recognition: Lecture 11: Advanced Topics

This lecture discusses advanced topics in speech recognition including evaluating performance, generating N-best strings from lattices, and discriminative training. It covers measuring accuracy using edit distance between the transcription and reference, computing N-best paths efficiently in weighted graphs, and generating unique top N strings from lattices rather than just paths. Discriminative training is also mentioned.


Speech Recognition

Lecture 11: Advanced Topics.

Mehryar Mohri
Courant Institute of Mathematical Sciences
[email protected]
This Lecture
Speech recognition evaluation
N-best strings algorithms
Lattice generation
Discriminative training

Mehryar Mohri - Speech Recognition page 2 Courant Institute, NYU


Performance Measure
Accuracy: based on the edit-distance between the speech recognition transcription and the reference transcription.

• word or phone accuracy.

• lattice oracle accuracy: edit-distance between the lattice and the reference transcription.

Note: the performance measure does not match the quantity optimized to learn the models.

• word-error rate, lattices.
Word Error Rates

[Table: word error rates for several benchmark tasks; based on the 1998 evaluation.]


Edit-Distance
Definition: minimal cost of a sequence of edit operations transforming one string into another.
Edit operations and costs:

• standard edit-distance definition: insertions, deletions, substitutions, all with the same cost of one.

• general case: more general operations, arbitrary non-negative costs.

Application: measuring word error rate in speech recognition and other string-processing tasks.
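The standard unit-cost case can be sketched directly as the textbook dynamic program (a minimal Python illustration, not the transducer-based method of the later slides; the function names are ours):

```python
def edit_distance(ref, hyp):
    # d[i][j] = edit distance between ref[:i] and hyp[:j]; O(|ref||hyp|) time.
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # i deletions
    for j in range(n + 1):
        d[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[m][n]

def word_error_rate(ref_words, hyp_words):
    # WER: word-level edit distance normalized by the reference length.
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

Applied to word sequences rather than characters, this gives the word error rate used above.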
Local Edits
Edit operations: insertion: ε → a, deletion: a → ε, substitution: a → b (a ≠ b).
Example: 2 insertions, 3 deletions, 1 substitution:

c t t g ε ε a c
ε t a ε g t ε c

This is called an alignment.


Edit-Distance Computation
Standard case: textbook recursive algorithm (Cormen, Leiserson, Rivest, 1992), quadratic complexity, O(|x||y|) for two strings x and y.
General case: (MM, Pereira, and Riley, 2000; MM, 2003)

• construct a tropical-semiring edit-distance transducer T_e with arbitrary edit costs.

• represent x and y by automata X and Y.

• compute the best path of X ∘ T_e ∘ Y.

• complexity quadratic: O(|T_e||X||Y|).
Global Alignment - Example
Example: c(A, G) = 1, c(A, T) = c(G, C) = 0.5, no cost for matching symbols.

Representation: [Figure: edit-distance transducer T_e with arcs A:G/1.0, G:A/1.0, A:T/0.5, T:A/0.5, C:G/0.5, G:C/0.5 and identity arcs A:A/0.0, C:C/0.0, G:G/0.0, T:T/0.0; linear automata X and Y for the strings A G C T and A C C T G.]

echo "A G C T" | farcompilestrings >X.fsm
Global Alignment - Example
Program: fsmcompose X.fsm Te.fsm Y.fsm | fsmbestpath -n 1 >A.fsm

Graphical representation: [Figure: best-path automaton A.fsm, e.g. the alignment A:A/0 G:C/0.5 C:C/0 T:T/0 ε:G/2, with alternatives such as A:A/0 G:ε/2 ε:C/2 C:C/0 T:T/0 ε:G/2.]


Edit-Distance of Automata
Definition: the edit-distance of two automata A and B is the minimum edit-distance of a string accepted by A and a string accepted by B.
Computation:

• best path of A ∘ T_e ∘ B.

• complexity for acyclic automata: O(|T_e||A||B|).

Generality: any weighted transducer over the tropical semiring defines an edit-distance. The edit-distance transducer can be learned using the EM algorithm.
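The lattice oracle accuracy mentioned earlier is a special case: the edit-distance between an acyclic lattice and the reference string. A direct dynamic-programming sketch for the unit-cost case, equivalent to taking the best path of X ∘ T_e ∘ Y (assumes topologically numbered states; function and argument names are ours):

```python
from collections import defaultdict

INF = float("inf")

def lattice_oracle_distance(arcs, start, finals, ref, num_states):
    """Minimum edit distance between any string accepted by an acyclic
    lattice and the reference: the lattice's 'oracle' error.
    arcs: list of (src, dst, label) with src < dst (topological order)."""
    n = len(ref)
    # D[q][j]: min edits to reach state q having consumed ref[:j].
    D = [[INF] * (n + 1) for _ in range(num_states)]
    D[start][0] = 0
    out = defaultdict(list)
    for s, t, a in arcs:
        out[s].append((t, a))
    for q in range(num_states):
        for j in range(n):                      # delete a reference word at q
            D[q][j + 1] = min(D[q][j + 1], D[q][j] + 1)
        for t, a in out[q]:
            for j in range(n + 1):
                if D[q][j] == INF:
                    continue
                # insertion: consume lattice label a, no reference word
                D[t][j] = min(D[t][j], D[q][j] + 1)
                if j < n:                       # match or substitution
                    c = 0 if a == ref[j] else 1
                    D[t][j + 1] = min(D[t][j + 1], D[q][j] + c)
    return min(D[f][n] for f in finals)
```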
This Lecture
Speech recognition evaluation
N-best strings algorithms
Lattice generation
Discriminative training



N-Best Sequences
Motivation: rescoring.

• first pass using simple acoustic and grammar models produces a lattice or N-best list.

• re-evaluate the alternatives with a more sophisticated model or using new information.

General problem:

• speech recognition, handwriting recognition.

• information extraction, image processing.
N-Shortest-Paths Problem
Problem: given a weighted directed graph G, a source state s, and a set of destination or final states F, find the N shortest paths in G from s to F.
Algorithms:

• (Dreyfus, 1969): O(|E| + N log(|E|/|Q|)).

• (MM, 2002): shortest-distance algorithm over the N-tropical semiring.

• (Eppstein, 1998): O(|E| + |Q| log |Q| + N), plus explicit representation of the N best paths: O(|Q|N²).


N-Shortest Strings ≠ N-Shortest Paths
Problem: given a weighted directed graph G, a source state s, and a set of destination or final states F, find the N shortest strings in G from s to F.
Example: NAB Eval 95.

Thresh   Non-Unique   Unique
1.5      8            2
2.0      24           4
2.5      54           4
3.0      1536         48


N-Shortest Paths
Program: fsmprune -c1.5 lat.fsm | farprintstrings -c -iNAB.wordlist

in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required to run -2038.46
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required around -2037.8
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required to run -2037.51
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required to run -2037.42
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required around -2036.85
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required around -2036.76
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required to run -2036.47
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required around -2035.81


N-Shortest Strings
Program: fsmprune -c1.5 lat.fsm | farprintstrings -c -u -iNAB.wordlist

in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required to run -2038.46
in addition the launch of Microsoft corporation's windows ninety five software will mean more memory will be required around -2037.8


Algorithms Based on N-Best Paths
(Chow and Schwartz, 1990; Soong and Huang, 1991)
Idea: use a K-best paths algorithm to generate K ≥ N distinct paths.
Problems:

• K is not known in advance.

• in practice, K may sometimes be quite large, e.g. K ∼ 2N, which affects both time and space complexity.
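The idea above, and its drawback, can be seen in a toy sketch: paths of an acyclic automaton are popped in order of increasing cost and duplicate label sequences are discarded, so the number of paths examined, K, is only known a posteriori (names are ours, not from the cited papers):

```python
import heapq

def n_best_unique_strings(arcs, start, finals, n_best):
    """Enumerate paths of an acyclic automaton in order of increasing cost
    and keep the first n_best distinct label sequences.
    arcs: dict state -> list of (weight, label, next_state).
    Returns (results, k) where k is the number of paths examined."""
    heap = [(0.0, (), start)]       # (cost so far, labels so far, state)
    seen, results, k = set(), [], 0
    while heap and len(results) < n_best:
        c, labels, p = heapq.heappop(heap)
        if p in finals:
            k += 1                  # one complete path examined
            if labels not in seen:  # discard duplicate strings
                seen.add(labels)
                results.append((c, labels))
        for w, a, q in arcs.get(p, []):
            heapq.heappush(heap, (c + w, labels + (a,), q))
    return results, k
```

In the test below, three paths must be examined to find the two best distinct strings, illustrating K > N.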


N-Best String Algorithm
(MM and Riley, 2002)
Idea: apply an N-best paths algorithm to the on-the-fly determinization of the input automaton. But N-best paths algorithms require the shortest distances to F'.
Weighted determinization (partial):

• eliminates redundancy; no determinizability issue.

• on-demand computation: only the part needed is computed.

• on-the-fly computation of the needed shortest distances to final states.
Shortest-Distances to Final States
Definition: let d(q, F) denote the shortest distance from q to the set of final states F in the input (non-deterministic) automaton A, and let d'(q', F') be defined in the same way in the resulting (deterministic) automaton B.
Theorem: for any state q' = {(q_1, w_1), ..., (q_n, w_n)} in B, the following holds:

\[ d'(q', F') = \min_{i=1,\dots,n} \{ w_i + d(q_i, F) \}. \]
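Read as code, the theorem is a one-liner (a sketch with names of our choosing: `subset` holds the (state, residual weight) pairs forming q', and `d` maps each original state to d(q, F)):

```python
def det_shortest_distance(subset, d):
    # d'(q', F') = min_i { w_i + d(q_i, F) } over the pairs (q_i, w_i) in q'.
    return min(w + d[q] for q, w in subset)
```

This is what lets the N-best paths search run on the partially determinized automaton: each new subset state's shortest distance to the final states is available immediately from the precomputed d(q, F).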


Simple N-Shortest-Paths Algorithm
1  for p ← 1 to |Q'| do r[p] ← 0
2  π[(i', 0)] ← NIL
3  S ← {(i', 0)}
4  while S ≠ ∅
5    do (p, c) ← head(S); Dequeue(S)
6       r[p] ← r[p] + 1
7       if (r[p] = N and p ∈ F') then exit
8       if r[p] ≤ N
9         then for each e ∈ E[p]
10          do c' ← c + w[e]
11             π[(n[e], c')] ← (p, c)
12             Enqueue(S, (n[e], c'))
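A runnable version of the pseudocode above, returning the path costs (a Python sketch: the queue is ordered by plain path cost, which is valid for non-negative weights; the algorithm above orders pairs by c + d(p, F') using the shortest distances of the previous slide):

```python
import heapq

def n_shortest_path_costs(arcs, start, finals, n_best):
    """Costs of the N shortest paths from start to a final state.
    arcs: dict state -> list of (weight, next_state), weights >= 0."""
    r = {}                          # r[p]: times state p has been dequeued
    heap = [(0.0, start)]           # queue S of (cost, state) pairs
    results = []
    while heap:
        c, p = heapq.heappop(heap)
        r[p] = r.get(p, 0) + 1
        if p in finals:
            results.append(c)
            if len(results) == n_best:   # r[p] = N and p in F': exit
                break
        if r[p] <= n_best:               # expand each state at most N times
            for w, q in arcs.get(p, []):
                heapq.heappush(heap, (c + w, q))
    return results
```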


N-Best String Alg. - Experiments
NAB 40K Bigram

[Figure: word accuracy vs. real-time factor; curves for 1. 1-Best, 2. 10-Best, 3. 100-Best, 4. 1000-Best, 5. Lattice.]

The additional time to pay for N-best is very small, even for large N.
N-Best String Alg. - Properties
Simplicity and efficiency:

• easy to implement: combines two general algorithms.

• works with any N-best paths algorithm.

• empirically efficient.

Generality:

• arbitrary input automaton (not necessarily acyclic).

• incorporated in the FSM library (fsmbestpath).


This Lecture
Speech recognition evaluation
N-best strings algorithms
Lattice generation
Discriminative training



Speech Recognition Lattices
Definition: weighted automaton representing the speech recognizer's alternative hypotheses.

[Figure: example word lattice with arcs labeled word/weight, e.g. this/16.8, is/22.36, my/63.09, number/34.56, card/20.1.]


Lattice Generation
(Odell, 1995; Ljolje et al., 1999)
Procedure: given a transition e in N, keep in the lattice the transition ((p[e], t'), i[e], o[e], (n[e], t)) with the best start time (p[e], t') during Viterbi decoding.

[Figure: competing transitions from (r, t') and (r, t'') into (q, t); only the one with the best start time is kept.]


Lattice Generation
Computation time: little extra computation over one-best.
Optimization:

• projection on output (words or phonemes).

• epsilon-removal.

• pruning: keeps only transitions and states lying on paths whose total weight is within a threshold of the best path.

• garbage collection (using the same pruning).
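The pruning step can be sketched for an acyclic lattice in the tropical semiring: compute forward and backward shortest distances, then keep only the arcs lying on a path within the threshold of the best path (names are ours; assumes states are topologically numbered so that src < dst):

```python
INF = float("inf")

def prune_lattice(arcs, start, finals, num_states, threshold):
    """Keep arcs lying on a path whose total weight is within `threshold`
    of the best path. arcs: list of (src, dst, label, weight)."""
    alpha = [INF] * num_states       # shortest distance from start
    beta = [INF] * num_states        # shortest distance to a final state
    alpha[start] = 0.0
    for s, t, _, w in sorted(arcs):                  # forward pass
        alpha[t] = min(alpha[t], alpha[s] + w)
    for f in finals:
        beta[f] = 0.0
    for s, t, _, w in sorted(arcs, reverse=True):    # backward pass
        beta[s] = min(beta[s], w + beta[t])
    best = min(alpha[f] for f in finals)
    # an arc survives iff some path through it is within the threshold
    return [(s, t, a, w) for s, t, a, w in arcs
            if alpha[s] + w + beta[t] <= best + threshold]
```

States left without surviving arcs would then be garbage-collected, as noted above.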
Notes
Heuristics: not all paths within the beam are kept in the lattice.
Lattice quality: oracle accuracy, that is, the best accuracy achieved by any path in the lattice.
Optimizations: weighted determinization and minimization.

• in general, dramatic reduction of redundancy and size.

• can behave badly for some lattices, typically uncertain cases.
Speech Recognition Lattice

[Figure: full recognition lattice for an utterance ("which flights leave detroit and arrive at saint petersburg around nine a.m."), with a large number of redundant paths and word/weight arcs.]


Lattice after Determinization
(MM, 1997)

[Figure: the same lattice after weighted determinization; redundant paths are merged and the lattice is dramatically smaller.]


Lattice after Minimization
(MM, 1997)

[Figure: the lattice after weighted minimization; a further reduction in the number of states and transitions.]


This Lecture
Speech recognition evaluation
N-best strings algorithms
Lattice generation
Discriminative training



Discriminative Techniques
Maximum likelihood: parameters are adjusted to increase the joint likelihood of the acoustic observations and CD phone or word sequences, irrespective of the probability of other word hypotheses.
Discriminative techniques: take into account competing word hypotheses and attempt to reduce the probability of the incorrect ones.

• Main problems: computationally expensive; generalization.


Objective Functions
Maximum likelihood (joint):

\[ F = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p_\theta(o_i, w_i). \]

Conditional maximum likelihood (CML):

\[ F = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log p_\theta(w_i \mid o_i) = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log \frac{p_\theta(o_i, w_i)}{p_\theta(o_i)}. \]

Maximum mutual information (MMI/MMIE):

\[ F = \operatorname*{argmax}_{\theta} \sum_{i=1}^{m} \log \frac{p_\theta(o_i, w_i)}{p_\theta(o_i)\, p_\theta(w_i)}. \]

Equivalent to CML when p_\theta(w_i) is independent of \theta.
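The equivalence can be made explicit with a one-step derivation (not from the slides):

\[
\log \frac{p_\theta(o_i, w_i)}{p_\theta(o_i)\, p_\theta(w_i)}
= \log p_\theta(w_i \mid o_i) - \log p_\theta(w_i),
\]

so when the language model \(p_\theta(w_i)\) is fixed, i.e. independent of \(\theta\), the second term is a constant and the MMI and CML objectives have the same maximizers.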
References
• Y. Chow and R. Schwartz. The N-Best Algorithm: An Efficient Procedure for Finding Top N Sentence Hypotheses. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '90), Albuquerque, New Mexico, April 1990, pp. 81-84.
• S. E. Dreyfus. An Appraisal of Some Shortest-Path Algorithms. Operations Research, 17:395-412, 1969.
• David Eppstein. Finding the k Shortest Paths. SIAM Journal on Computing, 28(2):652-673, 1998.
• Andrej Ljolje, Fernando Pereira, and Michael Riley. Efficient General Lattice Generation and Rescoring. In Proceedings of the European Conference on Speech Communication and Technology (Eurospeech '99), Budapest, Hungary, 1999.
• Mehryar Mohri. Finite-State Transducers in Language and Speech Processing. Computational Linguistics, 23:2, 1997.
• Mehryar Mohri. Statistical Natural Language Processing. In M. Lothaire, editor, Applied Combinatorics on Words. Cambridge University Press, 2005.


References
• Mehryar Mohri. Edit-Distance of Weighted Automata: General Definitions and Algorithms. International Journal of Foundations of Computer Science, 14(6):957-982, 2003.
• Mehryar Mohri and Michael Riley. An Efficient Algorithm for the N-Best-Strings Problem. In Proceedings of the International Conference on Spoken Language Processing (ICSLP '02), Denver, Colorado, September 2002.
• Mehryar Mohri, Fernando C. N. Pereira, and Michael Riley. The Design Principles of a Weighted Finite-State Transducer Library. Theoretical Computer Science, 231:17-32, January 2000.
• Julian Odell. The Use of Context in Large Vocabulary Speech Recognition. Ph.D. thesis, Cambridge University, UK, 1995.
• Frank Soong and Eng-Fong Huang. A Tree-Trellis Based Fast Search for Finding the N Best Sentence Hypotheses in Continuous Speech Recognition. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP '91), Toronto, Canada, November 1991, pp. 705-708.