
SILENCE IS GOLDEN: MODELING NON-SPEECH EVENTS IN WFST-BASED DYNAMIC NETWORK DECODERS

David Rybach, Ralf Schlüter, Hermann Ney

Human Language Technology and Pattern Recognition, Computer Science Department,
RWTH Aachen University, 52056 Aachen, Germany
{rybach,schlueter,ney}@cs.rwth-aachen.de

ABSTRACT

Models for silence are a fundamental part of continuous speech recognition systems. Depending on application requirements, audio data segmentation, and the availability of detailed training data annotations, it may be necessary or beneficial to differentiate between other non-speech events, for example breath and background noise. The integration of multiple non-speech models in a WFST-based dynamic network decoder is not straightforward, because these models do not fit perfectly into the transducer framework. This paper describes several options for the transducer construction with multiple non-speech models, shows their considerably different characteristics in memory and runtime efficiency, and analyzes the impact on recognition performance.

Index Terms— LVCSR, WFST

1. INTRODUCTION

Acoustic models for non-speech events like silence and noise are a fundamental part of a speech recognition system. If the silence model does not match the non-speech parts of the signal, the system will produce insertion errors, while a vague silence model may cause deletion errors. Depending on the kind of audio data to be processed and the upstream audio segmentation, different kinds of non-speech events have to be considered, for example breath, laughter, hesitations, or background noise. For some systems it may therefore be beneficial to train separate models for these non-speech events. Such models of course require correspondingly precise annotations in the training data.

The WFST framework offers a clear and consistent way of modeling the parts of a speech recognition system [1]. However, non-speech events do not fit perfectly into this framework, because they are usually not covered by the LM [2]. The search graph construction is implemented by a chain of token sequence expansions, from words down to HMM states. Hence, non-speech models also need to be present at the word level. If the non-speech tokens do not occur as labels in the LM transducer, they cannot appear in the decoder output (unless the decoder implements some less generic special treatment), which is necessary for some applications. Furthermore, the non-speech labeled arcs in the LM transducer require a weight, which is generally not consistent with the rest of the LM [3].

In this paper, we analyze different options for integrating multiple non-speech models in the individual transducers involved, without modifying the generic decoder itself. Some of these options allow for a significant reduction in size of the LM transducer. We use a dynamic network decoder, which integrates the LM on-the-fly as needed during recognition, allowing us to deal with huge vocabularies and complex language models (LM) in a memory efficient way [4]. In contrast to fully expanded static search graphs, the LM transducer is kept separate; thus the reduction in transducer size results in lower memory consumption of the speech recognizer. We also describe specifics of the context dependency transducer construction which improve the runtime efficiency.

The remainder of this paper is organized as follows. Section 2 briefly describes the decoder used. In Section 3 we detail the construction of the individual speech recognition transducers. Section 4 presents the experimental results, followed by conclusions in Section 5.

(This work has been funded in part by the Google Research Awards Program and was partly realized as part of the Quaero Programme, funded by OSEO, French State agency for innovation.)

2. DECODER

In the WFST framework, the LM is represented by a transducer G, L is a phone-to-word transducer derived from the pronunciation dictionary, and C encodes the context dependency of the acoustic models. These transducers are combined by the finite-state operation of composition as C ◦ L ◦ G. The composed transducer has tied HMM labels on the input side and words as output labels. The HMM states are generated dynamically during decoding in our system. In the dynamic network decoder, the composition of (C ◦ L) with G is computed on demand (lazy evaluation) using special composition filters which provide on-the-fly pushing of labels and weights [5]. We use determinized and minimized L and C transducers and perform no further transducer optimizations. Our decoder is based on the OpenFst toolkit [6].
3. TRANSDUCER CONSTRUCTION

The non-speech models need to be considered in all three transducers. In this section, we describe the construction options for the G and L transducers as well as the special treatment of context independent models in the C transducer.

3.1. Language Model

In a G transducer representing a commonly used n-gram LM, the states encode word histories h and the arcs are labeled with words w. The weight of an arc is the LM probability p(w|h). The backing-off structure is implemented by epsilon arcs to a state with reduced history size [7].
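As a concrete illustration of this structure, the sketch below builds a toy backoff bigram G as a plain arc list. It is a sketch only: the function and state names are ours, and a real G transducer stores considerably more (sentence boundaries, symbol tables, higher-order histories).

```python
import math

EPS = "<eps>"

def build_bigram_g(unigrams, bigrams, backoff):
    """unigrams: {w: p(w)}, bigrams: {(h, w): p(w|h)}, backoff: {h: b(h)}.
    States: 'unigram' (empty history) plus one state per history word h.
    Arcs are (src, input, output, weight, dst); weights are -log p."""
    arcs = []
    for (h, w), p in bigrams.items():
        arcs.append((h, w, w, -math.log(p), w))          # p(w|h): history h -> history w
    for w, p in unigrams.items():
        arcs.append(("unigram", w, w, -math.log(p), w))  # p(w) from the empty history
    for h, b in backoff.items():
        arcs.append((h, EPS, EPS, -math.log(b), "unigram"))  # epsilon backoff arc
    return arcs

# Example with made-up probabilities:
g = build_bigram_g({"hello": 0.6, "world": 0.4},
                   {("hello", "world"): 0.9},
                   {"hello": 0.2, "world": 1.0})
```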
Non-speech events are not part of the LM because a) they do not occur in the text data used to estimate the LM, b) their occurrence generally does not depend on the predecessor words, and c) they are not useful for predicting following words. Nevertheless, as mentioned in the introduction, it may be necessary to integrate tokens for non-speech events in the G transducer. As silence and noise can occur at any position in the spoken word sequence, non-speech labeled arcs have to be reachable in G before and after any other arc. The non-speech arcs are constructed as self-loops, because they shall not modify the word history.

Adding loop arcs for all non-speech events to all states in the G transducer allows for the insertion of these tokens without any constraints and without reducing the context size. However, this construction heavily increases the transducer size. An alternative producing significantly fewer arcs is to add non-speech loops only at the initial and the unigram (empty history) state. Thereby, non-speech events remove the history of subsequent words.
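Both loop constructions amount to the same simple operation and differ only in the set of states that receive loops. A minimal sketch on the toy arc-list representation from above (the function name and weight handling are ours; the non-speech weight is the heuristic penalty discussed earlier):

```python
def add_nonspeech_loops(arcs, states, nonspeech_tokens, weight,
                        everywhere=True,
                        initial_state="unigram", unigram_state="unigram"):
    """Add non-speech self-loops at all states, or only at the initial and
    unigram states. `weight` is the non-speech penalty in -log space."""
    loop_states = set(states) if everywhere else {initial_state, unigram_state}
    for s in loop_states:
        for token in nonspeech_tokens:
            arcs.append((s, token, token, weight, s))  # self-loop keeps the history
    return arcs
```

With everywhere=True, the number of added arcs is the number of non-speech tokens times the number of states in G, which is exactly the quantity whose memory cost is examined in Section 4.2.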
3.2. Lexicon

The L transducer has words as output labels and phones as input labels. If the non-speech tokens are included in the G transducer, L integrates the pronunciations for these tokens like any other word.

We can add the non-speech tokens as optional arcs after each word in L. These arcs have epsilon output labels; thus the non-speech tokens do not need to be handled by the G transducer. This construction introduces a limitation on the recognizable sequences: after each word at most one non-speech token can be inserted, but not sequences of different non-speech tokens. The length of the non-speech events is not limited though, as loop transitions are added at the HMM state level. A simple solution for this problem would be to add self-loops at the word end state. However, the computation of the reachable (output) label lookahead, which is used for the composition filter (cf. [5]), requires that all cycles in the transducer contain at least one output label (implementation in OpenFst).

A compromise between constraining the sequences of non-speech events by adding optional arcs in L and reducing the LM context by introducing loop arcs at the unigram state in G is to do both. Thereby, we can insert one non-speech token after each regular word without modifying the LM history, and we can recognize longer sequences of several non-speech tokens by forcing a path through the unigram state in G. This construction is reasonable, because it can be assumed that words following a longer pause do not depend on the preceding words. By adding only arcs for silence in L, we get a model which is very similar to the short silence model described in [2].
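The optional word-end arcs can be sketched in the same toy representation. We assume each word end is a distinct state connected onward to a post-word state; these names and the function are ours, for illustration only:

```python
EPS = "<eps>"

def add_optional_nonspeech(l_arcs, word_end_to_next, nonspeech_phones,
                           weight=0.0):
    """Add one optional arc per non-speech pseudo phone at each word end.
    The output label is epsilon, so G never sees the token. Because the
    arcs are not loops, at most one token fits per word, and no cycle
    without an output label is created (keeping the lookahead valid)."""
    for word_end, next_state in word_end_to_next.items():
        for phone in nonspeech_phones:
            l_arcs.append((word_end, phone, EPS, weight, next_state))
    return l_arcs
```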
3.3. C Transducer

The C transducer is used to transform sequences of context independent (CI) phones, the input labels of the L transducer, into sequences of context dependent (CD) phone models by applying transducer composition as C ◦ L. Therefore, C has CI phones as output labels and the CD units are used as input labels. The states in C encode the CI phone history (2 phones for triphone models). In the construction of an (output) deterministic transducer, as used in our system, the output label is the right context of the triphone model used as input label. This introduces a delay of the CD labels by one symbol [8].

The models for non-speech events are usually context independent. In the C transducer, arcs with a CI pseudo phone (output) label s lead to a state (∗, s) which encodes just s as history. This state has arcs for all phones π with the non-speech model #s# (empty left and right context) as input label and state (#, π) as target. The state (#, π) represents an empty left context. See Figure 1 (a) for an illustration.

[Fig. 1. C transducers with shifted CI phones (a) and un-shifted CI phones (b); a phone sequence P (c); and the composition of the C transducers with P (d, e). s and n are CI non-speech pseudo phones; a, b, c, d, π are CD phones. Triphone models are denoted as a b c with left context a and right context c. Transducer diagrams not reproduced.]

The composed transducer C ◦ L has at word ends a separate state for each phone model with empty right context and for each non-speech model. The non-deterministic outgoing arcs of such a state all have the same input label (the respective non-speech model), as illustrated in Figure 1 (d). In contrast to arcs for CD models, whose input labels depend on the output label, this breakdown into output label dependent arcs is not required for CI models.

The non-determinism can be eliminated by locally un-shifting, or synchronizing, the CI models in the C transducer. As shown in Figure 1 (b), we redirect the arcs with CI model input labels to the initial state (empty history) and replace the output label by epsilon. When composed with L, all arcs with non-speech input labels, whose predecessor states are now deterministic, will merge at one state (see Figure 1 (e)). A similar result can be obtained by applying transducer determinization to C ◦ L. The determinization, however, would require the insertion of disambiguation symbols because of the ambiguous transduction from HMM label strings to word label strings.

The described partially un-shifted construction does not reduce the size of C ◦ L much, but it improves the runtime performance in the case of an acoustic model containing several CI models. During decoding, all state hypotheses for non-speech events at word boundaries will be recombined before being expanded to word initial states. Thereby, the number of active state hypotheses is significantly reduced, and consequently the runtime performance is improved.
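In the toy arc-list representation used above, this local un-shifting is a single pass over the arcs of C. A sketch under the same simplifying assumptions (real C transducers carry richer state information; the names are ours):

```python
EPS = "<eps>"

def unshift_ci_arcs(c_arcs, ci_models, initial_state):
    """Redirect arcs whose input label is a CI (non-speech) model to the
    initial (empty history) state and replace their output label by epsilon,
    following the construction described in the text."""
    out = []
    for (src, ilabel, olabel, weight, dst) in c_arcs:
        if ilabel in ci_models:
            out.append((src, ilabel, EPS, weight, initial_state))
        else:
            out.append((src, ilabel, olabel, weight, dst))
    return out
```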
4. EXPERIMENTAL RESULTS

We analyze the impact of the various options for the transducer construction on the recognition performance in terms of both recognition quality and efficiency.

4.1. Recognition System

We performed experiments on two different tasks: European Parliament Plenary Sessions (EPPS) in English, and data from the Quaero Project in English, which contains broadcast news and unconstrained broadcast conversations. The systems differ in vocabulary size, complexity of the LM, the annotation of non-speech events in the training data, and the audio segmentation. The EPPS test data is segmented automatically, resulting in a considerable amount of non-speech, while the Quaero data has a precise (manual) segmentation.

EPPS English: We use the system as described in [9]. The dictionary contains 53K words with 59K pronunciations, modeled using 45 phones and (unless noted otherwise) 5 non-speech pseudo phones (silence, hesitation, breath, laughter, general noise). The acoustic models (AM) consist of 4500 Gaussian mixtures modeling generalized triphone states with across-word context dependency and using word boundary information. Triphones are modeled by 3-state HMMs, except for silence, which has only one HMM state. The AM used for the first, speaker independent recognition pass consists of 900K densities. A second AM, consisting of 800K densities, estimated using speaker adaptive training and discriminative training (minimum phone error criterion), was used in the second recognition pass, which applies speaker adaptation using fMLLR. We used two 4-gram LMs of different size. The smaller LM contains 7.4M n-grams, the larger one 25.8M n-grams. The test set comprises 644 segments with a total duration of 2.85h and about 27K words in total.

Quaero English: The Quaero ASR system uses 150K words with 180K pronunciations, using the same phoneme set as the EPPS system. The speaker independent AM consists of 1M densities for 4500 Gaussian mixture models. The MFCC features are augmented with phone posterior features. The 4-gram LM contains 50.4M n-grams. We used a simple one-pass decoding strategy for this task. A detailed description of the system is given in [10]. The test set consists of 1482 segments with a total duration of 3.3h and contains about 40K words.

4.2. L and G Transducer

Table 1. Transducer size for both systems. The number of arcs for the G transducer is given with and without non-speech (non-sp.) loop arcs; for C ◦ L, with and without optional non-speech arcs.

system   transducer   states   arcs w/ non-sp.   arcs w/o non-sp.
EPPS     C ◦ L        65.2K    253.4K            214.2K
EPPS     small G      1.8M     18.1M             9.2M
EPPS     large G      6.2M     62.7M             31.9M
Quaero   C ◦ L        164.5K   515.5K            459.6K
Quaero   G            8.3M     100.1M            58.7M

Table 1 illustrates the reduction in transducer size, in particular in the number of arcs in G. The table shows the size of C ◦ L built with and without optional non-speech arcs at word ends, as described in Section 3.2, plus the size of G with and without non-speech loop arcs as described in Section 3.1. The quantity of additional loop arcs added (number of non-speech tokens times number of states in G) is clearly relatively large, especially for complex language models like the larger EPPS LM and the Quaero LM. With an arc size of 16 bytes in memory, the additional arcs allocate around 520MB for the larger EPPS LM and 630MB for the Quaero LM. Considering that the LM often consumes the biggest fraction of the memory of a speech recognition system, the transducer size reduction yields a noticeable decrease in memory requirements in practice. The increase in transducer size for the additional optional non-speech arcs in C ◦ L is comparatively small.
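As a back-of-the-envelope check (our arithmetic, using the arc counts from Table 1 and decimal megabytes), the extra arcs land in the same range as the figures quoted above:

```python
# Rough memory estimate for the extra non-speech loop arcs at 16 bytes/arc;
# arc counts are taken from Table 1, with 1 MB = 1e6 bytes. This yields
# roughly 490 MB (large EPPS LM) and 660 MB (Quaero LM), the same order
# of magnitude as the ~520 MB and ~630 MB quoted in the text.
ARC_BYTES = 16
epps_extra = (62.7e6 - 31.9e6) * ARC_BYTES / 1e6
quaero_extra = (100.1e6 - 58.7e6) * ARC_BYTES / 1e6
print(f"EPPS large LM: {epps_extra:.0f} MB, Quaero LM: {quaero_extra:.0f} MB")
```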
Trading memory for speed or accuracy is easy in many cases. It is therefore interesting how the removal of non-speech arcs in G affects the recognition performance. The results achieved using the various construction options of G and L are shown in Table 2 for EPPS and Table 3 for the Quaero task. The baseline in these tables is a system which has loop arcs at all states in G. If loops are added only to the initial and the unigram states of G, the WER increases significantly. Adding non-speech arcs to L only yields an even worse result, because it is impossible to recognize sequences of different non-speech events this way. The increased number of insertion errors illustrates that. The combination of both options yields results nearest to the baseline, at least for the EPPS system. Adding only optional silence arcs to word ends in L deteriorates the results slightly.

The Quaero system has slightly different characteristics. The duration of non-speech events in the test data is shorter here because of the more precise segmentation. In addition, the non-speech models are less accurate due to less precise training data annotations. Even though the difference is quite small, the best option here is to add only silence instead of all non-speech tokens as optional word end arcs to L.

Table 2. Recognition results for the EPPS task using the small LM (upper part) and the large LM (lower part). Optional non-speech arcs were added in G at all states or only at the initial (i) and the unigram (u) state. Optional non-speech arcs in L were added either for all non-speech events, for silence (sil.) only, or not at all.

         non-sp. arcs      pass 1   pass 2
LM       L      G          WER      WER    sub.   del.   ins.
small    -      all        14.4     12.0   8.2    2.0    1.8
         -      i, u       16.7     14.1   9.7    2.6    1.7
         all    -          17.3     14.6   9.0    2.0    3.6
         all    i, u       14.5     12.1   8.3    2.0    1.8
         sil.   i, u       14.6     12.4   8.5    2.0    1.9
large    -      all        13.8     11.3   7.7    1.9    1.7
         all    i, u       14.0     11.4   7.9    1.8    1.7
         sil.   i, u       14.1     11.7   8.0    1.9    1.8

Table 3. Recognition results for the Quaero task. Optional non-speech arcs were added in G at all states or only at the initial (i) and the unigram (u) state. Optional non-speech arcs in L were added either for all non-speech events, for silence (sil.) only, or not at all.

non-sp. arcs
L       G       WER    sub.   del.   ins.
-       all     22.0   14.8   4.5    2.8
all     -       24.8   15.5   4.1    5.2
all     i, u    22.3   14.9   4.4    3.0
sil.    i, u    22.0   14.5   4.7    2.9

4.3. Non-speech Modeling

Instead of dealing with multiple non-speech events, we can also train just one non-speech model, whose mixture model accounts for the different acoustic realizations. We evaluated this option by comparing systems having one 3-state HMM model for noise, in addition to the 1-state silence model, with systems considering all non-speech tokens as described in Section 4.1. All systems were bootstrapped from the multiple non-speech system used for the experiments in the previous section, which might distort the results a little.

The baseline EPPS system has in total 12 non-speech (tied) state models (including one silence model), while the Quaero system has only 8. The lower number of non-speech models in the Quaero AM is mainly caused by a higher ambiguity in the training data annotations and fewer observations for some of the events. The re-trained AM with one noise model has 4 non-speech HMM state models.

The results in Table 4 show that the EPPS system benefits from differentiated non-speech models, though only slightly. The Quaero system is not affected by pooling the noise models, which is not surprising given the small difference in the models as described above. All systems have noise and silence loop arcs on all G states.

Table 4. Recognition results comparing acoustic models with one noise model vs. 4 noise models, both containing an additional one-state silence model.

system   # noise m.   WER    sub.   del.   ins.
EPPS     4            14.4   9.8    2.7    1.9
EPPS     1            14.8   10.0   2.6    2.1
Quaero   4            22.0   14.7   4.5    2.8
Quaero   1            21.9   14.7   4.4    2.8
4.4. C Transducer

Figure 2 shows the effect of the C transducer construction described in Section 3.3 on the size of the active search space. The plot depicts the number of active state hypotheses (after pruning) in relation to the absolute number of word errors. Due to the earlier recombination of partial hypotheses after non-speech between words, the number of active state hypotheses is reduced by up to 40%.

[Fig. 2. Number of active state hypotheses as a function of the absolute number of word errors for shifted and un-shifted CI labels in C. The corresponding WER is shown at the top of the plot. Plot not reproduced.]

The improvement in runtime efficiency is shown in Figure 3 (measured on a 2.8 GHz Intel Core2). Due to the caching of acoustic scores, the decrease in real-time factor (processing time divided by audio duration) is lower than the reduction in search space size. Nevertheless, the RTF can be improved by up to 20%, which is noticeable in practice.

[Fig. 3. WER vs. real-time factor (RTF) for search networks created with either shifted or un-shifted CI labels in C. The G transducer has loop arcs for all non-speech models. Plot not reproduced.]

The construction also improves the runtime performance of the system with two noise models. The reduction of the active state space is smaller, but still around 20%.
5. CONCLUSION

Whether or not multiple non-speech models improve a speech recognition system depends on the targeted application, the training data, and the pre-processing. In our experiments the gain in recognition quality from including more than one non-speech model (in addition to a silence model) is small, if any.

If multiple non-speech models are present in the AM, however, the construction using loop arcs at the unigram state in the G transducer, combined with optional non-speech arcs at word ends in L, reduces the memory requirement significantly with a minor degradation in recognition accuracy. The adjusted construction of the C transducer decreases the size of the active search space and therefore improves the runtime efficiency of the decoder.

6. REFERENCES

[1] M. Mohri, F. Pereira, and M. Riley, “Speech recognition with weighted finite-state transducers,” in Handbook of Speech Processing, J. Benesty, M. Sondhi, and Y. Huang, Eds. Springer, 2008, ch. 28, pp. 559–582.
[2] P. Garner, “Silence models in weighted finite-state transducers,” in INTERSPEECH, Brisbane, Australia, Sep. 2008, pp. 1817–1820.
[3] C. Allauzen, M. Mohri, B. Roark, and M. Riley, “A generalized construction of integrated speech recognition transducers,” in ICASSP, Montreal, Canada, May 2004, pp. 761–764.
[4] D. Rybach, R. Schlüter, and H. Ney, “A comparative analysis of dynamic network decoding,” in ICASSP, Prague, Czech Republic, May 2011, pp. 5184–5187.
[5] C. Allauzen, M. Riley, and J. Schalkwyk, “Filters for efficient composition of weighted finite-state transducers,” in CIAA, Winnipeg, Canada, Aug. 2010.
[6] C. Allauzen, M. Riley, J. Schalkwyk, W. Skut, and M. Mohri, “OpenFst: a general and efficient weighted finite-state transducer library,” in CIAA, Prague, Czech Republic, Jul. 2007, pp. 11–23.
[7] C. Allauzen, M. Mohri, and B. Roark, “Generalized algorithms for constructing statistical language models,” in ACL, Sapporo, Japan, Jul. 2003, pp. 40–47.
[8] M. Mohri and M. Riley, “Network optimizations for large-vocabulary speech recognition,” Speech Communication, vol. 28, no. 1, pp. 1–12, May 1999.
[9] J. Lööf et al., “The RWTH 2007 TC-STAR evaluation system for European English and Spanish,” in INTERSPEECH, Antwerp, Belgium, Aug. 2007, pp. 2145–2148.
[10] M. Sundermeyer, M. Nussbaum-Thom, S. Wiesler et al., “The RWTH 2010 Quaero ASR evaluation system for English, French, and German,” in ICASSP, Prague, Czech Republic, May 2011, pp. 2212–2215.
