4 22.0 14.7 4.5 2.8
Quaero
1 21.9 14.7 4.4 2.8
4.4. C Transducer
Figure 2 shows the effect of the C transducer construction described
in Section 3.3 on the size of the active search space. The plot depicts
the number of active state hypotheses (after pruning) in relation to
the absolute number of word errors. Due to the earlier recombination
of partial hypotheses after non-speech between words, the number of
active state hypotheses is reduced by up to 40%.

Fig. 3. WER vs. real-time factor (RTF) for search networks created
with either shifted or un-shifted CI labels in C. The G transducer
has loop arcs for all non-speech models.

The improvement in runtime efficiency is shown in Figure 3
(measured on a 2.8 GHz Intel Core2). Due to caching of acoustic
scores, the decrease in real-time factor (processing time divided
by audio duration) is lower than the reduction in search space size.
Nevertheless, the RTF can be improved by up to 20%, which is
noticeable in practice.

The construction also improves the runtime performance of the
system with two noise models. The reduction of the active state
space is smaller, but still around 20%.

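The real-time factor used above is a simple ratio, and the quoted improvements are relative reductions. The following minimal sketch only illustrates that arithmetic; the function names and all numbers in it are hypothetical and are not measurements from the experiments.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF as defined in the text: processing time divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return processing_seconds / audio_seconds


def relative_reduction(baseline: float, adjusted: float) -> float:
    """Fractional decrease of `adjusted` relative to `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - adjusted) / baseline


# Hypothetical timings for one hour of audio, chosen only for illustration.
audio = 3600.0
rtf_baseline = real_time_factor(3600.0, audio)
rtf_adjusted = real_time_factor(2880.0, audio)
rtf_gain = relative_reduction(rtf_baseline, rtf_adjusted)

# Hypothetical average counts of active state hypotheses per frame.
states_gain = relative_reduction(10000.0, 6000.0)

print(f"RTF: {rtf_baseline:.2f} -> {rtf_adjusted:.2f} ({rtf_gain:.0%} lower)")
print(f"active states: {states_gain:.0%} lower")
```

With these made-up inputs the search space shrinks more than the RTF does, which mirrors the qualitative observation in the text that cached acoustic scores keep the runtime gain below the search space reduction.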
5. CONCLUSION

Whether or not multiple non-speech models improve a speech
recognition system depends on the targeted application, the training data,
and the pre-processing. In our experiments the gain in recognition
quality from including more than one non-speech model (in addition
to a silence model) is small, if any.

If multiple non-speech models are present in the AM, however,
the construction using loop arcs at the unigram state in the G
transducer, combined with optional non-speech arcs at word ends in L,
reduces the memory requirement significantly with a minor
degradation in recognition accuracy. The adjusted construction of the C
transducer decreases the size of the active search space and therefore
improves the runtime efficiency of the decoder.