Table 3: Data augmentation results on the HUB5 2000 evaluation set

Acoustic model       SWB WER, %    CH WER, %
SDBN-DNN             12.1          23.3
SDBN-DNN + augm      11.8 (-0.3)   22.5 (-0.8)
BLSTM                11.1          20.9
BLSTM + augm         10.8 (-0.3)   20.4 (-0.5)

Table 5: Perplexity results on the train, valid and test data

Language model       PPL train   PPL valid   PPL test
4-gram (baseline)    66.366      62.946      87.039
RNNLM                57.982      78.578      76.123
LSTM-LM (medium)     51.104      58.964      56.822
LSTM-LM (large)      46.033      54.821      52.892
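As a reminder of what Table 5 reports, perplexity is the exponentiated average negative log-probability the language model assigns to each test token. A minimal sketch of the computation (the per-token log-probabilities are assumed to come from whichever model is being evaluated):

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a test set, given per-token natural-log probabilities
    assigned by a language model: exp(-1/N * sum(log p(w_i | history)))."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# e.g. a model assigning each of 3 tokens probability 0.1 has perplexity 10
assert abs(perplexity([math.log(0.1)] * 3) - 10.0) < 1e-9
```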
Here $h_t^l, c_t^l, i_t^l, f_t^l, o_t^l, g_t^l \in \mathbb{R}^n$ denote the hidden state, the memory cell state and the activations of the input gate, forget gate, output gate and input modulation gate in layer $l \in [1, L]$ at time $t$, respectively; $h_t^0 \in \mathbb{R}^n$ is the input word vector at time $t$; $T_{2n,4n}\colon \mathbb{R}^{2n} \to \mathbb{R}^{4n}$ is a linear transform with a bias; $D$ is the dropout operator that sets a random subset of its argument to zero; the symbol $\odot$ denotes element-wise multiplication. The logistic (sigm) and hyperbolic tangent (tanh) activation functions in these equations are applied element-wise. The activations $h_t^L \in \mathbb{R}^n$ of the topmost LSTM layer at time $t$ are used to predict the distribution over the next word.
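To make the notation concrete, the following is a minimal NumPy sketch of a single LSTM step in layer $l$, assuming (as in [22]) that dropout $D$ is applied only to the non-recurrent input $h_t^{l-1}$; the variable names and the inverted-dropout scaling are illustrative choices, not taken from the paper:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout(x, keep_prob, rng):
    # D: zero out a random subset of the argument (inverted dropout, scaled at train time)
    mask = (rng.random(x.shape) < keep_prob) / keep_prob
    return x * mask

def lstm_step(h_below, h_prev, c_prev, W, b, keep_prob, rng):
    """One LSTM step in layer l at time t, following the equations above.

    h_below : h_t^{l-1}, output of the layer below (the input word vector for l = 1)
    h_prev  : h_{t-1}^l, previous hidden state of this layer
    c_prev  : c_{t-1}^l, previous memory cell state of this layer
    W, b    : parameters of the linear transform T_{2n,4n}
    Dropout D is applied only to the non-recurrent connection h_below.
    """
    n = h_prev.shape[0]
    x = np.concatenate([dropout(h_below, keep_prob, rng), h_prev])   # vector in R^{2n}
    z = W @ x + b                                                    # T_{2n,4n}: R^{2n} -> R^{4n}
    i, f, o, g = sigm(z[:n]), sigm(z[n:2*n]), sigm(z[2*n:3*n]), np.tanh(z[3*n:])
    c = f * c_prev + i * g            # element-wise multiplication (the ⊙ operator)
    h = o * np.tanh(c)
    return h, c

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((4 * n, 2 * n)) * 0.1
b = np.zeros(4 * n)
h, c = lstm_step(rng.standard_normal(n), np.zeros(n), np.zeros(n), W, b, keep_prob=0.5, rng=rng)
```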
4. Discussion
The architecture of our system is depicted in Figure 2. In Table 7 we present the results of a comparison with existing English CTS recognition systems. For clarity, we also specify the data used for training the acoustic and language models of each system. Our system achieves competitive results on the HUB5 2000 evaluation set: 7.8% WER on the Switchboard part (which is, as far as we know, the state-of-the-art result at the moment) and 16.0% WER on the CallHome part. Note that the acoustic models used in the system were trained only on the 300-hour Switchboard English CTS corpus.

Figure 2: System architecture

We consider several ways of further improving our system. First, a large accuracy gain can be obtained by adding the Fisher and CallHome corpora to the AM training set. Second, sequence-discriminative training of the BLSTM acoustic models can lead to a substantial WER reduction [24]. Third, retraining the SDBN extractor on the augmented data can provide an additional improvement. Last but not least, we plan to carry out experiments with other promising language model architectures, such as Character-Aware Neural Language Models [25], End-to-End Memory Networks [26] and others, and to investigate more sophisticated ways of applying such language models than simple n-best rescoring.
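For context, the sketch below shows what simple n-best rescoring with a neural language model typically looks like; the interpolation scheme, weight values and function names are illustrative assumptions, not the exact configuration used in our system:

```python
def rescore_nbest(nbest, lstm_lm_logprob, ngram_weight=0.5, lm_scale=12.0):
    """Re-rank an n-best list by interpolating the baseline n-gram LM
    log-probability with an LSTM-LM log-probability.

    nbest: list of (words, am_logprob, ngram_logprob) tuples from the decoder.
    lstm_lm_logprob: callable returning the LSTM-LM log-probability of a
    word sequence. ngram_weight and lm_scale are illustrative values only.
    """
    best_words, best_score = None, float("-inf")
    for words, am_logprob, ngram_logprob in nbest:
        lm_logprob = ngram_weight * ngram_logprob + (1.0 - ngram_weight) * lstm_lm_logprob(words)
        score = am_logprob + lm_scale * lm_logprob   # combined score used for re-ranking
        if score > best_score:
            best_words, best_score = words, score
    return best_words
```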
5. Acknowledgements

This work was financially supported in part by the Government of the Russian Federation, Grant 074-U01.

6. References
[1] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2011.
[2] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2013.
[3] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," Proc. Automatic Speech Recognition and Understanding (ASRU), pp. 55–59, 2013.
[4] H. Soltau, G. Saon, and T. Sainath, "Joint training of convolutional and non-convolutional neural networks," Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[5] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
[6] G. Saon, H.-K. Kuo, S. Rennie, and M. Picheny, "The IBM 2015 English conversational telephone speech recognition system," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[7] A. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig, and G. Penn, "Deep bi-directional recurrent networks over spectral windows," Proc. Automatic Speech Recognition and Understanding (ASRU), pp. 78–83, 2015.
[8] R. P. Lippmann, "Speech recognition by machines and humans," Speech Communication, vol. 22, no. 1, pp. 1–15, 1997.
[9] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, "Probabilistic and bottle-neck features for LVCSR of meetings," Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 757–760, 2007.
[10] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3377–3381, 2013.
[11] A. Prudnikov, I. Medennikov, V. Mendelev, M. Korenevsky, and Y. Khokhlov, "Improving acoustic models for Russian spontaneous speech recognition," Speech and Computer, Lecture Notes in Computer Science, vol. 9319, pp. 234–242, 2015.
[12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 1–4, 2011.
[13] A. Kozlov, O. Kudashev, Y. Matveev, T. Pekhovsky, K. Simonchik, and A. Shulipa, "SVID speaker recognition system for NIST SRE 2012," Speech and Computer, Lecture Notes in Computer Science, vol. 8113, pp. 278–285, 2013.
[14] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," Proc. ICML, pp. 1764–1772, 2014.
[15] H. Sak, A. Senior, K. Rao, and F. Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1468–1472, 2015.
[16] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2014.
[17] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[18] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[19] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, "Recurrent neural network based language model," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1045–1048, 2010.
[20] Y. Shi, W.-Q. Zhang, M. Cai, and J. Liu, "Empirically combining unnormalized NNLM and back-off n-gram for fast n-best rescoring in speech recognition," EURASIP Journal on Audio, Speech, and Music Processing, vol. 19, 2014.
[21] T. Mikolov, S. Kombrink, A. Deoras, L. Burget, and J. Cernocky, "RNNLM — recurrent neural network language modeling toolkit," ASRU 2011 Demo Session, 2011.
[22] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
[23] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[24] H. Sak, O. Vinyals, G. Heigold, A. Senior, E. McDermott, R. Monga, and M. Mao, "Sequence discriminative distributed training of long short-term memory recurrent neural networks," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2014.
[25] Y. Kim, Y. Jernite, D. Sontag, and A. Rush, "Character-aware neural language models," arXiv preprint arXiv:1508.06615, 2015.
[26] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, "End-to-end memory networks," arXiv preprint arXiv:1503.08895, 2015.