Improving English Conversational Telephone Speech Recognition
Table 3: Data augmentation results on the HUB5 2000 evaluation set

Acoustic model      SWB WER, %     CH WER, %
SDBN-DNN            12.1           23.3
SDBN-DNN + augm     11.8 (-0.3)    22.5 (-0.8)
BLSTM               11.1           20.9
BLSTM + augm        10.8 (-0.3)    20.4 (-0.5)

Table 5: Perplexity results on the train, valid and test data

Language model       PPL train   PPL valid   PPL test
4-gram (baseline)    66.366      62.946      87.039
RNNLM                57.982      78.578      76.123
LSTM-LM (medium)     51.104      58.964      56.822
LSTM-LM (large)      46.033      54.821      52.892
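
For reference, perplexity figures such as those in Table 5 are conventionally the exponentiated average negative log-likelihood per word. The following is a minimal sketch of that convention, not necessarily the exact procedure used for the table: it assumes natural-log word probabilities are already available, and the function names `perplexity` and `relative_reduction` are hypothetical helpers introduced only for illustration.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-word natural-log probabilities.

    Conventional definition: exp of the average negative log-likelihood.
    """
    if not log_probs:
        raise ValueError("need at least one word probability")
    avg_nll = -sum(log_probs) / len(log_probs)
    return math.exp(avg_nll)


def relative_reduction(baseline, new):
    """Relative reduction of `new` with respect to `baseline` (a fraction)."""
    return (baseline - new) / baseline


# Toy check: a model that assigns probability 0.25 to every word
# has a perplexity of about 4.
toy_log_probs = [math.log(0.25)] * 8
print(perplexity(toy_log_probs))

# Test-set perplexities taken from Table 5: 4-gram baseline vs. LSTM-LM (large).
print(relative_reduction(87.039, 52.892))
```

With the Table 5 values, the second call gives the relative perplexity reduction of the large LSTM-LM over the 4-gram baseline on the test data.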
Here $h_t^l, c_t^l, i_t^l, f_t^l, o_t^l, g_t^l \in \mathbb{R}^n$ denote the hidden state, the memory cell state and the activations of the input gate, forget gate, output gate and input modulation gate in layer $l \in [1, L]$ at time $t$, respectively; $h_t^0 \in \mathbb{R}^n$ is the input word vector at time $t$; $T_{2n,4n}: \mathbb{R}^{2n} \to \mathbb{R}^{4n}$ is a linear transform with a bias; $D$ is the dropout operator that sets a random subset of its argument to zero; the symbol $\odot$ denotes element-wise multiplication. The logistic (sigm) and hyperbolic tangent (tanh) activation functions in these equations are applied element-wise. Activations $h_t^L \in \mathbb{R}$

To-End Memory Networks [26] and others. We are also going to investigate approaches to applying sophisticated language models that are more elaborate than simple n-best rescoring.
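
The LSTM notation defined above can be illustrated with a minimal sketch of a single layer update. It assumes the common formulation in which the four gate pre-activations come from one affine map $T_{2n,4n}$ of the concatenated inputs, and in which dropout $D$ is applied only to the input from the layer below. The ordering of the gates inside the 4n-vector, the function names and the toy weights are illustrative assumptions rather than details taken from our system.

```python
import math
import random


def sigm(x):
    """Logistic activation."""
    return 1.0 / (1.0 + math.exp(-x))


def dropout(vec, p, rng):
    """D: set a random subset of the argument to zero (each entry with prob. p).

    No rescaling is applied here; many implementations rescale the kept
    entries, which is a separate design choice.
    """
    return [0.0 if rng.random() < p else v for v in vec]


def affine(W, b, x):
    """T_{2n,4n}: linear transform with a bias, R^{2n} -> R^{4n}."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def lstm_step(h_below, h_prev, c_prev, W, b, drop_p=0.0, rng=None):
    """One update of layer l at time t.

    h_below: h_t^{l-1} (for l = 1 this is the input word vector h_t^0)
    h_prev:  h_{t-1}^l, c_prev: c_{t-1}^l
    W: 4n x 2n weight matrix, b: bias of length 4n
    Returns (h_t^l, c_t^l).
    """
    n = len(h_prev)
    rng = rng or random.Random(0)
    x = dropout(h_below, drop_p, rng) + h_prev  # concatenation, length 2n
    z = affine(W, b, x)                         # pre-activations, length 4n
    # Assumed gate order inside z: input, forget, output, input modulation.
    i = [sigm(v) for v in z[0:n]]
    f = [sigm(v) for v in z[n:2 * n]]
    o = [sigm(v) for v in z[2 * n:3 * n]]
    g = [math.tanh(v) for v in z[3 * n:4 * n]]
    # Element-wise products correspond to the element-wise multiplication symbol.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy run with n = 2 and small random weights (values are arbitrary).
n = 2
init = random.Random(1)
W = [[init.uniform(-0.1, 0.1) for _ in range(2 * n)] for _ in range(4 * n)]
b = [0.0] * (4 * n)
h, c = lstm_step([1.0, -1.0], [0.0, 0.0], [0.0, 0.0], W, b, drop_p=0.5)
print(h, c)
```

Stacking L such layers, with each layer's $h_t^l$ passed upward as the next layer's `h_below`, gives the multi-layer arrangement that the notation describes.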
4. Discussion

The architecture of our system is depicted in Figure 2. In Table 7 we compare our results with those of existing English CTS recognition systems. For clarity, we also specify the data used for training the acoustic and language models of each system. Our system achieves competitive results on the HUB5 2000 evaluation set: 7.8% WER on the Switchboard part (which, as far as we know, is the state-of-the-art result at the moment) and 16.0% WER on the CallHome part. Note that the acoustic models used in the system were trained only on the 300

Figure 2: System architecture

5. Acknowledgements

The work was partially financially supported by the Government of the Russian Federation, Grant 074-U01.

6. References
