Improving English Conversational Telephone Speech Recognition



Conference Paper · September 2016


DOI: 10.21437/Interspeech.2016-473




Ivan Medennikov (1,2), Alexey Prudnikov (2,3), Alexander Zatvornitskiy (1,2,3)
(1) STC-innovations Ltd, St. Petersburg, Russia
(2) ITMO University, St. Petersburg, Russia
(3) Speech Technology Center Ltd, St. Petersburg, Russia
{medennikov,prudnikov,zatvornitskiy}@speechpro.com

Abstract

The goal of this work is to build a state-of-the-art English conversational telephone speech recognition system. We investigated several techniques to improve acoustic modeling, namely speaker-dependent bottleneck features, deep Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks, data augmentation and score fusion of DNN and BLSTM models. The training set consisted of the 300-hour Switchboard English speech corpus. We also examined hypothesis rescoring using language models based on recurrent neural networks. The resulting system achieves a word error rate of 7.8% on the Switchboard part of the HUB5 2000 evaluation set, which is a competitive result.

Index Terms: conversational telephone speech recognition, deep neural networks, recurrent neural networks

1. Introduction

English conversational telephone speech (CTS) recognition systems are becoming better and better each year. This is caused by a large number of studies carried out on the Switchboard English task, such as [1–7]. In recent years major improvement of English CTS recognition systems has been obtained by the use of the techniques listed below. First, acoustic models (AM) based on deep neural networks (DNN) significantly outperformed Gaussian mixture models (GMM) [1]. Sequence-discriminative training of DNN acoustic models [2] also led to substantial recognition accuracy improvement. Second, applying acoustic models based on convolutional neural networks or recurrent neural networks in combination with DNN acoustic models showed high effectiveness. Last but not least, sophisticated language models (LM) based on feedforward or recurrent neural networks demonstrated their superiority over n-gram language models.

As a result, the state-of-the-art word error rate (WER) on the Switchboard subset of the HUB5 2000 evaluation set improved from about 16% in 2011 to about 12% in 2013, 10.4% in 2014 and 8% in 2015. The impressive WER of 8% reported by IBM researchers [6] is not too far from the human word error rate on the Switchboard English CTS recognition task, which was estimated to be around 4% in [8].

In this work we present a study on building a state-of-the-art English CTS recognition system. Our approach was to find and investigate effective techniques and then combine them. The resulting system achieves competitive results on the HUB5 2000 evaluation set: 7.8% WER on the Switchboard subset (which is the state-of-the-art result at the moment as far as we know) and 16.0% WER on the CallHome subset.

The rest of this paper is organized as follows. Section 2 presents the investigation of several techniques of acoustic modeling improvement, namely speaker-dependent bottleneck features, deep BLSTM acoustic models, data augmentation and score fusion of DNN and BLSTM acoustic models. Section 3 describes the experiments on hypothesis rescoring with RNN-based language models. Finally, Section 4 concludes the paper and discusses future work.

2. Acoustic modeling

In this section we study several acoustic modeling techniques which are promising for improving English CTS recognition. All experiments were performed on the Switchboard-1 Release 2 (LDC97S62) training set. We report results in terms of word error rate on both the Switchboard and CallHome subsets of the HUB5 2000 evaluation set.

2.1. Speaker-dependent bottleneck features

Bottleneck features are widely used in ASR systems [9, 10]. Here we present the acoustic modeling approach based on speaker-dependent bottleneck (SDBN) features. This approach was proposed in our previous work [11] for Russian spontaneous speech recognition and demonstrated high effectiveness. The idea is to extract high-level features from a DNN model which is adapted to the speaker and acoustic environment by the use of i-vectors. The extracted features are then used to train another acoustic model (see Figure 1).

Our approach consists of the following main steps:

1. Training the DNN model on the source features using the Cross-Entropy (CE) criterion.

2. Expanding the input layer of the DNN trained at the first step and retraining using the input feature vector appended with an i-vector. The regularizing term

$$R = \lambda \sum_{l=1}^{L} \sum_{i=1}^{N_l} \sum_{j=1}^{N_{l-1}} \left( W_{ij}^{l} - \bar{W}_{ij}^{l} \right)^2 \tag{1}$$

is added to the CE criterion to penalize deviation of the parameters from the source model. Here $W^l$ and $\bar{W}^l$ are the weight matrices of the $l$-th layer ($1 \le l \le L$) of the current and the source DNNs, $N_l$ is the size of the $l$-th layer, and $N_0$ is the dimension of the input feature vector.

3. Transforming the last hidden layer into two layers. The first one is a bottleneck layer with weight matrix $W_{bn}$, zero bias vector and linear activation function. The second one is a non-linear layer with the dimension of the source layer, with weight matrix $W_{out}$ and the original
bias vector $b$, activation function $f$ and the dimension of the source layer.

$$y = f(Wx + b) \approx f(W_{out}(W_{bn} x + 0) + b). \tag{2}$$

These layers are formed by applying Singular Value Decomposition (SVD) to the weight matrix $W$ of the source layer:

$$W = U S V^{T} \approx \tilde{U}_{bn} \tilde{V}_{bn}^{T} = W_{out} W_{bn}, \tag{3}$$

where $bn$ designates the reduced dimension.

4. Retraining the network formed at the previous step using the CE criterion with the penalty (1) for deviation of the parameters from their original values.

5. Discarding all layers after the bottleneck and extracting high-level SDBN features using the resulting DNN.

6. Training the GMM-HMM acoustic model using the constructed SDBN features and generating the senone alignment of the training data.

7. Training the final DNN-HMM acoustic model using SDBN features and the generated alignment.

Figure 1: Speaker-dependent bottleneck approach scheme

For experiments we used the Kaldi speech recognition toolkit [12], which contains the recipe for the Switchboard task. The performance of models was evaluated on the Switchboard part of the HUB5 2000 evaluation set. In this experiment we used the 3-gram language model (750K n-grams, vocabulary of 30.3K words) from the Kaldi recipe. This model was trained on the transcriptions of the Switchboard corpus only.

The DNN-HMM model from this recipe (local/run_dnn.sh) [2] was considered to be a baseline (DNN-baseline). This DNN, with 6 hidden layers of 2048 sigmoidal neurons each and an output softmax layer with about 9000 neurons, was trained using 11 spliced 40-dimensional fMLLR-adapted features and the state-level Minimum Bayes Risk (sMBR) sequence-discriminative criterion.

80-dimensional SDBN features were constructed using the presented approach. We applied 100-dimensional i-vectors extracted by the use of a Universal Background Model with 512 Gaussians, which was trained with our toolset [13] on the full Switchboard corpus. DNN training with the constructed SDBN features (SDBN-DNN) was performed using a temporal context of 31 frames, taking every 5th frame. We applied the following DNN configuration: 4 sigmoidal hidden layers with 2048 neurons in each, and an output softmax layer with about 9000 neurons corresponding to senones of the GMM-HMM model, which was trained using the same SDBN features. The training was carried out with the sMBR criterion. For comparison, we also performed sMBR training of the DNN model speaker-adapted with i-vectors (DNN-ivec). The results given in Table 1 demonstrate the effectiveness of the presented approach.

Table 1: Speaker-dependent bottleneck approach results on the HUB5 2000 evaluation set

Acoustic model    SWB WER, %     CH WER, %
DNN-baseline      12.9           24.5
DNN-ivec          12.5 (-0.4)    24.2 (-0.3)
SDBN-DNN          12.1 (-0.8)    23.3 (-1.2)

2.2. Bidirectional Long Short-Term Memory recurrent neural networks

Acoustic models based on deep Bidirectional Long Short-Term Memory (BLSTM) recurrent neural networks demonstrate high effectiveness in various ASR tasks [7,14,15]. In this subsection we describe our experiments with these models, carried out with the nnet3 setup of the Kaldi speech recognition toolkit.

We used the BLSTM architecture with projection layers described in paper [16]. The following configuration of the network was applied: 3 forward and 3 backward layers, cell and hidden dimensions of 1024, recurrent and non-recurrent projection dimensions of 128, and input features taken with a temporal context of 5 frames. Training examples consisted of chunks of 20 frames with an additional left context of 40 frames and right context of 40 frames. We performed 8 epochs of cross-entropy training with an initial learning rate of 0.0003 and a final learning rate of 0.00003. Model parameters were updated using the BPTT algorithm with the momentum value equal to 0.5.

We tried a few input feature configurations and chose 23-dimensional log mel filterbank energy (FBANK) features. First, we found that training data alignments prepared using the SDBN-DNN acoustic model provide substantial improvement compared with GMM-derived alignments. Second, cepstral mean normalization (CMN) of input features granted an additional improvement of the acoustic model. Third, we applied speaker adaptation of the BLSTM acoustic model using i-vectors and obtained significant WER reduction. Lastly, the resulting BLSTM was retrained using a wider chunk (80 frames) and the same left and right contexts as used before. The retraining provided a substantial gain; we suppose this is due to the better network performance on longer sequences.

The main results of the experiments are summarized in Table 2.

2.3. Data augmentation

For further improvement of our acoustic models, we tried the data augmentation approach presented in [17]. Two additional copies of the training data were created by modifying the speed to 90% and 110% of the original speed. The alignments for the speed perturbed data were generated using SDBN-DNN
acoustic model from subsection 2.1. We applied the augmentation of the training data for both SDBN-DNN and BLSTM acoustic models. For the BLSTM model, we also applied volume perturbation of the training data [18]: each recording was scaled with a factor chosen randomly in the range $[1/8, 2]$.

Table 2: BLSTM results on the HUB5 2000 evaluation set

Acoustic model    SWB WER, %     CH WER, %
baseline BLSTM    12.6           23.8
+DNN alignment    12.2 (-0.4)    22.6 (-1.2)
+CMN              12.1 (-0.5)    21.7 (-2.1)
+i-vectors        11.3 (-1.3)    21.4 (-2.4)
+retraining       11.1 (-1.5)    20.9 (-2.9)

As can be seen in Table 3, data augmentation provided a considerable gain on the HUB5 2000 evaluation set. Note that for the SDBN-DNN model we did not retrain the bottleneck extractor with the augmented data.

Table 3: Data augmentation results on the HUB5 2000 evaluation set

Acoustic model     SWB WER, %     CH WER, %
SDBN-DNN           12.1           23.3
SDBN-DNN + augm    11.8 (-0.3)    22.5 (-0.8)
BLSTM              11.1           20.9
BLSTM + augm       10.8 (-0.3)    20.4 (-0.5)

2.4. Score fusion of SDBN-DNN and BLSTM acoustic models

Score fusion of acoustic models is a well-known technique. Its underlying idea is to combine the benefits of both different model architectures and different input features. In this subsection we analyze the effectiveness of this technique applied to SDBN-DNN and BLSTM acoustic models. We used log-likelihoods (LLH) determined by the formula

$$LLH = \alpha \log\left(\frac{P_1(s|x)}{P_1(s)}\right) + (1 - \alpha) \log\left(\frac{P_2(s|x)}{P_2(s)}\right) \tag{4}$$

for the decoding with fusion of these acoustic models. Here $P_1(s|x)$ and $P_2(s|x)$ are posterior probabilities of state $s$ given input vector $x$ on the current frame, and $P_1(s)$ and $P_2(s)$ are prior probabilities of state $s$ for the SDBN-DNN and BLSTM models respectively. We estimated the prior probability of state $s$ as the average posterior probability calculated with the corresponding model on the training data. The value of $\alpha$ was set to 0.5. The results of the experiments are given in Table 4. One can see the significant WER improvement obtained by the score fusion of SDBN-DNN and BLSTM acoustic models.

Table 4: Score fusion results on the HUB5 2000 evaluation set

Acoustic model     SWB WER, %     CH WER, %
SDBN-DNN + augm    11.8           22.5
BLSTM + augm       10.8           20.4
score fusion       9.9 (-0.9)     18.9 (-1.5)

3. Language modeling

In this section we describe the experiments with language models. Word lattices obtained on the decoding pass with the 3-gram LM and the best DNN+BLSTM model fusion from subsection 2.4 were taken as a starting point for these experiments.

At the first stage, we applied lattice rescoring with the 4-gram language model (4.7M n-grams) from the Kaldi recipe. The 4-gram LM was obtained by linear interpolation of 4-gram models trained on the transcriptions of the Switchboard and Fisher corpora. This LM had the same vocabulary as the 3-gram model used in our previous experiments.

We also built two neural network LMs (NNLMs). We took utterances from the transcriptions of the Switchboard and Fisher corpora, shuffled them and replaced Out-Of-Vocabulary words with the <UNK> token. These utterances were divided into two parts: a valid set (20K utterances) and a train set (all others, about 2.5M utterances). The transcriptions of the HUB5 2000 evaluation set were used as a test set.

Table 5: Perplexity results on the train, valid and test data

Language model      PPL train    PPL valid    PPL test
4-gram (baseline)   66.366       62.946       87.039
RNNLM               57.982       78.578       76.123
LSTM-LM (medium)    51.104       58.964       56.822
LSTM-LM (large)     46.033       54.821       52.892

The first NNLM was a Recurrent Neural Network Language Model (RNNLM) [19]. It was shown that RNNLM significantly outperforms n-gram LM in various speech recognition tasks. In particular, the results demonstrated by RNNLM in the English CTS recognition task can be found in the paper [20]. We trained our model using Mikolov’s RNNLM Toolkit [21]. We applied the following RNNLM configuration: 256 neurons in the hidden layer, 4 × 200 MB of direct connections. To speed up the training we used the factorized output layer with 200 classes.

Table 6: Rescoring results on the HUB5 2000 evaluation set

Language model      SWB WER, %    CH WER, %
3-gram (SWB)        9.9           18.9
4-gram (SWB+FSH)    9.1 (-0.8)    17.6 (-1.3)
RNNLM               8.4 (-1.5)    16.8 (-2.1)
LSTM-LM (medium)    8.0 (-1.9)    16.2 (-2.7)
LSTM-LM (large)     7.8 (-2.1)    16.0 (-2.9)

The second NNLM was an LSTM recurrent neural network LM (LSTM-LM) trained with dropout regularization [22]. This model demonstrated state-of-the-art results in terms of perplexity (PPL) on the English Penn Treebank data set.

The architecture of this LSTM-LM model with $L$ layers is given by the following equations [22]:

$$\mathrm{LSTM}: h_t^{l-1}, h_{t-1}^{l}, c_{t-1}^{l} \rightarrow h_t^{l}, c_t^{l}, \tag{5}$$

$$\begin{pmatrix} i_t^l \\ f_t^l \\ o_t^l \\ g_t^l \end{pmatrix} = \begin{pmatrix} \mathrm{sigm} \\ \mathrm{sigm} \\ \mathrm{sigm} \\ \tanh \end{pmatrix} T_{2n,4n} \begin{pmatrix} D(h_t^{l-1}) \\ h_{t-1}^{l} \end{pmatrix}, \tag{6}$$

$$c_t^l = f_t^l \odot c_{t-1}^l + i_t^l \odot g_t^l, \tag{7}$$

$$h_t^l = o_t^l \odot \tanh(c_t^l). \tag{8}$$
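Equations (5)-(8) describe one time step of a single LSTM layer. The following is a minimal pure-Python sketch of that step, not the TensorFlow implementation used in the experiments. The function names (`sigm`, `dropout`, `lstm_step`), the representation of $T_{2n,4n}$ as a 4n x 2n matrix `T` plus a `bias` vector, and the rescaling of surviving units inside the dropout operator are assumptions made for illustration.

```python
import math
import random


def sigm(x):
    """Logistic sigmoid of a scalar."""
    return 1.0 / (1.0 + math.exp(-x))


def dropout(vec, rate, rng):
    """Dropout operator D: zero a random subset of the entries.

    Surviving entries are rescaled by 1 / (1 - rate) ("inverted dropout");
    this rescaling is an assumption of the sketch, not stated in the text.
    """
    if rate <= 0.0:
        return list(vec)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in vec]


def lstm_step(h_below, h_prev, c_prev, T, bias, rate=0.0, rng=None):
    """One LSTM step for layer l at time t, following equations (5)-(8).

    h_below -- h_t^{l-1}, output of the layer below (length n)
    h_prev  -- h_{t-1}^l, previous hidden state of this layer (length n)
    c_prev  -- c_{t-1}^l, previous memory cell state (length n)
    T, bias -- the affine map T_{2n,4n}: a 4n x 2n matrix and a 4n bias
    """
    rng = rng or random.Random(0)
    n = len(h_prev)
    # Dropout touches only the non-recurrent input, as in equation (6).
    x = dropout(h_below, rate, rng) + list(h_prev)
    pre = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(T, bias)]
    i = [sigm(p) for p in pre[0:n]]            # input gate
    f = [sigm(p) for p in pre[n:2 * n]]        # forget gate
    o = [sigm(p) for p in pre[2 * n:3 * n]]    # output gate
    g = [math.tanh(p) for p in pre[3 * n:]]    # input modulation gate
    # Equation (7): new memory cell state.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Equation (8): new hidden state.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c
```

Stacking $L$ such layers means feeding the returned `h` of layer $l-1$ as `h_below` of layer $l$ at the same time step, while each layer carries its own `h_prev` and `c_prev` from the previous time step.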
Here $h_t^l, c_t^l, i_t^l, f_t^l, o_t^l, g_t^l \in \mathbb{R}^n$ denote the hidden state, the memory cell state and the activations of the input gate, forget gate, output gate and input modulation gate in layer $l \in [1, L]$ at time $t$, respectively; $h_t^0 \in \mathbb{R}^n$ is an input word vector at time $t$; $T_{2n,4n}: \mathbb{R}^{2n} \rightarrow \mathbb{R}^{4n}$ is a linear transform with a bias; $D$ is the dropout operator that sets a random subset of its argument to zero; the symbol $\odot$ denotes element-wise multiplication. Logistic (sigm) and hyperbolic tangent (tanh) activation functions in these equations are applied element-wise. Activations $h_t^L \in \mathbb{R}^n$ are used to predict the word at time $t$.

We used the TensorFlow toolkit [23] to train this model. We trained two LSTM-LMs: “medium” (2 layers with 650 units each, 50% dropout on the non-recurrent connections) and “large” (2 layers with 1500 units each, 65% dropout on the non-recurrent connections) configurations from the paper [22]. For the “large” model, forget gate biases were initialized with a value of 1.0. Training on an NVIDIA GTX Titan X GPU took 40 hours for the “medium” network and 146 hours for the “large” one.

The perplexity values of these LMs on the train, valid and test data are given in Table 5. Note that the valid PPL of the baseline 4-gram model is low due to the presence of the valid texts in the training data for this LM.

Both trained NNLMs were applied for hypothesis rescoring. We generated 100-best lists from the 4-gram rescored lattices using Kaldi scripts. For the rescoring we took the weighted sum of n-gram LM and NNLM scores. The results of the rescoring are given in Table 6. It can be seen that RNNLM provided substantial improvement over n-gram LM, as did LSTM-LM over RNNLM.

4. Discussion

The architecture of our system is depicted in Figure 2. In Table 7 we present the results of comparison with existing English CTS recognition systems. For clarity, we also specify the data used for training acoustic and language models for each system. Our system achieves competitive results on the HUB5 2000 evaluation set: 7.8% WER on the Switchboard part (which is the state-of-the-art result at the moment as far as we know) and 16.0% WER on the CallHome part. Note that acoustic models used in the system were trained only on the 300-hour Switchboard English CTS corpus.

Figure 2: System architecture

Table 7: WER comparison with existing English CTS recognition systems on the HUB5 2000 evaluation set

System                 AM training data    LM training data    SWB WER, %    CH WER, %
Vesely et al. [2]      SWB                 SWB, FSH-1          12.6          24.1
Hannun et al. [5]      SWB, FSH            SWB, FSH            12.6          19.3
Peddinti et al. [18]   SWB                 SWB, FSH            11.0          -
Soltau et al. [4]      SWB                 SWB, FSH            10.4          19.1
Mohamed et al. [7]     SWB, FSH, other     SWB, FSH, other     9.2           -
Saon et al. [6]        SWB, FSH, CH        SWB, FSH, CH        8.0           14.1
This system            SWB                 SWB, FSH            7.8           16.0

We consider several ways of further improving our system. First, a great accuracy gain can be obtained by adding the Fisher and CallHome corpora to the AM training set. Second, sequence-discriminative training of BLSTM acoustic models can lead to substantial WER reduction [24]. Third, retraining the SDBN extractor with the augmented data can provide additional improvement. Last but not least, we plan to carry out experiments with other promising language model architectures such as Character-Aware Neural Language Models [25], End-To-End Memory Networks [26] and others. We are also going to investigate more sophisticated ways of applying such language models than simple n-best rescoring.

5. Acknowledgements

The work was partially financially supported by the Government of the Russian Federation, Grant 074-U01.

6. References

[1] F. Seide, G. Li, and D. Yu, “Conversational speech transcription using context-dependent deep neural networks,” Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2011.
[2] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, “Sequence-discriminative training of deep neural networks,” Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2013.
[3] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, “Speaker adaptation of neural network acoustic models using i-vectors,”
Proc. Automatic Speech Recognition and Understanding (ASRU), pp. 55–59, 2013.
[4] H. Soltau, G. Saon, and T. Sainath, “Joint training of convolutional and non-convolutional neural networks,” Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[5] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Ng, “Deep Speech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
[6] G. Saon, H.-K. Kuo, S. Rennie, and M. Picheny, “The IBM 2015 English conversational telephone speech recognition system,” Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[7] A. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig, and G. Penn, “Deep bi-directional recurrent networks over spectral windows,” Proc. Automatic Speech Recognition and Understanding (ASRU), pp. 78–83, 2015.
[8] R. P. Lippmann, “Speech recognition by machines and humans,” Speech Communication, vol. 22, no. 1, pp. 1–15, 1997.
[9] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, “Probabilistic and bottle-neck features for LVCSR of meetings,” Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 757–760, 2007.
[10] J. Gehring, Y. Miao, F. Metze, and A. Waibel, “Extracting deep bottleneck features using stacked auto-encoders,” Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3377–3381, 2013.
[11] A. Prudnikov, I. Medennikov, V. Mendelev, M. Korenevsky, and Y. Khokhlov, “Improving acoustic models for Russian spontaneous speech recognition,” Speech and Computer, Lecture Notes in Computer Science, vol. 9319, pp. 234–242, 2015.
[12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 1–4, 2011.
[13] A. Kozlov, O. Kudashev, Y. Matveev, T. Pekhovsky, K. Simonchik, and A. Shulipa, “SVID speaker recognition system for NIST SRE 2012,” Speech and Computer, Lecture Notes in Computer Science, vol. 8113, pp. 278–285, 2013.
[14] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” Proc. ICML, pp. 1764–1772, 2014.
[15] H. Sak, A. Senior, K. Rao, and F. Beaufays, “Fast and accurate recurrent neural network acoustic models for speech recognition,” Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1468–1472, 2015.
[16] H. Sak, A. Senior, and F. Beaufays, “Long short-term memory recurrent neural network architectures for large scale acoustic modeling,” Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2014.
[17] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, “Audio augmentation for speech recognition,” Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[18] V. Peddinti, D. Povey, and S. Khudanpur, “A time delay neural network architecture for efficient modeling of long temporal contexts,” Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[19] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, “Recurrent neural network based language model,” Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1045–1048, 2010.
[20] Y. Shi, W.-Q. Zhang, M. Cai, and J. Liu, “Empirically combining unnormalized NNLM and back-off n-gram for fast n-best rescoring in speech recognition,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 19, 2014.
[21] T. Mikolov, S. Kombrink, A. Deoras, L. Burget, and J. Cernocky, “RNNLM - recurrent neural network language modeling toolkit,” ASRU 2011 Demo Session, 2011.
[22] W. Zaremba, I. Sutskever, and O. Vinyals, “Recurrent neural network regularization,” arXiv preprint arXiv:1409.2329, 2014.
[23] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[24] H. Sak, O. Vinyals, G. Heigold, A. Senior, E. McDermott, R. Monga, and M. Mao, “Sequence discriminative distributed training of long short-term memory recurrent neural networks,” Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2014.
[25] Y. Kim, Y. Jernite, D. Sontag, and A. Rush, “Character-aware neural language models,” arXiv preprint arXiv:1508.06615, 2015.
[26] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, “End-to-end memory networks,” arXiv preprint arXiv:1503.08895, 2015.
