Table 3: Data augmentation results on the HUB5 2000 evaluation set

Acoustic model       SWB WER, %    CH WER, %
SDBN-DNN             12.1          23.3
SDBN-DNN + augm      11.8 (-0.3)   22.5 (-0.8)
BLSTM                11.1          20.9
BLSTM + augm         10.8 (-0.3)   20.4 (-0.5)

Table 5: Perplexity results on the train, valid and test data

Language model       PPL train   PPL valid   PPL test
4-gram (baseline)    66.366      62.946      87.039
RNNLM                57.982      78.578      76.123
LSTM-LM (medium)     51.104      58.964      56.822
LSTM-LM (large)      46.033      54.821      52.892
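As a reminder of what Table 5 reports, perplexity is the exponentiated average negative log-probability the language model assigns to each test token. A minimal sketch of the computation (the per-token log-probabilities are assumed to come from whichever model is being evaluated):

```python
import math

def perplexity(token_log_probs):
    """Perplexity of a test set, given per-token natural-log probabilities
    assigned by a language model: exp(-1/N * sum(log p(w_i | history)))."""
    n = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / n)

# e.g. a model assigning each of 3 tokens probability 0.1 has perplexity 10
assert abs(perplexity([math.log(0.1)] * 3) - 10.0) < 1e-9
```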
Here $h_t^l, c_t^l, i_t^l, f_t^l, o_t^l, g_t^l \in \mathbb{R}^n$ denote the hidden state, the memory cell state and the activations of the input gate, forget gate, output gate and input modulation gate in layer $l \in [1, L]$ at time $t$, respectively; $h_t^0 \in \mathbb{R}^n$ is the input word vector at time $t$; $T_{2n,4n}\colon \mathbb{R}^{2n} \to \mathbb{R}^{4n}$ is a linear transform with a bias; $D$ is the dropout operator that sets a random subset of its argument to zero; the symbol $\odot$ denotes element-wise multiplication. The logistic (sigm) and hyperbolic tangent (tanh) activation functions in these equations are applied element-wise. The activations $h_t^L \in \mathbb{R}^n$ of the topmost LSTM layer at time $t$ are used to predict the distribution over the next word.
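To make the notation concrete, the following is a minimal NumPy sketch of a single LSTM step in layer $l$, assuming (as in [22]) that dropout $D$ is applied only to the non-recurrent input $h_t^{l-1}$; the variable names and the inverted-dropout scaling are illustrative choices, not taken from the paper:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def dropout(x, keep_prob, rng):
    # D: zero out a random subset of the argument (inverted dropout, scaled at train time)
    mask = (rng.random(x.shape) < keep_prob) / keep_prob
    return x * mask

def lstm_step(h_below, h_prev, c_prev, W, b, keep_prob, rng):
    """One LSTM step in layer l at time t, following the equations above.

    h_below : h_t^{l-1}, output of the layer below (the input word vector for l = 1)
    h_prev  : h_{t-1}^l, previous hidden state of this layer
    c_prev  : c_{t-1}^l, previous memory cell state of this layer
    W, b    : parameters of the linear transform T_{2n,4n}
    Dropout D is applied only to the non-recurrent connection h_below.
    """
    n = h_prev.shape[0]
    x = np.concatenate([dropout(h_below, keep_prob, rng), h_prev])   # vector in R^{2n}
    z = W @ x + b                                                    # T_{2n,4n}: R^{2n} -> R^{4n}
    i, f, o, g = sigm(z[:n]), sigm(z[n:2*n]), sigm(z[2*n:3*n]), np.tanh(z[3*n:])
    c = f * c_prev + i * g            # element-wise multiplication (the ⊙ operator)
    h = o * np.tanh(c)
    return h, c

# Tiny usage example with random parameters
rng = np.random.default_rng(0)
n = 4
W = rng.standard_normal((4 * n, 2 * n)) * 0.1
b = np.zeros(4 * n)
h, c = lstm_step(rng.standard_normal(n), np.zeros(n), np.zeros(n), W, b, keep_prob=0.5, rng=rng)
```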
4. Discussion
The architecture of our system is depicted in Figure 2. In Table 7 we present the results of a comparison with existing English CTS recognition systems. For clarity, we also specify the data used for training the acoustic and language models of each system. Our system achieves competitive results on the HUB5 2000 evaluation set: 7.8% WER on the Switchboard part (which is, as far as we know, the state-of-the-art result at the moment) and 16.0% WER on the CallHome part. Note that the acoustic models used in the system were trained only on the 300-hour Switchboard English CTS corpus.

Figure 2: System architecture

We consider several ways of further improving our system. First, a large accuracy gain can be obtained by adding the Fisher and CallHome corpora to the AM training set. Second, sequence-discriminative training of the BLSTM acoustic models can lead to a substantial WER reduction [24]. Third, retraining the SDBN extractor on the augmented data can provide an additional improvement. Last but not least, we plan to carry out experiments with other promising language model architectures, such as Character-Aware Neural Language Models [25], End-to-End Memory Networks [26] and others, and to investigate more sophisticated ways of applying such language models than simple n-best rescoring.
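For context, the sketch below shows what simple n-best rescoring with a neural language model typically looks like; the interpolation scheme, weight values and function names are illustrative assumptions, not the exact configuration used in our system:

```python
def rescore_nbest(nbest, lstm_lm_logprob, ngram_weight=0.5, lm_scale=12.0):
    """Re-rank an n-best list by interpolating the baseline n-gram LM
    log-probability with an LSTM-LM log-probability.

    nbest: list of (words, am_logprob, ngram_logprob) tuples from the decoder.
    lstm_lm_logprob: callable returning the LSTM-LM log-probability of a
    word sequence. ngram_weight and lm_scale are illustrative values only.
    """
    best_words, best_score = None, float("-inf")
    for words, am_logprob, ngram_logprob in nbest:
        lm_logprob = ngram_weight * ngram_logprob + (1.0 - ngram_weight) * lstm_lm_logprob(words)
        score = am_logprob + lm_scale * lm_logprob   # combined score used for re-ranking
        if score > best_score:
            best_words, best_score = words, score
    return best_words
```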
5. Acknowledgements

This work was financially supported in part by the Government of the Russian Federation, Grant 074-U01.

6. References
[1] F. Seide, G. Li, and D. Yu, "Conversational speech transcription using context-dependent deep neural networks," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2011.
[2] K. Vesely, A. Ghoshal, L. Burget, and D. Povey, "Sequence-discriminative training of deep neural networks," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2013.
[3] G. Saon, H. Soltau, D. Nahamoo, and M. Picheny, "Speaker adaptation of neural network acoustic models using i-vectors," Proc. Automatic Speech Recognition and Understanding (ASRU), pp. 55–59, 2013.
[4] H. Soltau, G. Saon, and T. Sainath, "Joint training of convolutional and non-convolutional neural networks," Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
[5] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, and A. Ng, "Deep Speech: Scaling up end-to-end speech recognition," arXiv preprint arXiv:1412.5567, 2014.
[6] G. Saon, H.-K. Kuo, S. Rennie, and M. Picheny, "The IBM 2015 English conversational telephone speech recognition system," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[7] A. Mohamed, F. Seide, D. Yu, J. Droppo, A. Stolcke, G. Zweig, and G. Penn, "Deep bi-directional recurrent networks over spectral windows," Proc. Automatic Speech Recognition and Understanding (ASRU), pp. 78–83, 2015.
[8] R. P. Lippmann, "Speech recognition by machines and humans," Speech Communication, vol. 22, no. 1, pp. 1–15, 1997.
[9] F. Grezl, M. Karafiat, S. Kontar, and J. Cernocky, "Probabilistic and bottle-neck features for LVCSR of meetings," Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 757–760, 2007.
[10] J. Gehring, Y. Miao, F. Metze, and A. Waibel, "Extracting deep bottleneck features using stacked auto-encoders," Proceedings of International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3377–3381, 2013.
[11] A. Prudnikov, I. Medennikov, V. Mendelev, M. Korenevsky, and Y. Khokhlov, "Improving acoustic models for Russian spontaneous speech recognition," Speech and Computer, Lecture Notes in Computer Science, vol. 9319, pp. 234–242, 2015.
[12] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, "The Kaldi speech recognition toolkit," Proc. IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 1–4, 2011.
[13] A. Kozlov, O. Kudashev, Y. Matveev, T. Pekhovsky, K. Simonchik, and A. Shulipa, "SVID speaker recognition system for NIST SRE 2012," Speech and Computer, Lecture Notes in Computer Science, vol. 8113, pp. 278–285, 2013.
[14] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," Proc. ICML, pp. 1764–1772, 2014.
[15] H. Sak, A. Senior, K. Rao, and F. Beaufays, "Fast and accurate recurrent neural network acoustic models for speech recognition," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1468–1472, 2015.
[16] H. Sak, A. Senior, and F. Beaufays, "Long short-term memory recurrent neural network architectures for large scale acoustic modeling," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2014.
[17] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[18] V. Peddinti, D. Povey, and S. Khudanpur, "A time delay neural network architecture for efficient modeling of long temporal contexts," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2015.
[19] T. Mikolov, M. Karafiat, L. Burget, J. Cernocky, and S. Khudanpur, "Recurrent neural network based language model," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), pp. 1045–1048, 2010.
[20] Y. Shi, W.-Q. Zhang, M. Cai, and J. Liu, "Empirically combining unnormalized NNLM and back-off n-gram for fast n-best rescoring in speech recognition," EURASIP Journal on Audio, Speech, and Music Processing, vol. 19, 2014.
[21] T. Mikolov, S. Kombrink, A. Deoras, L. Burget, and J. Cernocky, "RNNLM — recurrent neural network language modeling toolkit," ASRU 2011 Demo Session, 2011.
[22] W. Zaremba, I. Sutskever, and O. Vinyals, "Recurrent neural network regularization," arXiv preprint arXiv:1409.2329, 2014.
[23] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015, software available from tensorflow.org. [Online]. Available: http://tensorflow.org/
[24] H. Sak, O. Vinyals, G. Heigold, A. Senior, E. McDermott, R. Monga, and M. Mao, "Sequence discriminative distributed training of long short-term memory recurrent neural networks," Proceedings of the Annual Conference of International Speech Communication Association (INTERSPEECH), 2014.
[25] Y. Kim, Y. Jernite, D. Sontag, and A. Rush, "Character-aware neural language models," arXiv preprint arXiv:1508.06615, 2015.
[26] S. Sukhbaatar, A. Szlam, J. Weston, and R. Fergus, "End-to-end memory networks," arXiv preprint arXiv:1503.08895, 2015.