Presentation 2

The document discusses end-to-end automatic speech recognition. It mentions two popular open-source toolkits, ESPnet and Eesen, for building end-to-end ASR systems. ESPnet is based on Chainer and PyTorch and follows the Kaldi toolkit for data processing and recipes. Eesen is based on Kaldi but uses bidirectional RNNs/LSTMs with CTC training. It also discusses using Kaldi for end-to-end ASR with TensorFlow integration. Finally, it reviews several papers on end-to-end approaches using CNNs and RNNs with different features.


End-to-End Automatic Speech Recognition

KUNAL DHAWAN
KUMAR PRIYADARSHI

Meeting 1
End-to-End ASR: online libraries and open-source code
1) ESPnet: end-to-end speech processing toolkit
- Based on Chainer and PyTorch
- Follows the Kaldi ASR toolkit style for data processing, feature extraction/format, and recipes, providing a complete setup for speech recognition
- Paper: https://arxiv.org/pdf/1804.00015.pdf
- Fairly recent, so it still has some bugs, but the contributors are active in fixing them
2) Eesen
- Based on Kaldi
- Acoustic model: bidirectional RNNs with LSTM units
- Training: connectionist temporal classification (CTC) as the training objective
- Decoding: a principled decoding approach based on weighted finite-state transducers (WFSTs)
- Paper: https://arxiv.org/pdf/1507.08240.pdf
- Problem: the library is difficult to modify, which makes it hard to try out new ideas
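
To make the CTC training objective concrete, here is a minimal pure-Python sketch of the CTC forward algorithm. It sums the probabilities of all frame-level alignments that collapse to a target label sequence and returns the negative log-likelihood. The function name, the argument layout, and the choice of index 0 for the blank symbol are assumptions made for illustration. This is not Eesen's implementation.

```python
import math


def ctc_neg_log_likelihood(log_probs, target, blank=0):
    """Return -log p(target | input) under CTC.

    log_probs: list of T frames, each a list of per-symbol log-probabilities.
    target:    list of label indices (without blanks).
    """
    # Interleave blanks around the labels: [a, b] -> [blank, a, blank, b, blank]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    S, T = len(ext), len(log_probs)
    neg_inf = float("-inf")

    def logsumexp(values):
        m = max(values)
        if m == neg_inf:
            return neg_inf
        return m + math.log(sum(math.exp(v - m) for v in values))

    # alpha[s] = log-probability of all partial alignments ending in ext[s]
    alpha = [neg_inf] * S
    alpha[0] = log_probs[0][ext[0]]
    if S > 1:
        alpha[1] = log_probs[0][ext[1]]

    for t in range(1, T):
        new_alpha = [neg_inf] * S
        for s in range(S):
            terms = [alpha[s]]              # stay on the same symbol
            if s >= 1:
                terms.append(alpha[s - 1])  # advance by one position
            # Skipping the blank is only allowed between two different labels
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                terms.append(alpha[s - 2])
            new_alpha[s] = logsumexp(terms) + log_probs[t][ext[s]]
        alpha = new_alpha

    # A valid alignment ends on the last label or on the trailing blank
    tail = [alpha[S - 1]]
    if S > 1:
        tail.append(alpha[S - 2])
    return -logsumexp(tail)
```

With two frames, a two-symbol vocabulary (blank and one label) and uniform outputs, three of the four possible paths collapse to the single label, so the function returns -log(0.75).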
Kaldi
- No current implementation specifically for end-to-end ASR
- However, Kaldi now offers TensorFlow integration, which should make it easier to try out our own ideas
Literature Review
• End-to-End Deep Neural Network for Automatic Speech Recognition (2016)
  William Song, Jim Cai, Stanford University

- Approach
  - CNN for frame-level classification
  - RNN with CTC loss for decoding
  - No traditional hidden Markov model (HMM) is used
  - Log mel-filterbank features as input

- Results
  - Frame-level classification is satisfactory
  - The decoding scheme needs improvement
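
As a point of reference for the decoding step, the simplest CTC decoder is greedy (best-path) decoding: take the most likely symbol in each frame, merge consecutive repeats, and drop blanks. The sketch below is a generic baseline, not the decoding scheme used in the paper. The function name and the blank index of 0 are assumptions.

```python
def ctc_greedy_decode(frame_scores, blank=0):
    """Best-path CTC decoding.

    frame_scores: list of T frames, each a list of per-symbol scores
                  (probabilities or log-probabilities).
    Returns the decoded label sequence as a list of symbol indices.
    """
    # Most likely symbol per frame
    best_path = [max(range(len(frame)), key=frame.__getitem__)
                 for frame in frame_scores]

    decoded = []
    previous = None
    for symbol in best_path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank
        if symbol != previous and symbol != blank:
            decoded.append(symbol)
        previous = symbol
    return decoded
```

A blank between two identical symbols keeps them separate, so a frame-level path such as 1, 1, blank, 1, 2, 2 decodes to 1, 1, 2. Beam search, optionally with a language model, is the usual next step when greedy decoding is not accurate enough.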
Literature Review
• Towards End-To-End Speech Recognition with Deep Convolutional Neural Networks
  Zhang et al. (co-authored by Y. Bengio), Interspeech 2016

- Approach
  - CNN for frame-level classification
  - No RNN used at all
  - CTC loss used for decoding
  - No traditional hidden Markov model (HMM) is used
  - Log mel-filterbank features as input

- Results
  - The CNN is able to capture temporal relations
  - Training is faster compared to RNN models
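
One way to see how a CNN can capture temporal relations without recurrence is to compute its receptive field along the time axis: each additional convolutional layer widens the span of input frames that a single output unit depends on. The sketch below does this arithmetic. The layer configurations in the usage note are hypothetical and are not taken from the paper.

```python
def receptive_field(layers):
    """Number of input frames seen by one output unit of a stacked 1-D CNN.

    layers: list of (kernel_size, stride) pairs along the time axis,
            ordered from the input layer to the output layer.
    """
    field = 1  # receptive field measured in input frames
    jump = 1   # distance in input frames between adjacent units of the current layer
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field
```

For example, three stacked layers with kernel size 3 and stride 1 cover 7 input frames, and replacing the middle layer's stride with 2 extends this to 9 frames.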
Literature Review
• End-To-End Speech Recognition from the Raw Waveform (2018)
  Zeghidour et al., Facebook A.I.

- Approach
  - End-to-end system trained directly from the raw waveform
  - Uses trainable filterbanks in place of log mel-filterbanks
  - Uses a CNN architecture

- Results
  - Improved performance over log mel-filterbanks
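
For context, a conventional mel filterbank is fixed by a formula: the filter centre frequencies are spaced evenly on the mel scale. A trainable front end would learn such parameters instead of fixing them. The sketch below computes the fixed centre frequencies using the common 2595 * log10(1 + f / 700) mel formula. The function names and the 40-filter, 0 to 8000 Hz setting in the usage note are assumptions for illustration, not values from the paper.

```python
import math


def hz_to_mel(hz):
    """Convert a frequency in Hz to the mel scale."""
    return 2595.0 * math.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    """Convert a mel-scale value back to Hz."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_center_frequencies(n_filters, f_min, f_max):
    """Centre frequencies (Hz) of n_filters filters spaced evenly in mel."""
    low, high = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (high - low) / (n_filters + 1)
    return [mel_to_hz(low + step * (i + 1)) for i in range(n_filters)]
```

Calling `mel_center_frequencies(40, 0.0, 8000.0)` gives 40 centre frequencies that are densely packed at low frequencies and increasingly spread out towards 8000 Hz.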
Thank you!
