
End-to-end systems 1: CTC

(Connectionist Temporal Classification)

Steve Renals

Automatic Speech Recognition – ASR Lecture 15


11 March 2019



End-to-end systems

End-to-end systems are systems which learn to directly map from an input
sequence X to an output sequence Y , estimating P(Y |X )
Y can be a sequence of words or subwords
ML-trained HMMs are a kind of end-to-end system – the HMM estimates P(X |Y ),
and when combined with a language model gives an estimate of P(Y |X )
Sequence discriminative training of HMMs (using GMMs or DNNs) can be
regarded as end-to-end
But training is quite complicated – need to estimate the denominator (total
likelihood) using lattices, and first train conventionally (ML for GMMs, CE for NNs)
before fine-tuning with sequence discriminative training
Lattice-free MMI is one way to address these issues
Other approaches based on recurrent networks which directly map input to output
sequences
CTC – Connectionist Temporal Classification
Encoder-decoder approaches (next lecture)
Deep Speech

Output: character probabilities (a-z, <apostrophe>, <space>, <blank>)
Trained using CTC

Model structure (Figure 1 of the paper), from output to input:
Softmax output layer
Bidirectional recurrent hidden layer
3 feed-forward hidden layers
Input: Filter bank features (spectrogram)

The structure is considerably simpler than related models from the literature: a single (bidirectional) recurrent layer – the hardest part to parallelise – and no LSTM units. Once the network has predicted P(ct | x), the CTC loss L(ŷ, y) measures the prediction error; given the ground-truth character sequence y, the gradient ∇ŷ L(ŷ, y) with respect to the network outputs can be evaluated and back-propagated through the rest of the network. Training uses Nesterov's accelerated gradient method.

Hannun et al (2014), "Deep Speech: Scaling up end-to-end speech recognition",
https://arxiv.org/abs/1412.5567
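As a rough sketch of the layer stack just described (not the authors' implementation – the hidden sizes, the ReLU feed-forward layers, and the use of PyTorch's nn.RNN rather than the paper's clipped-ReLU recurrence are assumptions made here for brevity), the model and its CTC training objective might look like this:

```python
import torch
import torch.nn as nn

class DeepSpeechSketch(nn.Module):
    """Sketch of the Deep Speech stack: 3 feed-forward layers, one
    bidirectional recurrent layer (no LSTM), softmax over characters."""
    def __init__(self, n_feats=160, n_hidden=1024, n_chars=29):  # 26 letters + ' + space + blank (sizes assumed)
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(n_feats, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.birnn = nn.RNN(n_hidden, n_hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * n_hidden, n_chars)

    def forward(self, x):                     # x: (batch, time, n_feats) filter bank frames
        h = self.ff(x)
        h, _ = self.birnn(h)
        return self.out(h).log_softmax(-1)    # per-frame log P(c_t | X)

# Training step with the CTC loss (blank id 0 is an arbitrary choice here):
# logp = model(feats)                                    # (batch, T, n_chars)
# loss = nn.CTCLoss(blank=0)(logp.transpose(0, 1),       # CTCLoss expects (T, batch, n_chars)
#                            targets, input_lens, target_lens)
```
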
Deep Speech: Results

Model SWB CH Full


Vesely et al. (GMM-HMM BMMI) [44] 18.6 33.0 25.8
Vesely et al. (DNN-HMM sMBR) [44] 12.6 24.1 18.4
Maas et al. (DNN-HMM SWB) [28] 14.6 26.3 20.5
Maas et al. (DNN-HMM FSH) [28] 16.0 23.7 19.9
Seide et al. (CD-DNN) [39] 16.1 n/a n/a
Kingsbury et al. (DNN-HMM sMBR HF) [22] 13.3 n/a n/a
Sainath et al. (CNN-HMM) [36] 11.5 n/a n/a
Soltau et al. (MLP/CNN+I-Vector) [40] 10.4 n/a n/a
Deep Speech SWB 20.0 31.8 25.9
Deep Speech SWB + FSH 12.6 19.3 16.0

Table 3 from Hannun et al (2014): published error rates (% WER) on Switchboard dataset splits. The columns
labelled "SWB" and "CH" are respectively the easy and hard subsets of Hub5'00; "Full" is the combined set.

Deep Speech Training

Maps from acoustic frames X to subword sequences S, where S is a sequence of
characters (in some other CTC approaches, S can be a sequence of phones)
CTC loss function
Makes good use of large training data
Additional synthetic training data generated by jittering the signal and adding noise
Many computational optimisations
n-gram language model to impose word-level constraints
Competitive results on standard tasks



Connectionist Temporal Classification (CTC)

Train a recurrent network to map from input sequence X to output sequence S
Sequences can be different lengths – for speech, input sequence X (acoustic frames)
is much longer than output sequence S (characters or phonemes)
CTC does not require frame-level alignment (matching each input frame to an
output token)
CTC sums over all possible alignments (similar to forward-backward algorithm) –
"alignment free"
Possible to back-propagate gradients through CTC

Good overview of CTC: Awni Hannun, "Sequence Modeling with CTC", Distill.
https://distill.pub/2017/ctc



CTC: Alignment
Imagine mapping (x1, x2, x3, x4, x5, x6) to [a, b, c]
Possible alignments: aaabbc, aabbcc, abbbbc, ...
However:
Don't always want to map every input frame to an output symbol (e.g. if there is
"inter-symbol silence")
Want to be able to have two identical symbols adjacent to each other – keep the
difference between, e.g., [h, e, l, l, o] and [h, e, l, o]
Solve this using an additional blank symbol (ϵ)
CTC output compression (see the code sketch after the alignment example below):
1. Merge repeating characters
2. Remove blanks
Thus to model the same character twice in succession, separate the repeats with a blank
Some possible alignments for [h, e, l, l, o] and [h, e, l, o] given a 10-element input
sequence:
[h, e, l, l, o]: h e ϵ l ϵ l o o o o; h h e e l ϵ l l o o
[h, e, l, o]: h e e l l l l o o o; h h e ϵ ϵ l l ϵ o o
CTC: Alignment example

Start from a frame-level alignment: h h e ϵ ϵ l l l ϵ l l o
First, merge repeated characters: h e ϵ l ϵ l o
Then, remove any ϵ tokens: h e l l o
The remaining characters, [h, e, l, l, o], are the output.
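A minimal Python sketch of these two compression steps (the function name and the use of "ϵ" as the blank symbol are just illustrative choices):

```python
def ctc_collapse(alignment, blank="ϵ"):
    """CTC output compression: merge repeating characters, then remove blanks."""
    merged = [s for i, s in enumerate(alignment) if i == 0 or s != alignment[i - 1]]
    return [s for s in merged if s != blank]

# The example above:
print(ctc_collapse(list("hheϵϵlllϵllo")))   # ['h', 'e', 'l', 'l', 'o']
```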



CTC: Valid and invalid alignments

Consider an output [c, a, t] with an input of length six

Valid alignments:
ϵ c c ϵ a t
c c a a t t
c a ϵ ϵ ϵ t

Invalid alignments:
c ϵ c ϵ a t – corresponds to Y = [c, c, a, t]
c c a a t – has length 5
c ϵ ϵ ϵ t t – missing the 'a'
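These can be checked mechanically with the ctc_collapse sketch above (illustrative only):

```python
print(ctc_collapse(list("ϵccϵat")))   # ['c', 'a', 't']      – collapses to the target: valid
print(ctc_collapse(list("cϵcϵat")))   # ['c', 'c', 'a', 't']  – wrong output: invalid
# "c c a a t" collapses to [c, a, t], but has length 5 rather than 6, so it is not a valid alignment either
```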



CTC: Alignment properties

Monotonic – alignments are monotonic (left-to-right model); no re-ordering
(unlike neural machine translation)
Many-to-one – alignments are many-to-one; many inputs can map to the same
output (however a single input cannot map to many outputs)
CTC doesn’t find a single alignment: it sums over all possible alignments



CTC: Loss function (1)

Let C be an output label sequence, including blanks and repetitions – same length
as the input sequence X
Posterior probability of output labels C = (c1, ..., ct, ..., cT) given the input
sequence X = (x1, ..., xt, ..., xT):

P(C|X) = ∏_{t=1}^{T} y(ct, t)

where y(ct, t) is the network output for label ct at time t


This is the probability of a single alignment
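Read concretely, with the network outputs stored as a T × K matrix y (an assumed layout, where y[t, k] is the output for label k at time t), the probability of one alignment is a straight product down the rows:

```python
import numpy as np

def alignment_prob(y, alignment):
    """P(C|X) for a single alignment C = (c1, ..., cT), given y[t, k]."""
    return float(np.prod([y[t, c] for t, c in enumerate(alignment)]))
```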



CTC: Loss function (2)

Let S be the target output sequence after compression
Compute the posterior probability of the target sequence S = (s1, ..., sm, ..., sM)
(M ≤ T) given X by summing over the possible CTC alignments:

P(S|X) = ∑_{C ∈ A(S)} P(C|X)

where A(S) is the set of possible output label sequences C that can be mapped to S
using the CTC compression rules (merge repeated labels, then remove blanks)
The CTC loss function LCTC is given by the negative log likelihood of the sum over
CTC alignments:

LCTC = − log P(S|X)
Perform the sum over alignments using dynamic programming – similar in structure
to the forward-backward and Viterbi algorithms (see Hannun for details)
Various NN architectures can be used for CTC – usually a deep bidirectional
LSTM RNN
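A minimal sketch of that dynamic-programming sum – the standard CTC forward recursion over the blank-extended target, as described in Hannun's Distill article. It works in raw probabilities for clarity; a practical implementation would use log-space or rescaling to avoid underflow, and the variable names here are not from the lecture:

```python
import numpy as np

def ctc_forward(y, target, blank=0):
    """P(S|X): sum over all CTC alignments of the target label ids,
    given y[t, k] = network output for label k at time t."""
    ext = [blank]                      # blank-extended target: ϵ s1 ϵ s2 ... sM ϵ
    for s in target:
        ext += [s, blank]
    T, L = y.shape[0], len(ext)

    alpha = np.zeros((T, L))
    alpha[0, 0] = y[0, ext[0]]         # start with a blank ...
    alpha[0, 1] = y[0, ext[1]]         # ... or with the first label
    for t in range(1, T):
        for j in range(L):
            a = alpha[t - 1, j]                                  # stay on the same extended label
            if j > 0:
                a += alpha[t - 1, j - 1]                         # advance by one position
            if j > 1 and ext[j] != blank and ext[j] != ext[j - 2]:
                a += alpha[t - 1, j - 2]                         # skip the blank between different labels
            alpha[t, j] = a * y[t, ext[j]]

    return alpha[-1, -1] + alpha[-1, -2]   # end on the last label or the trailing blank

# LCTC = -log P(S|X), e.g. for target "ab" with label ids {ϵ: 0, a: 1, b: 2}:
y = np.full((4, 3), 1 / 3)                 # 4 frames, uniform outputs (toy example)
loss = -np.log(ctc_forward(y, [1, 2]))
```

Libraries such as PyTorch (nn.CTCLoss) provide this loss, computed in log space, with gradients built in.
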
CTC: Distribution over alignments
We start with an input sequence, like a spectrogram of audio.
The input is fed into an RNN, for example.
The network gives pt(a | X), a distribution over the outputs {h, e, l, o, ϵ} for each input step.
With the per time-step output distribution, we compute the probability of different alignments,
e.g. "h e ϵ l l ϵ l l o o", "h h e l l ϵ ϵ l ϵ o", "ϵ e ϵ l l ϵ ϵ l o o".
By marginalising over alignments, we get a distribution over outputs such as "hello", "ello" and "helo".
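The marginalisation in this figure can be made explicit with a brute-force sketch: enumerate every frame-level alignment, collapse it, and accumulate its probability onto the resulting output string. This is only feasible for toy inputs – the forward recursion above is the scalable version – and the names and label set are illustrative:

```python
import numpy as np
from collections import defaultdict
from itertools import product

def output_distribution(y, labels, blank="ϵ"):
    """Distribution over output strings, obtained by summing the probability
    of every alignment that collapses to each string."""
    dist = defaultdict(float)
    for align in product(range(len(labels)), repeat=y.shape[0]):
        p = float(np.prod([y[t, k] for t, k in enumerate(align)]))
        syms = [labels[k] for k in align]
        merged = [s for i, s in enumerate(syms) if i == 0 or s != syms[i - 1]]
        dist["".join(s for s in merged if s != blank)] += p
    return dist

# Toy example: 3 frames, outputs over {h, e, l, o, ϵ}
y = np.full((3, 5), 0.2)
dist = output_distribution(y, ["h", "e", "l", "o", "ϵ"])
# dist["he"] is then P([h, e] | X), summed over all alignments that collapse to "he"
```
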
Understanding CTC: Conditional independence assumption

Each output is dependent on the entire input sequence (in Deep Speech this is
achieved using a bidirectional recurrent layer)
Given the inputs, each output is independent of the other outputs (conditional
independence)
CTC does not learn a language model over the outputs, although a language
model can be applied later
Graphical model showing dependencies in CTC: each output a1, a2, ..., aT depends
on the entire input X, but not on the other outputs.



Understanding CTC: CTC and HMM

A left-to-right HMM over the labels (a → b), compared with the corresponding
"CTC HMM", which inserts a skippable blank state around each label (ϵ → a → ϵ → b → ϵ).

CTC can be interpreted as an HMM with additional (skippable) blank states,
trained discriminatively



Applying language models to CTC

Direct interpolation of a language model with the CTC acoustic model:

Ŵ = arg max_W (α log P(S|X) + log P(W))

Only consider word sequences W which correspond to the subword sequence S
(using a lexicon)
α is an empirically determined scale factor to match the acoustic model to the
language model
Lexicon-free CTC: use a “subword language model” P(S) (Maas et al, 2015)
WFST implementation: create an FST T which transforms a framewise label
sequence C into the subword sequence S, then compose with L and G:
T ◦ min(det(L ◦ G)) (Miao et al, 2015)
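One simple way to realise the interpolation above – treating it as rescoring a set of candidate word sequences, which is an assumption about the setup rather than the slide's exact recipe – is:

```python
def rescore(candidates, alpha):
    """Pick Ŵ = arg max_W (α · log P(S|X) + log P(W)).
    candidates holds triples (W, log_p_s_given_x, log_p_w), where each word
    sequence W has been mapped to its subword sequence S via the lexicon."""
    return max(candidates, key=lambda c: alpha * c[1] + c[2])[0]
```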



Mozilla Deep Speech

Mozilla have released an open-source TensorFlow implementation of the Deep
Speech architecture:
https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/
https://github.com/mozilla/DeepSpeech
Close to state-of-the-art results on LibriSpeech
Mozilla Common Voice project: https://voice.mozilla.org/en



Summary and reading

CTC is an alternative approach to sequence discriminative training, typically
applied to RNN systems
Used in “Deep Speech” architecture for end-to-end speech recognition
Reading
A Hannun et al (2014), "Deep Speech: Scaling up end-to-end speech recognition",
arXiv:1412.5567. https://arxiv.org/abs/1412.5567
A Hannun (2017), "Sequence Modeling with CTC", Distill.
https://distill.pub/2017/ctc
Background reading
Y Miao et al (2015), "EESEN: End-to-end speech recognition using deep RNN
models and WFST-based decoding", ASRU 2015.
https://ieeexplore.ieee.org/abstract/document/7404790
A Maas et al (2015), "Lexicon-free conversational speech recognition with neural
networks", NAACL HLT 2015. http://www.aclweb.org/anthology/N15-1038

