
End-to-end systems 1: CTC

(Connectionist Temporal Classification)

Steve Renals

Automatic Speech Recognition – ASR Lecture 15


11 March 2019



End-to-end systems

End-to-end systems are systems which learn to directly map from an input
sequence X to an output sequence Y , estimating P(Y |X )
Y can be a sequence of words or subwords
ML-trained HMMs are a kind of end-to-end system – the HMM estimates P(X |Y ),
and when combined with a language model gives an estimate of P(Y |X )
Sequence discriminative training of HMMs (using GMMs or DNNs) can be
regarded as end-to-end
But training is quite complicated – need to estimate the denominator (total
likelihood) using lattices, and first train conventionally (ML for GMMs, CE for NNs)
before fine-tuning with sequence discriminative training
Lattice-free MMI is one way to address these issues
Other approaches based on recurrent networks which directly map input to output
sequences
CTC – Connectionist Temporal Classification
Encoder-decoder approaches (next lecture)
Deep Speech

Output: character probabilities (a-z, <apostrophe>, <space>, <blank>)
Trained using CTC

Model structure (Figure 1 of the paper), from output to input:
Softmax output layer
Bidirectional recurrent hidden layer
3 feed-forward hidden layers
Input: Filter bank features (spectrogram)

The structure is considerably simpler than related models from the literature: a single (bidirectional) recurrent layer – the hardest part to parallelise – and no LSTM units. Once the network has predicted P(ct | x), the CTC loss L(ŷ, y) measures the prediction error; given the ground-truth character sequence y, the gradient ∇ŷ L(ŷ, y) with respect to the network outputs can be evaluated and back-propagated through the rest of the network. Training uses Nesterov's accelerated gradient method.

Hannun et al (2014), "Deep Speech: Scaling up end-to-end speech recognition",
https://arxiv.org/abs/1412.5567
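As a rough sketch of the layer stack just described (not the authors' implementation – the hidden sizes, the ReLU feed-forward layers, and the use of PyTorch's nn.RNN rather than the paper's clipped-ReLU recurrence are assumptions made here for brevity), the model and its CTC training objective might look like this:

```python
import torch
import torch.nn as nn

class DeepSpeechSketch(nn.Module):
    """Sketch of the Deep Speech stack: 3 feed-forward layers, one
    bidirectional recurrent layer (no LSTM), softmax over characters."""
    def __init__(self, n_feats=160, n_hidden=1024, n_chars=29):  # 26 letters + ' + space + blank (sizes assumed)
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(n_feats, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
            nn.Linear(n_hidden, n_hidden), nn.ReLU(),
        )
        self.birnn = nn.RNN(n_hidden, n_hidden, bidirectional=True, batch_first=True)
        self.out = nn.Linear(2 * n_hidden, n_chars)

    def forward(self, x):                     # x: (batch, time, n_feats) filter bank frames
        h = self.ff(x)
        h, _ = self.birnn(h)
        return self.out(h).log_softmax(-1)    # per-frame log P(c_t | X)

# Training step with the CTC loss (blank id 0 is an arbitrary choice here):
# logp = model(feats)                                    # (batch, T, n_chars)
# loss = nn.CTCLoss(blank=0)(logp.transpose(0, 1),       # CTCLoss expects (T, batch, n_chars)
#                            targets, input_lens, target_lens)
```
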
Deep Speech: Results

Model SWB CH Full


Vesely et al. (GMM-HMM BMMI) [44] 18.6 33.0 25.8
Vesely et al. (DNN-HMM sMBR) [44] 12.6 24.1 18.4
Maas et al. (DNN-HMM SWB) [28] 14.6 26.3 20.5
Maas et al. (DNN-HMM FSH) [28] 16.0 23.7 19.9
Seide et al. (CD-DNN) [39] 16.1 n/a n/a
Kingsbury et al. (DNN-HMM sMBR HF) [22] 13.3 n/a n/a
Sainath et al. (CNN-HMM) [36] 11.5 n/a n/a
Soltau et al. (MLP/CNN+I-Vector) [40] 10.4 n/a n/a
Deep Speech SWB 20.0 31.8 25.9
Deep Speech SWB + FSH 12.6 19.3 16.0

Table 3 from Hannun et al (2014): published error rates (% WER) on Switchboard dataset splits. The columns
labelled "SWB" and "CH" are respectively the easy and hard subsets of Hub5'00; "Full" is the combined set.

Deep Speech Training

Maps from acoustic frames X to subword sequences S, where S is a sequence of
characters (in some other CTC approaches, S can be a sequence of phones)
CTC loss function
Makes good use of large training data
Additional synthetic training data generated by jittering the signal and adding noise
Many computational optimisations
n-gram language model to impose word-level constraints
Competitive results on standard tasks



Connectionist Temporal Classification (CTC)

Train a recurrent network to map from input sequence X to output sequence S
Sequences can be different lengths – for speech, input sequence X (acoustic frames)
is much longer than output sequence S (characters or phonemes)
CTC does not require frame-level alignment (matching each input frame to an
output token)
CTC sums over all possible alignments (similar to forward-backward algorithm) –
"alignment free"
Possible to back-propagate gradients through CTC

Good overview of CTC: Awni Hannun, "Sequence Modeling with CTC", Distill.
https://distill.pub/2017/ctc



CTC: Alignment
Imagine mapping (x1, x2, x3, x4, x5, x6) to [a, b, c]
Possible alignments: aaabbc, aabbcc, abbbbc, ...
However:
Don't always want to map every input frame to an output symbol (e.g. if there is
"inter-symbol silence")
Want to be able to have two identical symbols adjacent to each other – keep the
difference between, e.g., [h, e, l, l, o] and [h, e, l, o]
Solve this using an additional blank symbol (ϵ)
CTC output compression (see the code sketch after the alignment example below):
1. Merge repeating characters
2. Remove blanks
Thus to model the same character twice in succession, separate the repeats with a blank
Some possible alignments for [h, e, l, l, o] and [h, e, l, o] given a 10-element input
sequence:
[h, e, l, l, o]: h e ϵ l ϵ l o o o o; h h e e l ϵ l l o o
[h, e, l, o]: h e e l l l l o o o; h h e ϵ ϵ l l ϵ o o
CTC: Alignment example

Start from a frame-level alignment: h h e ϵ ϵ l l l ϵ l l o
First, merge repeated characters: h e ϵ l ϵ l o
Then, remove any ϵ tokens: h e l l o
The remaining characters, [h, e, l, l, o], are the output.
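A minimal Python sketch of these two compression steps (the function name and the use of "ϵ" as the blank symbol are just illustrative choices):

```python
def ctc_collapse(alignment, blank="ϵ"):
    """CTC output compression: merge repeating characters, then remove blanks."""
    merged = [s for i, s in enumerate(alignment) if i == 0 or s != alignment[i - 1]]
    return [s for s in merged if s != blank]

# The example above:
print(ctc_collapse(list("hheϵϵlllϵllo")))   # ['h', 'e', 'l', 'l', 'o']
```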



CTC: Valid and invalid alignments

Consider an output [c, a, t] with an input of length six

Valid alignments:
ϵ c c ϵ a t
c c a a t t
c a ϵ ϵ ϵ t

Invalid alignments:
c ϵ c ϵ a t – corresponds to Y = [c, c, a, t]
c c a a t – has length 5
c ϵ ϵ ϵ t t – missing the 'a'
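These can be checked mechanically with the ctc_collapse sketch above (illustrative only):

```python
print(ctc_collapse(list("ϵccϵat")))   # ['c', 'a', 't']      – collapses to the target: valid
print(ctc_collapse(list("cϵcϵat")))   # ['c', 'c', 'a', 't']  – wrong output: invalid
# "c c a a t" collapses to [c, a, t], but has length 5 rather than 6, so it is not a valid alignment either
```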



CTC: Alignment properties

Monotonic – alignments are monotonic (left-to-right model); no re-ordering
(unlike neural machine translation)
Many-to-one – alignments are many-to-one; many inputs can map to the same
output (however a single input cannot map to many outputs)
CTC doesn’t find a single alignment: it sums over all possible alignments



CTC: Loss function (1)

Let C be an output label sequence, including blanks and repetitions – same length
as the input sequence X
Posterior probability of output labels C = (c1, ..., ct, ..., cT) given the input
sequence X = (x1, ..., xt, ..., xT):

P(C|X) = ∏_{t=1}^{T} y(ct, t)

where y(ct, t) is the network output for label ct at time t


This is the probability of a single alignment
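Read concretely, with the network outputs stored as a T × K matrix y (an assumed layout, where y[t, k] is the output for label k at time t), the probability of one alignment is a straight product down the rows:

```python
import numpy as np

def alignment_prob(y, alignment):
    """P(C|X) for a single alignment C = (c1, ..., cT), given y[t, k]."""
    return float(np.prod([y[t, c] for t, c in enumerate(alignment)]))
```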



CTC: Loss function (2)

Let S be the target output sequence after compression
Compute the posterior probability of the target sequence S = (s1, ..., sm, ..., sM)
(M ≤ T) given X by summing over the possible CTC alignments:

P(S|X) = ∑_{C ∈ A(S)} P(C|X)

where A(S) is the set of possible output label sequences C that can be mapped to S
using the CTC compression rules (merge repeated labels, then remove blanks)
The CTC loss function LCTC is given by the negative log likelihood of the sum over
CTC alignments:

LCTC = − log P(S|X)
Perform the sum over alignments using dynamic programming – similar in structure
to the forward-backward and Viterbi algorithms (see Hannun for details)
Various NN architectures can be used for CTC – usually a deep bidirectional
LSTM RNN
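A minimal sketch of that dynamic-programming sum – the standard CTC forward recursion over the blank-extended target, as described in Hannun's Distill article. It works in raw probabilities for clarity; a practical implementation would use log-space or rescaling to avoid underflow, and the variable names here are not from the lecture:

```python
import numpy as np

def ctc_forward(y, target, blank=0):
    """P(S|X): sum over all CTC alignments of the target label ids,
    given y[t, k] = network output for label k at time t."""
    ext = [blank]                      # blank-extended target: ϵ s1 ϵ s2 ... sM ϵ
    for s in target:
        ext += [s, blank]
    T, L = y.shape[0], len(ext)

    alpha = np.zeros((T, L))
    alpha[0, 0] = y[0, ext[0]]         # start with a blank ...
    alpha[0, 1] = y[0, ext[1]]         # ... or with the first label
    for t in range(1, T):
        for j in range(L):
            a = alpha[t - 1, j]                                  # stay on the same extended label
            if j > 0:
                a += alpha[t - 1, j - 1]                         # advance by one position
            if j > 1 and ext[j] != blank and ext[j] != ext[j - 2]:
                a += alpha[t - 1, j - 2]                         # skip the blank between different labels
            alpha[t, j] = a * y[t, ext[j]]

    return alpha[-1, -1] + alpha[-1, -2]   # end on the last label or the trailing blank

# LCTC = -log P(S|X), e.g. for target "ab" with label ids {ϵ: 0, a: 1, b: 2}:
y = np.full((4, 3), 1 / 3)                 # 4 frames, uniform outputs (toy example)
loss = -np.log(ctc_forward(y, [1, 2]))
```

Libraries such as PyTorch (nn.CTCLoss) provide this loss, computed in log space, with gradients built in.
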
CTC: Distribution over alignments
We start with an input sequence, like a spectrogram of audio.
The input is fed into an RNN, for example.
The network gives pt(a | X), a distribution over the outputs {h, e, l, o, ϵ} for each input step.
With the per time-step output distribution, we compute the probability of different alignments,
e.g. "h e ϵ l l ϵ l l o o", "h h e l l ϵ ϵ l ϵ o", "ϵ e ϵ l l ϵ ϵ l o o".
By marginalising over alignments, we get a distribution over outputs such as "hello", "ello" and "helo".
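The marginalisation in this figure can be made explicit with a brute-force sketch: enumerate every frame-level alignment, collapse it, and accumulate its probability onto the resulting output string. This is only feasible for toy inputs – the forward recursion above is the scalable version – and the names and label set are illustrative:

```python
import numpy as np
from collections import defaultdict
from itertools import product

def output_distribution(y, labels, blank="ϵ"):
    """Distribution over output strings, obtained by summing the probability
    of every alignment that collapses to each string."""
    dist = defaultdict(float)
    for align in product(range(len(labels)), repeat=y.shape[0]):
        p = float(np.prod([y[t, k] for t, k in enumerate(align)]))
        syms = [labels[k] for k in align]
        merged = [s for i, s in enumerate(syms) if i == 0 or s != syms[i - 1]]
        dist["".join(s for s in merged if s != blank)] += p
    return dist

# Toy example: 3 frames, outputs over {h, e, l, o, ϵ}
y = np.full((3, 5), 0.2)
dist = output_distribution(y, ["h", "e", "l", "o", "ϵ"])
# dist["he"] is then P([h, e] | X), summed over all alignments that collapse to "he"
```
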
Understanding CTC: Conditional independence assumption

Each output is dependent on the entire input sequence (in Deep Speech this is
achieved using a bidirectional recurrent layer)
Given the inputs, each output is independent of the other outputs (conditional
independence)
CTC does not learn a language model over the outputs, although a language
model can be applied later
Graphical model showing dependencies in CTC: each output a1, a2, ..., aT depends
on the entire input X, but not on the other outputs.



Understanding CTC: CTC and HMM

A left-to-right HMM over the labels (a → b), compared with the corresponding
"CTC HMM", which inserts a skippable blank state around each label (ϵ → a → ϵ → b → ϵ).

CTC can be interpreted as an HMM with additional (skippable) blank states,
trained discriminatively



Applying language models to CTC

Direct interpolation of a language model with the CTC acoustic model:

Ŵ = arg max_W (α log P(S|X) + log P(W))

Only consider word sequences W which correspond to the subword sequence S
(using a lexicon)
α is an empirically determined scale factor to match the acoustic model to the
language model
Lexicon-free CTC: use a “subword language model” P(S) (Maas et al, 2015)
WFST implementation: create an FST T which transforms a framewise label
sequence C into the subword sequence S, then compose with L and G:
T ◦ min(det(L ◦ G)) (Miao et al, 2015)
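One simple way to realise the interpolation above – treating it as rescoring a set of candidate word sequences, which is an assumption about the setup rather than the slide's exact recipe – is:

```python
def rescore(candidates, alpha):
    """Pick Ŵ = arg max_W (α · log P(S|X) + log P(W)).
    candidates holds triples (W, log_p_s_given_x, log_p_w), where each word
    sequence W has been mapped to its subword sequence S via the lexicon."""
    return max(candidates, key=lambda c: alpha * c[1] + c[2])[0]
```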



Mozilla Deep Speech

Mozilla have released an open-source TensorFlow implementation of the Deep
Speech architecture:
https://hacks.mozilla.org/2017/11/a-journey-to-10-word-error-rate/
https://github.com/mozilla/DeepSpeech
Close to state-of-the-art results on LibriSpeech
Mozilla Common Voice project: https://voice.mozilla.org/en



Summary and reading

CTC is an alternative approach to sequence discriminative training, typically
applied to RNN systems
Used in “Deep Speech” architecture for end-to-end speech recognition
Reading
A Hannun et al (2014), "Deep Speech: Scaling up end-to-end speech recognition",
arXiv:1412.5567. https://arxiv.org/abs/1412.5567
A Hannun (2017), "Sequence Modeling with CTC", Distill.
https://distill.pub/2017/ctc
Background reading
Y Miao et al (2015), "EESEN: End-to-end speech recognition using deep RNN
models and WFST-based decoding", ASRU 2015.
https://ieeexplore.ieee.org/abstract/document/7404790
A Maas et al (2015), "Lexicon-free conversational speech recognition with neural
networks", NAACL HLT 2015. http://www.aclweb.org/anthology/N15-1038

