Word Beam Search: A Connectionist Temporal Classification Decoding Algorithm


Harald Scheidl, Stefan Fiel, Robert Sablatnig
Computer Vision Lab
TU Wien
1040 Vienna, Austria
harald [email protected], {fiel,sab}@cvl.tuwien.ac.at

Abstract—Recurrent Neural Networks (RNNs) are used for sequence recognition tasks such as Handwritten Text Recognition (HTR) or speech recognition. If trained with the Connectionist Temporal Classification (CTC) loss function, the output of such an RNN is a matrix containing character probabilities for each time-step. A CTC decoding algorithm maps these character probabilities to the final text. Token passing is such an algorithm and is able to constrain the recognized text to a sequence of dictionary words. However, the running time of token passing depends quadratically on the dictionary size and it is not able to decode arbitrary character strings like numbers. This paper proposes word beam search decoding, which is able to tackle these problems. It constrains words to those contained in a dictionary, allows arbitrary non-word character strings between words, optionally integrates a word-level language model and has a better running time than token passing. The proposed algorithm outperforms best path decoding, vanilla beam search decoding and token passing on the IAM and Bentham HTR datasets. An open-source implementation is provided.

Index Terms—connectionist temporal classification, decoding, language model, recurrent neural network, speech recognition, handwritten text recognition

I. INTRODUCTION

Sequence recognition is the task of transcribing sequences of data with sequences of labels [1]. Well-known use-cases are Handwritten Text Recognition (HTR) and speech recognition. Graves et al. [2] introduce the Connectionist Temporal Classification (CTC) operation, which enables neural network training from pairs of data and target labelings (text). The neural network is trained to output the labelings in a specific coding scheme. Decoding algorithms are used to calculate the final labeling. Hwang and Sung [3] present a beam search decoding algorithm which can be extended by a character-level Language Model (LM). Graves et al. [4] introduce the token passing algorithm, which constrains its output to a sequence of dictionary words and uses a word-level LM. The motivation to propose the Word Beam Search (WBS) decoding algorithm¹ is twofold:

• Vanilla Beam Search (VBS) decoding works on character-level and does not constrain its beams (text candidates) to dictionary words.
• The running time of token passing depends quadratically on the dictionary size [4], which is not feasible for large dictionaries, as shown in Section V. Further, the algorithm does not handle non-word character strings. Punctuation marks and large numbers occur in the IAM and Bentham datasets; however, putting all possible combinations of these into the dictionary would enlarge it unnecessarily.

WBS uses a prefix tree that is created from a dictionary to constrain the words in the recognized text. Four different methods to score the beams by a word-level LM are proposed: (1) only constrain the beams by the dictionary, (2) score when a word is completely recognized, (3) forecast the score by calculating possible next words (Ortmanns et al. [5] use this idea in the context of hidden Markov models) and (4) forecast the score with a random sample of possible next words. Further, there is an operating-state in which arbitrary non-word character strings are recognized. The proposed algorithm is able to outperform best path decoding, VBS and token passing on the IAM [6] and Bentham [7] datasets. Furthermore, its running time is lower than that of token passing.

The rest of the paper is organized as follows: in Section II, a brief introduction to CTC loss and CTC decoding is given. Then, prefix trees and LMs are discussed. Section IV presents the proposed algorithm. The evaluation compares the scoring-modes of the algorithm and further compares the results with other decoding algorithms. Finally, the conclusion summarizes the paper.

¹ An open-source implementation is available at: https://github.com/githubharald/CTCWordBeamSearch

II. STATE OF THE ART

First, the CTC operation is discussed. Afterwards, two state-of-the-art decoding algorithms, namely VBS and token passing, are presented.

A. Connectionist Temporal Classification

A Recurrent Neural Network (RNN) outputs a sequence of length T with C + 1 character probabilities per sequence element, where C denotes the number of characters [4]. An additional pseudo-character is added to the RNN output which is called blank and is denoted by “-” in this paper. Picking one character per time-step from the RNN output and concatenating them forms a path π [4]. The probability of a path is defined as the product of all character probabilities on this path. A single character from a labeling is encoded by one or multiple adjacent occurrences of this character on the path, possibly followed by a sequence of blanks [4].

A way to model this encoding is by using a Finite State Machine (FSM) [3]. The FSM is created as follows: the labeling is first extended by inserting blanks and then a state is created for each character and blank. Consecutive states are connected by a transition and a self-loop is added to each state. Further, a direct transition skipping the blank is added for consecutive but different characters. Figure 1 shows an FSM which produces valid paths for the labeling “ab” by proceeding from a start state to the final state on an arbitrary path, e.g. “- a a - - b -” or “a b” among others.

Fig. 1. FSM which produces valid paths (encodings) for the labeling “ab”. The two left-most states are the initial states while the right-most state is the final state.

To decode a path into a labeling, the encoding operation implemented by the FSM has to be inverted, which is done by a collapsing function B [3]. It is applied to a path π and yields a labeling l by first removing repeated characters and then removing blanks on the path. To give an example, the path “a - b b - -” is collapsed to B(“a - b b - -”) = “ab”. The loss is calculated by taking all paths yielding the target labeling l (i.e. all paths π for which B(π) = l holds) and summing over their probabilities. This enables training the RNN without knowing the character-positions of the target labeling in the input.

After the RNN is trained by the CTC loss, new samples are presented to the neural network to recognize the handwritten text. A first approximation of decoding the RNN output is to take the most probable character per time-step, forming the so-called best path, and then apply the collapsing function B to this path [4]. This approximation algorithm is called best path decoding [4]. However, there are situations for which this yields the wrong result, as illustrated in Figure 2. The given RNN output has a length of 2 and has two possible characters “a” and “b” and further the blank “-”. Taking the most probable characters yields the best path “- -” and therefore the empty labeling B(“- -”) = “” with probability 0.6 · 0.6 = 0.36. However, the correct answer is “a”; this can be seen by summing up the probabilities of all paths yielding this labeling: “a -”, “- a” and “a a” with probability 2 · 0.6 · 0.4 + 0.4 · 0.4 = 0.64. The other decoding algorithms presented in this paper are able to correctly handle such situations.

Fig. 2. An example of an RNN output. The sequence has a length of 2 with time-steps t0 and t1. For each time-step a probability distribution over the possible characters (“a” and “b” and blank “-”) is output by the RNN. Lines indicate four different paths through the RNN output: the dotted line indicates the best path yielding the labeling “” while the dashed lines indicate paths yielding the labeling “a”.

B. Vanilla Beam Search Decoding

The VBS decoding algorithm is described in the paper of Hwang and Sung [3]. An illustration is shown in Figure 3. The RNN output is a matrix of size T × (C + 1) and is fed into the decoding algorithm. Multiple candidates for the final labeling are iteratively calculated and are called beams. At each time-step, each beam-labeling is extended by all possible characters. Additionally, the original beam is also copied to the next time-step. This forms a tree as shown in Figure 3. To avoid exponential growth of the tree, only the best beams are kept at each time-step: the Beam Width (BW) governs the number of beams to keep. If two beam-labelings at a given time-step are equal, they get merged by first summing up the probabilities and then removing one of them. A character-level LM can optionally be used to score the extension of a beam-labeling by a character. Constrained decoding is possible by removing a beam as soon as an Out-Of-Vocabulary (OOV) word occurs [8]. However, there is the chance that each beam contains an OOV word, which limits the usage of this constrained decoding approach.

Fig. 3. Iteratively extending beam-labelings (from top to bottom) forms this tree. Only the two best-scoring beams are kept per time-step (i.e. BW = 2), all others are removed (red). Equal labelings get merged (blue).

The time-complexity can be derived from the pseudo-code of Algorithm 1. At each time-step, the beams are sorted according to their score. In the previous time-step, each of the BW beams is extended by C characters, therefore BW · C beams have to be sorted, which accounts for O(BW · C · log(BW · C)). As this sorting happens for each of the T time-steps, the overall time-complexity is O(T · BW · C · log(BW · C)). The two inner loops can be ignored as they only account for O(BW · C).

Fig. 4. Three word models (blue) are put in parallel. Information flow is implemented by tokens (red) which are passed through the states and between the words.
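Before turning to token passing, the decoding behaviour described in Sections II-A and II-B can be made concrete with a short sketch. The following Python code is only an illustration of the concepts, not the authors' implementation; the matrix values are taken from the example of Figure 2 and all function and variable names are chosen for this sketch. Best path decoding returns the empty labeling (probability 0.36), while beam search, which sums the probabilities of all paths per labeling and merges equal labelings, recovers “a” with probability 0.64.

```python
from collections import defaultdict

# Toy RNN output from Fig. 2: rows = time-steps t0, t1;
# columns = probabilities of "a", "b" and the blank "-".
mat = [[0.4, 0.0, 0.6],
       [0.4, 0.0, 0.6]]
chars = "ab"          # character set (blank handled separately)
blank = len(chars)    # index of the blank in each row

def best_path_decoding(mat):
    """Take the most probable character per time-step, then collapse
    repeats and remove blanks (the collapsing function B of Section II-A)."""
    best = [max(range(len(row)), key=lambda c: row[c]) for row in mat]
    out, prev = [], None
    for c in best:
        if c != blank and c != prev:
            out.append(chars[c])
        prev = c
    return "".join(out)

def vanilla_beam_search(mat, beam_width=2):
    """Minimal VBS without an LM: beams are labelings; (Pb, Pnb) are the
    probabilities of the paths ending in a blank / non-blank, so equal
    labelings are merged by summing."""
    beams = {(): (1.0, 0.0)}                    # empty labeling: Pb=1, Pnb=0
    for row in mat:
        new = defaultdict(lambda: [0.0, 0.0])
        for lab, (pb, pnb) in beams.items():
            # extend by blank: labeling stays unchanged
            new[lab][0] += (pb + pnb) * row[blank]
            # copy beam: repeating the last character also keeps the labeling
            if lab:
                new[lab][1] += pnb * row[lab[-1]]
            for c in range(len(chars)):
                ext = lab + (c,)
                if lab and lab[-1] == c:
                    # repeated character: only paths ending in a blank
                    # can start a new occurrence
                    new[ext][1] += pb * row[c]
                else:
                    new[ext][1] += (pb + pnb) * row[c]
        # keep only the best `beam_width` beams per time-step
        beams = dict(sorted(new.items(), key=lambda kv: -sum(kv[1]))[:beam_width])
    best = max(beams.items(), key=lambda kv: sum(kv[1]))
    return "".join(chars[c] for c in best[0]), sum(best[1])

print(best_path_decoding(mat))    # "" (probability 0.36)
print(vanilla_beam_search(mat))   # ("a", 0.64)
```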
C. Token Passing

This algorithm is proposed by Graves et al. [4]; however, the following discussion is based on another publication from Graves [1]. A dictionary, an LM and an RNN output are given and the algorithm outputs a sequence of dictionary words. For each word a word-model is created, which essentially is a state machine connecting consecutive characters with respect to the already discussed CTC coding scheme. A word sequence is modeled by putting multiple word-models in parallel, connecting all end-states with all begin-states. The information flow is implemented by tokens, which are passed from state to state. Each token holds the score and the history of already visited words. The algorithm searches for the most likely sequence of dictionary words by aligning them with the RNN output and scoring word-transitions with a word-level LM. Figure 4 shows three word-models which are put in parallel. The time-complexity of this algorithm is O(T · W²), where T denotes the sequence length and W the dictionary size [1].

III. METHODOLOGY

The proposed WBS decoding algorithm uses a prefix tree to query the characters that can extend the current beam-labeling. Further, words which have the current beam-labeling as prefix can also be queried. An LM is used to score the beams on word-level.

A. Prefix Tree

A prefix tree or trie (from retrieval) is a basic tool in the domain of string processing [9]. Figure 5 shows a prefix tree containing 5 words. It is a tree data structure and therefore consists of edges and nodes. Each edge is labeled by a character and points to the next node. A node has a word-flag which indicates if this node represents a word. A word is encoded in the tree by a path starting at the root node and following the edges labeled with the corresponding characters of the word. Querying characters which follow a given prefix is easy: the node corresponding to the prefix is identified and the outgoing edge labels determine the characters which can follow the prefix. It is also possible to identify all words which contain a given prefix: starting from the corresponding node of the prefix, all descendant nodes are collected which have the word-flag set. As an example, the characters and words following the prefix “th” in Figure 5 are determined. The node is found by starting at the root node and following the edges “t” and “h”. Possible following characters are the outgoing edges “i” and “a” of this node. Words starting with the given prefix are “this” and “that”.

Fig. 5. Prefix tree containing the words “a”, “to”, “too”, “this” and “that”. Double circles indicate that the word-flag is set.

Aoe et al. [10] show an efficient implementation of this data structure. Finding the node for a prefix with length L needs O(L) time in their implementation. The time to find all words containing a given prefix depends on the number of nodes of the tree. An upper bound for the number of nodes is the number of words W times the maximum number of characters M of a word, therefore the time to find all words is O(W · M).

B. Language Model

An LM is able to predict upcoming words given previous words and it is also able to assign probabilities to given sequences of words [11]. It can be queried to give the probability P(w|h) that a word sequence (history) h is followed by the word w. Such a model is trained from a text by counting how often w follows h. The probability of a sequence is then P(h) = P(w1) · P(w2|w1) · P(w3|w1, w2) · ... · P(wn|w1, w2, ..., wn−1) [11].

It is not feasible to learn all possible word sequences, therefore an approximation called N-gram is used [11]. Instead of using the complete history, only a few words from the past are used to predict the next word. N-grams with N=2 are called bigrams. Bigrams only take the last word into account, i.e. they approximate P(wn|h) by P(wn|wn−1). The probability of a sequence is then given by P(h) = P(w1) · ∏_{n=2}^{|h|} P(wn|wn−1). Another special case is the unigram LM, which does not consider the history at all but only the relative frequency of a word in the training-text, i.e. P(h) = ∏_{n=1}^{|h|} P(wn).

The N-gram distributions are learned from a training-text. For the unigram distribution, the number of occurrences of a word is counted and normalized by the total number of words in the text. The bigram distribution is calculated by first counting how often a word w1 is followed by a word w2 and then normalizing by the total number of words which follow w1. If a word is contained in the test-text but not in the training-text, it is called an OOV word. In this case a zero probability is assigned to the sequence, even if only one OOV word occurs. To overcome this problem, smoothing can be applied to the N-gram distribution; more details are available in Jurafsky [11].

IV. PROPOSED ALGORITHM

WBS decoding is a modification of VBS decoding and has the following properties:
• Words are constrained to dictionary words.
• Any number of non-word characters is allowed between words.
• A word-level bigram LM can optionally be integrated.
• Better running time than token passing (regarding time-complexity and real running time on a computer).

To constrain words to dictionary words and also allow arbitrary non-word characters between words, each beam is in one of two states. If a beam gets extended by a word-character (typically “a”, “b”, ...), then the beam is in the word-state, otherwise it is in the non-word-state. Figure 6 shows the two beam-states and the transitions. A beam is extended by a set of characters which depends on the beam-state. If the beam is in the non-word-state, the beam-labeling can be extended by all possible non-word-characters (typically “ ”, “.”, ...). Furthermore, it can be extended by each character which occurs as the first character of a word. These characters are retrieved from the edges which leave the root node of the prefix tree. If the beam is in the word-state, the prefix tree is queried for a list of possible next characters. Figure 7 shows an example of a beam currently in the word-state. The last word-characters form the prefix “th”; the corresponding node in the prefix tree is found by following the edges “t” and “h”. The outgoing edges of this node determine the next characters, which are “i” and “a” in this example. In addition, if the current prefix represents a complete word (e.g. “to”), then the next characters also include all non-word-characters.

Fig. 6. A beam can be in one of two states.

Fig. 7. A beam currently in the word-state. The “#” character represents a non-word character. The current prefix “th” can be extended by “i” and “a” and can be extended to form the words “this” and “that”. The prefix tree from Figure 5 is assumed in this example.

Optionally, a word-level LM can be integrated. A bigram LM is assumed in the following text. The more words a beam contains, the more often it gets scored by the LM. To account for this, the score gets normalized by the number of words. Four possible LM scoring-modes exist; the names assigned to them are used throughout the paper:
• Words: only a dictionary but no LM is used.
• N-grams: each time a beam makes a transition from the word-state to the non-word-state, the beam-labeling gets scored by the LM.
• N-grams + Forecast: each time a beam is extended by a word-character, all possible next words are queried from the prefix tree. Figure 7 shows an example for the prefix “th” which can be extended to the words “this” and “that”. All beam-extensions by possible next words are scored by the LM and the scores are summed up. This scoring scheme can be regarded as an LM forecast.
• N-grams + Forecast + Sample: in the worst case, all words of the dictionary have to be taken into account for the forecast. To limit the number of possible next words, these are randomly sampled before calculating the LM score. The sum of the scores must be corrected to account for the sampling process.

Algorithm 1 shows the pseudo-code for WBS decoding. The set B holds the beams of the current time-step, and P holds the probabilities for the beams. Pb is the probability that the paths of a beam end with a blank, Pnb that they end with a non-blank, and Ptxt is the probability assigned by the LM. Ptot is an abbreviation for Pb + Pnb. The algorithm iterates from t = 1 through t = T and creates a tree of beam-labelings as shown in Figure 3. An empty beam is denoted by ∅ and the last character of a beam is indexed by −1. The best beams are obtained by sorting them with regard to Ptot · Ptxt and keeping only the BW best ones. For each of the beams, the probability of seeing the beam-labeling at the current time-step is calculated. Book-keeping separately for paths ending with a blank and paths ending with a non-blank accounts for the CTC coding scheme. Each beam gets extended by a set of possible next characters, depending on the beam-state. When extending a beam, the LM calculates a score Ptxt depending on the scoring-mode. The normalization of the LM score is achieved by taking Ptxt to the power of 1/numWords(b), where numWords(b) is the number of words contained in the beam b. After the algorithm has finished its iteration through time, the beam-labelings get completed if necessary: if a beam-labeling ends with a prefix not representing a complete word, the prefix tree is queried to give a list of possible words which contain the prefix. Two different ways to implement the completion exist: either the beam-labeling is extended by the most likely word (according to the LM), or the beam-labeling is only completed if the list of possible words contains exactly one entry.
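To make the data structure behind nextChars(b) in Algorithm 1 (shown below) concrete, the following is a minimal Python sketch of a prefix tree with a word-flag (Section III-A) together with the beam-state-dependent character sets of Section IV. It reproduces the example of Figure 5. The class and method names, as well as the concrete set of non-word characters, are assumptions made for this sketch and do not correspond to the released C++ implementation.

```python
class PrefixTree:
    """Minimal prefix tree (trie) with a word-flag per node."""

    def __init__(self):
        self.children = {}      # edge label (character) -> child node
        self.is_word = False    # word-flag: node represents a complete word

    def add_word(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, PrefixTree())
        node.is_word = True

    def _find(self, prefix):
        node = self
        for ch in prefix:                    # O(L) for a prefix of length L
            node = node.children.get(ch)
            if node is None:
                return None
        return node

    def next_chars(self, prefix):
        """Outgoing edge labels of the prefix node, i.e. possible next characters."""
        node = self._find(prefix)
        return set(node.children) if node else set()

    def is_word_prefix(self, prefix):
        node = self._find(prefix)
        return node is not None and node.is_word

    def words(self, prefix):
        """All dictionary words starting with the prefix (used for beam completion)."""
        node, found = self._find(prefix), []

        def collect(n, suffix):
            if n.is_word:
                found.append(prefix + suffix)
            for ch, child in n.children.items():
                collect(child, suffix + ch)

        if node is not None:
            collect(node, "")
        return found


# Assumed set of non-word characters; the real choice depends on the dataset.
NON_WORD_CHARS = set(" .,;:!?\"'")

def next_beam_chars(tree, word_prefix):
    """Characters that may extend a beam (Section IV): in the non-word-state
    (empty word prefix) all non-word characters and all first characters of
    dictionary words; in the word-state the characters following the prefix,
    plus non-word characters if the prefix is already a complete word."""
    if word_prefix == "":                       # non-word-state
        return tree.next_chars("") | NON_WORD_CHARS
    allowed = tree.next_chars(word_prefix)      # word-state
    if tree.is_word_prefix(word_prefix):        # e.g. "to" may end here
        allowed |= NON_WORD_CHARS
    return allowed


# The example of Fig. 5:
tree = PrefixTree()
for w in ["a", "to", "too", "this", "that"]:
    tree.add_word(w)

print(sorted(tree.next_chars("th")))        # ['a', 'i']
print(sorted(tree.words("th")))             # ['that', 'this']
print(sorted(next_beam_chars(tree, "to")))  # 'o' plus the non-word characters
```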
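The LM-related ingredients of scoreBeam in Algorithm 1 can be sketched in the same spirit: a word-level bigram LM with add-k smoothing (Section III-B; the experiments use k = 0.01) and the normalization that takes the LM score to the power of 1/numWords(b) (Section IV). The class name, the toy training text and the exact smoothing formula are assumptions of this sketch, and the score actually applied by WBS additionally depends on the chosen scoring-mode.

```python
from collections import Counter

class BigramLM:
    """Minimal word-level bigram LM with add-k smoothing (illustrative only)."""

    def __init__(self, training_text, k=0.01):
        words = training_text.split()
        self.k = k
        self.vocab = set(words)
        self.unigrams = Counter(words)
        self.bigrams = Counter(zip(words, words[1:]))
        self.total = len(words)

    def p_unigram(self, w):
        # relative frequency of w in the training-text, smoothed with add-k
        return (self.unigrams[w] + self.k) / (self.total + self.k * len(self.vocab))

    def p_bigram(self, w, prev):
        # P(w | prev): counts of "prev w", normalized by the count of prev, smoothed
        return (self.bigrams[(prev, w)] + self.k) / (self.unigrams[prev] + self.k * len(self.vocab))

    def score(self, word_sequence):
        """P(h) = P(w1) * prod_{n>=2} P(wn | wn-1) for a list of words."""
        p = self.p_unigram(word_sequence[0])
        for prev, w in zip(word_sequence, word_sequence[1:]):
            p *= self.p_bigram(w, prev)
        return p


def normalized_lm_score(lm, words_in_beam):
    """LM score of a beam, normalized by the number of words it contains
    (Ptxt taken to the power of 1/numWords(b), as described in Section IV)."""
    return lm.score(words_in_beam) ** (1.0 / len(words_in_beam))


lm = BigramLM("the cat sat on the mat the cat ran")
print(lm.p_bigram("cat", "the"))                 # P("cat" | "the")
print(normalized_lm_score(lm, ["the", "cat"]))   # beam score used together with Ptot
```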
Algorithm 1: Word Beam Search
Data: RNN output matrix mat, BW and LM
Result: most probable labeling

B = {∅}
Pb(∅, 0) = 1
for t = 1...T do
    bestBeams = bestBeams(B, BW)
    B = {}
    for b ∈ bestBeams do
        if b != ∅ then
            Pnb(b, t) += Pnb(b, t−1) · mat(b(−1), t)
        end
        Pb(b, t) += Ptot(b, t−1) · mat(blank, t)
        B = B ∪ b
        nextChars = nextChars(b)
        for c ∈ nextChars do
            b′ = b + c
            Ptxt(b′) = scoreBeam(LM, b, c)
            if b(−1) == c then
                Pnb(b′, t) += mat(c, t) · Pb(b, t−1)
            else
                Pnb(b′, t) += mat(c, t) · Ptot(b, t−1)
            end
            B = B ∪ b′
        end
    end
end
B = completeBeams(B)
return bestBeams(B, 1)

The time-complexity depends on the scoring-mode used. If no LM is used, the only difference to VBS is to query the next possible characters. This takes O(M · C) time, where M is the maximum length of a word and C is the number of unique characters. The overall running time therefore is O(T · BW · C · (log(BW · C) + M)). If an LM is used, then a lookup in a unigram and/or bigram table is performed when extending a beam. Searching such a table takes O(log(W)). This sums to the overall running time of O(T · BW · C · (log(BW · C) + M + log(W))). In the case of LM forecasting, the next words have to be searched, which takes O(M · W). The LM is queried S times, where S is the size of the word sample. If no sampling is used, then S = W. The running time sums to O(T · BW · C · (log(BW · C) + S · log(W) + W · M)).

V. RESULTS

Evaluation is done using the IAM and Bentham HTR datasets. The goal of the evaluation is to compare the performance of different decoding algorithms given the same neural network output. It is not about achieving or outperforming state-of-the-art results on the mentioned datasets. Character Error Rate (CER) and Word Error Rate (WER) are chosen as error measures [12]. The neural network is inspired by the CRNN model proposed by Shi et al. [13] and is implemented using the TensorFlow framework. It consists of seven convolutional layers, two RNN layers and a final CTC layer. The output sequence of the neural network has a length of 100 time-steps. IAM consists of 79 different characters while Bentham has 93 characters; therefore the output of the RNN is a matrix of size T × (C + 1) = 100 × 80 and 100 × 94, respectively. The tested algorithms are: best path decoding, token passing, VBS and WBS. The last three algorithms are implemented as custom TensorFlow operations in C++. For WBS decoding, the four scoring modes are evaluated. Experiments for the BW values 15, 30 and 50 are conducted. The LM is either trained with the text of the test-set (denoted as Te) or the text of the training-set concatenated with a word list² (denoted as Tr+L) which consists of 370,099 words. The resulting dictionary sizes are as follows: 3,707 and 373,412 unique words for IAM and 1,911 and 372,933 unique words for Bentham. Training the LM with the text that has to be recognized can be seen as the best case and is of course a simplification (the LM has zero OOV words). Using the training-set concatenated with the word list, on the other hand, is a very rudimentary training-text for the LM. In practice, results can be expected to be in between these two extreme cases. The LM uses add-k smoothing with a smoothing value of k = 0.01.

² Taken from https://github.com/dwyl/english-words

The results of the experiments are shown in Table I and are given in the format CER (%) / WER (%) / time per sample (ms). The latter value includes the time needed to evaluate the neural network. WBS is always able to outperform best path decoding and token passing in at least one of its scoring-modes. Regarding the BW, increasing this value for VBS only marginally changes the results. The CER does not change at all except when using the Tr+L training-text for IAM, while the WER varies by around 0.1%. This suggests that a BW of 15 is large enough for this algorithm. In contrast, both error measures improve when increasing the BW of WBS: CER improves by up to 0.76% and WER by up to 0.86%. WBS using Words mode outperforms VBS as long as the BW is large enough (30 or 50). A possible explanation for the different impact of this parameter can be derived from the variability of the words contained in the beam-labelings of a single time-step. The word-variability of VBS is greater than that of WBS, and the beam-labelings of the latter algorithm mainly differ in the punctuation. Therefore WBS needs a larger number of beams to allow recovering from a wrong word-hypothesis. Increasing the BW from 15 to 50 increases the running time by a factor of around 3. Regarding the scoring modes of WBS (BW fixed to 15 from now on), the WER achieved by the Words mode can always be outperformed by at least one of the other modes which incorporate N-gram probabilities. For IAM the best CER and WER are obtained by N-grams + Forecast mode for the Tr+L training-text and N-grams + Forecast + Sample mode for the Te training-text. For both training-texts of Bentham, N-grams mode yields the best WER. The CER also benefits from considering N-gram probabilities while decoding.
The only exception is when using the Tr+L training-text for Bentham, in which case Words mode achieves the best CER. Best path decoding is the fastest algorithm, needing at most 15 ms per sample. VBS is around 5 times slower than best path decoding. The running time of WBS mainly depends on the scoring mode and the dictionary size. It stays below 100 ms for the Words and N-grams modes. Increasing the dictionary size by a factor of 100 increases the running time by a factor of 1.5 for N-grams mode on the IAM dataset. However, when using N-grams + Forecast mode on the same dataset the running time increases by a factor of 15.7. Therefore the forecasting-modes are only feasible for small dictionaries, while the Words and N-grams modes can also be used with large dictionaries. Token passing is only evaluated for the smaller dictionary (because of its quadratic dependence on the dictionary size) created from the Te training-text, for which the algorithm takes 762 ms per sample for IAM and 1250 ms for Bentham. This proves the claim that WBS is faster than token passing when used with N-grams mode (which matches the type of LM integrated into token passing).

TABLE I
Experimental results given as CER (%) / WER (%) / time per sample (ms). LM training texts: training-set concatenated with word-list (Tr+L), test-set (Te). WBS scoring modes: Words (W), N-grams (N), N-grams + Forecast (N+F), N-grams + Forecast + Sample (N+F+S).

Algorithm | IAM, Tr+L | IAM, Te | Bentham, Tr+L | Bentham, Te
Best Path Decoding | 8.77 / 29.07 / 12 | 8.77 / 29.07 / 12 | 5.60 / 17.06 / 15 | 5.60 / 17.06 / 15
Token Passing | (not feasible) | 10.46 / 12.37 / 762 | (not feasible) | 8.16 / 9.24 / 1250
VBS, BW=15 | 8.48 / 28.24 / 56 | 8.27 / 27.34 / 64 | 5.55 / 16.39 / 69 | 5.35 / 16.02 / 63
VBS, BW=30 | 8.49 / 28.27 / 101 | 8.27 / 27.36 / 108 | 5.55 / 16.45 / 125 | 5.35 / 15.96 / 124
VBS, BW=50 | 8.49 / 28.27 / 168 | 8.27 / 27.32 / 184 | 5.55 / 16.55 / 210 | 5.35 / 16.06 / 202
WBS, mode=W, BW=15 | 8.95 / 24.19 / 90 | 5.62 / 11.01 / 85 | 5.47 / 14.09 / 77 | 4.22 / 7.90 / 56
WBS, mode=W, BW=30 | 8.44 / 23.77 / 145 | 5.08 / 10.42 / 140 | 5.18 / 13.92 / 156 | 3.90 / 7.60 / 104
WBS, mode=W, BW=50 | 8.25 / 23.67 / 229 | 4.86 / 10.15 / 217 | 5.12 / 13.87 / 289 | 3.73 / 7.50 / 182
WBS, mode=N, BW=15 | 10.00 / 23.88 / 83 | 5.33 / 9.77 / 56 | 6.15 / 13.85 / 99 | 4.07 / 7.08 / 74
WBS, mode=N+F, BW=15 | 8.61 / 22.86 / 16388 | 5.23 / 9.82 / 1040 | 6.76 / 18.00 / 24465 | 4.05 / 7.36 / 274
WBS, mode=N+F+S, BW=15 | 8.62 / 22.91 / 12226 | 5.21 / 9.78 / 786 | 6.75 / 18.06 / 18349 | 4.06 / 7.39 / 223

VI. CONCLUSION

A decoding algorithm for CTC-trained neural networks was proposed which restricts words to dictionary words, allows arbitrary character strings between words, can optionally integrate a word-level LM and is faster than token passing. It comes with four different scoring-modes which govern the effect of the LM and can also be used to control the running time. The algorithm was evaluated on the IAM and Bentham HTR datasets. Experiments have shown that the algorithm is able to outperform best path decoding, VBS and token passing for both an ideal and a rudimentary LM. The running time of the Words and N-grams modes is in the same order of magnitude as that of VBS. In case only a dictionary but no word-level LM is available, the Words mode, which constrains the words of the beam-labelings, is well suited.

ACKNOWLEDGMENTS

This work has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No 674943 (project READ).

REFERENCES

[1] A. Graves, Supervised sequence labelling with recurrent neural networks. Springer, 2012.
[2] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 369–376.
[3] K. Hwang and W. Sung, “Character-level incremental speech recognition with recurrent neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2016, pp. 5335–5339.
[4] A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber, “A novel connectionist system for unconstrained handwriting recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 5, pp. 855–868, 2009.
[5] S. Ortmanns, A. Eiden, H. Ney, and N. Coenen, “Look-ahead techniques for fast beam search,” Computer Speech & Language, vol. 14, no. 1, pp. 15–32, 2000.
[6] U. Marti and H. Bunke, “The IAM-database: an English sentence database for offline handwriting recognition,” International Journal on Document Analysis and Recognition, vol. 5, no. 1, pp. 39–46, 2002.
[7] J. Sánchez, V. Romero, A. Toselli, and E. Vidal, “ICFHR2014 competition on handwritten text recognition on transcriptorium datasets,” in 14th International Conference on Frontiers in Handwriting Recognition. IEEE, 2014, pp. 785–790.
[8] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in Proceedings of the 31st International Conference on Machine Learning, 2014, pp. 1764–1772.
[9] P. Brass, Advanced data structures. Cambridge University Press, Cambridge, 2008.
[10] J. Aoe, K. Morimoto, and T. Sato, “An efficient implementation of trie structures,” Software: Practice and Experience, vol. 22, no. 9, pp. 695–721, 1992.
[11] D. Jurafsky and J. Martin, Speech and Language Processing. Pearson, London, 2014.
[12] T. Bluche, “Deep Neural Networks for Large Vocabulary Handwritten Text Recognition,” Ph.D. dissertation, Université Paris Sud-Paris XI, 2015.
[13] B. Shi, X. Bai, and C. Yao, “An End-to-End Trainable Neural Network for Image-based Sequence Recognition and its Application to Scene Text Recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 2298–2304, 2016.
