Word Beam Search: A Connectionist Temporal Classification Decoding Algorithm
Abstract—Recurrent Neural Networks (RNNs) are used for sequence recognition tasks such as Handwritten Text Recognition (HTR) or speech recognition. If trained with the Connectionist Temporal Classification (CTC) loss function, the output of such an RNN is a matrix containing character probabilities for each time-step. A CTC decoding algorithm maps these character probabilities to the final text. Token passing is such an algorithm and is able to constrain the recognized text to a sequence of dictionary words. However, the running time of token passing depends quadratically on the dictionary size and it is not able to decode arbitrary character strings like numbers. This paper proposes word beam search decoding, which is able to tackle these problems. It constrains words to those contained in a dictionary, allows arbitrary non-word character strings between words, optionally integrates a word-level language model and has a better running time than token passing. The proposed algorithm outperforms best path decoding, vanilla beam search decoding and token passing on the IAM and Bentham HTR datasets. An open-source implementation is provided.

Index Terms—connectionist temporal classification, decoding, language model, recurrent neural network, speech recognition, handwritten text recognition
I. INTRODUCTION

Sequence recognition is the task of transcribing sequences of data with sequences of labels [1]. Well-known use-cases are Handwritten Text Recognition (HTR) and speech recognition. Graves et al. [2] introduce the Connectionist Temporal Classification (CTC) operation which enables neural network training from pairs of data and target labelings (text). The neural network is trained to output the labelings in a specific coding scheme. Decoding algorithms are used to calculate the final labeling. Hwang and Sung [3] present a beam search decoding algorithm which can be extended by a character-level Language Model (LM). Graves et al. [4] introduce the token passing algorithm, which constrains its output to a sequence of dictionary words and uses a word-level LM. The motivation to propose the Word Beam Search (WBS) decoding algorithm¹ is twofold:

• Vanilla Beam Search (VBS) decoding works on character-level and does not constrain its beams (text candidates) to dictionary words.
• The running time of token passing depends quadratically on the dictionary size [4], which is not feasible for large dictionaries as shown in Section V. Further, the algorithm does not handle non-word character strings. Punctuation marks and large numbers occur in the IAM and Bentham datasets, however, putting all possible combinations of these into the dictionary would enlarge it unnecessarily.

WBS uses a prefix tree that is created from a dictionary to constrain the words in the recognized text. Four different methods to score the beams by a word-level LM are proposed: (1) only constrain the beams by the dictionary, (2) score when a word is completely recognized, (3) forecast the score by calculating possible next words (Ortmanns et al. [5] use this idea in the context of hidden Markov models) and (4) forecast the score with a random sample of possible next words. Further, there is an operating state in which arbitrary non-word character strings are recognized. The proposed algorithm is able to outperform best path decoding, VBS and token passing on the IAM [6] and Bentham [7] datasets. Furthermore, the running time outperforms token passing.
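To make the dictionary constraint concrete, the following minimal Python sketch (with illustrative names that are not taken from the reference implementation) shows how a prefix tree built from a dictionary can be queried for the characters that may extend a partial word, and for whether a prefix already forms a complete word:

class PrefixTree:
    """Minimal prefix tree (trie) over a word dictionary (illustrative sketch)."""

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for c in word:
                node = node.setdefault(c, {})
            node['$'] = True  # '$' marks the end of a complete word

    def next_chars(self, prefix):
        """Return the characters that can extend prefix towards a dictionary word."""
        node = self.root
        for c in prefix:
            if c not in node:
                return set()
            node = node[c]
        return {c for c in node if c != '$'}

    def is_word(self, prefix):
        """Return True if prefix is a complete dictionary word."""
        node = self.root
        for c in prefix:
            if c not in node:
                return False
            node = node[c]
        return '$' in node

# Example: a beam whose last (partial) word is "th" may only be extended
# by characters that keep it on a path to a dictionary word.
tree = PrefixTree(["the", "than", "this", "to"])
print(tree.next_chars("th"))  # {'e', 'a', 'i'}
print(tree.is_word("the"))    # True

During decoding, such a query can limit which characters are considered when extending a beam that is currently inside a word; the beam extension itself and the four scoring modes are described in Section IV.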
The rest of the paper is organized as follows: in Section II, a brief introduction to CTC loss and CTC decoding is given. Then, prefix trees and LMs are discussed. Section IV presents the proposed algorithm. The evaluation compares the scoring modes of the algorithm and further compares the results with other decoding algorithms. Finally, the conclusion summarizes the paper.

¹ An open-source implementation is available at: https://github.com/githubharald/CTCWordBeamSearch

II. STATE OF THE ART

First, the CTC operation is discussed. Afterwards, two state-of-the-art decoding algorithms, namely VBS and token passing, are presented.

A. Connectionist Temporal Classification

A Recurrent Neural Network (RNN) outputs a sequence of length T with C + 1 character probabilities per sequence element, where C denotes the number of characters [4]. An additional pseudo-character is added to the RNN output which is called blank and is denoted by "-" in this paper. Picking one character per time-step from the RNN output and concatenating them forms a path π [4]. The probability of a path is defined as the product of all character probabilities on this path. A single character from a labeling is encoded by one or multiple adjacent occurrences of this character on the path, possibly followed by a sequence of blanks [4]. A way to visualize the valid encodings of a labeling is the finite state machine (FSM) shown in Fig. 1.
Fig. 1. FSM which produces valid paths (encodings) for the labeling “ab”.
The two left-most states are the initial states while the right-most state is the
final state.
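As a minimal illustration of this coding scheme (plain best path decoding, not the proposed WBS algorithm), the following Python sketch picks the most probable character per time-step, merges repeated characters and removes blanks; the toy probability matrix is invented for the example.

import numpy as np

BLANK = '-'

def collapse(path):
    """Map a path to its labeling: merge repeated characters, then drop blanks."""
    out = []
    prev = None
    for c in path:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return ''.join(out)

def best_path_decoding(mat, chars):
    """Pick the most probable character per time-step and collapse the resulting path."""
    path = ''.join(chars[i] for i in np.argmax(mat, axis=1))
    return collapse(path)

# Toy RNN output with T=4 time-steps over the characters ['a', 'b', '-'] (values invented).
chars = ['a', 'b', BLANK]
mat = np.array([[0.8, 0.1, 0.1],
                [0.6, 0.2, 0.2],
                [0.1, 0.1, 0.8],
                [0.2, 0.7, 0.1]])
# The best path is "aa-b"; its probability is the product 0.8 * 0.6 * 0.8 * 0.7,
# and collapsing it yields the labeling "ab" (compare the FSM in Fig. 1).
print(best_path_decoding(mat, chars))  # "ab"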
Results on the IAM and Bentham datasets. Each cell lists CER [%] / WER [%] / time [s].

                          IAM                                          Bentham
Algorithm                 Tr+L                   Te                    Tr+L                   Te
Best Path Decoding        8.77 / 29.07 / 12      8.77 / 29.07 / 12     5.60 / 17.06 / 15      5.60 / 17.06 / 15
Token Passing             (not feasible)         10.46 / 12.37 / 762   (not feasible)         8.16 / 9.24 / 1250
VBS, BW=15                8.48 / 28.24 / 56      8.27 / 27.34 / 64     5.55 / 16.39 / 69      5.35 / 16.02 / 63
VBS, BW=30                8.49 / 28.27 / 101     8.27 / 27.36 / 108    5.55 / 16.45 / 125     5.35 / 15.96 / 124
VBS, BW=50                8.49 / 28.27 / 168     8.27 / 27.32 / 184    5.55 / 16.55 / 210     5.35 / 16.06 / 202
WBS, mode=W, BW=15        8.95 / 24.19 / 90      5.62 / 11.01 / 85     5.47 / 14.09 / 77      4.22 / 7.90 / 56
WBS, mode=W, BW=30        8.44 / 23.77 / 145     5.08 / 10.42 / 140    5.18 / 13.92 / 156     3.90 / 7.60 / 104
WBS, mode=W, BW=50        8.25 / 23.67 / 229     4.86 / 10.15 / 217    5.12 / 13.87 / 289     3.73 / 7.50 / 182
WBS, mode=N, BW=15        10.00 / 23.88 / 83     5.33 / 9.77 / 56      6.15 / 13.85 / 99      4.07 / 7.08 / 74
WBS, mode=N+F, BW=15      8.61 / 22.86 / 16388   5.23 / 9.82 / 1040    6.76 / 18.00 / 24465   4.05 / 7.36 / 274
WBS, mode=N+F+S, BW=15    8.62 / 22.91 / 12226   5.21 / 9.78 / 786     6.75 / 18.06 / 18349   4.06 / 7.39 / 223