

CNN-BiLSTM model for English Handwriting Recognition:


Comprehensive Evaluation on the IAM Dataset
Firat Kizilirmak1* and Berrin Yanikoglu1,2*

arXiv:2307.00664v1 [cs.CV] 2 Jul 2023

1* Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Türkiye, 34956.
2 Center of Excellence in Data Analytics (VERIM), Istanbul, Türkiye, 34956.

*Corresponding author(s). E-mail(s): [email protected]; [email protected]

Abstract
We present a CNN-BiLSTM system for the problem of offline English handwriting recognition, with extensive evaluations on the public IAM dataset, including the effects of model size, data augmentation and the lexicon. Our best model achieves 3.59% CER and 9.44% WER using a CNN-BiLSTM network with a CTC layer. Test time augmentation, with rotation and shear transformations applied to the input image, is proposed to improve the recognition of difficult cases and is found to reduce the word error rate by 2.5 percentage points. We also conduct an error analysis of our proposed method on the IAM dataset, show hard cases of handwriting images, and explore samples with erroneous labels. We provide our source code as public domain, to foster further research and to encourage scientific reproducibility.

Keywords: offline handwriting; English; LSTM; deep learning

1 Introduction

Deep learning models have become the method of choice for the handwriting text recognition (HTR) problem, especially in the last decade [4, 7, 13, 27, 31, 39]. Most of the recent works in offline handwriting recognition have used Convolutional Neural Networks (CNNs) in combination with recurrent neural networks (RNNs), using the Connectionist Temporal Classification (CTC) loss function [14] for training the network. This approach allows the model to be end-to-end trainable without requiring explicit image-character alignment. The CTC algorithm maximizes the total probability of different segmentations over the output of the RNN. Moreover, the CTC algorithm makes a many-to-one alignment in which more than one input time frame can map to a single time frame at the output, allowing characters that do not fit in a single time step to be recognized.

Due to the success of attention-based approaches in sequence-related areas [2, 44], more recent works suggested the use of sequence-to-sequence models and proposed attention-based encoder-decoder architectures [22, 24, 31]. Among these, earlier methods combined a CNN with a bi-directional RNN as the encoder and a one-directional RNN with an attention mechanism as the decoder. More recent approaches utilized Transformer decoders [7] for decoding, or Vision Transformers [27] as a sequence-to-sequence architecture, to exploit their applicability to handwriting recognition.

Despite recent progress, handwriting recognition technologies fall short in recognizing challenging writing styles. In fact, there aren't large public
datasets to cover the wide variability in people's handwriting (some examples are given in Figure 1). This leads to a data sparseness problem in terms of capturing distinct writing styles. In addition, most of the existing datasets comprise historical handwriting images, making it impractical to train and evaluate a model for modern handwriting cases.

Fig. 1 The same text written by different people, showing the possible dissimilarities of handwriting. Examples from the IAM dataset.

There are two common approaches to alleviate the data sparseness: augmenting images at train time and generating synthetic handwriting images to pretrain the models [9, 27, 28, 31, 46]. In the former, handwriting images are transformed to imitate the same text as if it were written in a different style, by changing letter shapes while preserving readability. In the latter case, synthetic handwritten images are created; this approach has the advantage of being able to push the diversity of handwriting styles further. Both strategies are widely used for introducing different writing styles to increase the generalization capacity of models.

In this paper, we have (1) conducted a comprehensive analysis using deep learning models on the line-level IAM [29] dataset, (2) proposed a simple yet effective test time augmentation method, (3) provided insightful error analysis regarding dataset-related issues, (4) explored the state-of-the-art approaches with their pros and cons and stated directions for future work, and (5) made our training, evaluation and benchmarking code public (https://github.com/firatkizilirmakk/handwriting-recognition) for reproducibility.

The rest of the paper is organized as follows. First, the related work on offline handwriting recognition is presented. Then our proposed deep learning model is described, followed by the data augmentation and synthetic data generation phases, the test time augmentation method we propose, and the experiments we performed. Afterwards, an error analysis of our model on the IAM dataset and a comparison with the state-of-the-art approaches are given.

2 Related Work

Hidden Markov Models. The fundamental approach to handwriting recognition prior to the deep learning era was with Hidden Markov Models (HMMs) [35]. HMMs are doubly stochastic models, where an underlying stochastic process governs the state transitions and another one generates observations during state transitions. Formally, the aim is to maximize the posterior probability P(S | X), taking into account these stochastic models of state transitions and output generation [3].

Preliminary studies explored offline handwriting recognition with a sliding window approach for feature extraction and Hidden Markov Models for generating character or word sequences [5, 34]. They further supported their models with external language models and investigated the effects of the lexicon during decoding.

CTC Based Methods. The Connectionist Temporal Classification (CTC) method was introduced for the speech recognition task [14], allowing RNN models to be trained end-to-end with the backpropagation [45] algorithm for sequence classification, without any need for pre-segmented data. The method was later adopted for handwriting recognition [13]. In CTC based models, (1) a sequence of image features is extracted using a sliding window (1 pixel wide), (2) the extracted features are then fed to a bi-directional LSTM (BiLSTM) [18], and (3) character sequences are produced via the CTC layer. The CTC algorithm takes a sequence of probability distributions and generates a character sequence consisting of recognizable characters. These models significantly outperformed HMM and HMM-RNN based approaches. Later, Graves et al. applied multi-dimensional LSTM (MDLSTM) layers instead of BiLSTM to incorporate more context around letters and to obtain better transcriptions [15].
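As an illustration of the CTC-based training just described, the following minimal PyTorch sketch (our own simplification, not the authors' released code) shows how per-frame character distributions from a recurrent encoder are scored against an unsegmented target transcription with nn.CTCLoss; the blank symbol index, tensor shapes and lengths below are hypothetical.

    import torch
    import torch.nn as nn

    # Hypothetical sizes: T time frames, a batch of N line images, C symbols (blank at index 0).
    T, N, C = 240, 16, 80

    # Per-frame log-probabilities, e.g. the output of a CNN-BiLSTM followed by a linear
    # layer and log_softmax; nn.CTCLoss expects the shape (T, N, C).
    log_probs = torch.randn(T, N, C).log_softmax(dim=2)

    # Unsegmented target character indices (1..C-1), padded to 30 characters per line here.
    targets = torch.randint(low=1, high=C, size=(N, 30), dtype=torch.long)
    input_lengths = torch.full((N,), T, dtype=torch.long)    # frames per image
    target_lengths = torch.full((N,), 30, dtype=torch.long)  # characters per transcription

    ctc = nn.CTCLoss(blank=0, zero_infinity=True)
    loss = ctc(log_probs, targets, input_lengths, target_lengths)
    loss.backward()  # gradients flow through the whole encoder; no explicit alignment is needed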

With the rise of deep learning and the strong performance achieved by CNN models on image processing tasks [17, 25], researchers have considered deep learning methods for the handwriting recognition problem as well. Instead of using hand-crafted image features [13, 15], Shi et al. integrated a CNN network to produce more robust image features [39]. The method first processes the input handwriting image with the CNN, generates a sequence of image features, and passes them through BiLSTM-CTC layers to obtain the final transcription. Inspired by these, Bluche et al. proposed a gated convolutional model for computing more generic image features [4]. Puigcerver [37] showed the effectiveness of single-dimensional LSTM layers over multi-dimensional ones. Further, Dutta et al. [9] made a comprehensive study demonstrating the effects of data augmentation, pretraining and the use of a Spatial Transformer Network (STN) [20].

Sequence-to-Sequence Approaches. The CTC method has the drawback that it prevents the generation of sequences longer than the input sequence, which is a sequence of feature maps [22, 31]. As the feature maps get smaller, due to convolution and max pooling operations, the generated sequence becomes shorter, which in turn can result in missing transcriptions. Hence, attention-based sequence-to-sequence methods have been developed to overcome the shortcomings of the CTC and to leverage their sequence learning capabilities for handwriting recognition [22-24, 31]. The fundamental idea of these methods is to use a CNN-RNN (usually a BiLSTM) to encode the input image as a sequence of features, and then an attention-based RNN, usually an LSTM or a GRU, to decode the encoded sequence. The overall network is optimized with the cross entropy loss function, applied over each frame of the output sequence.

Among such approaches, Michael et al. [31] utilized the CNN-BiLSTM approach, where the CNN model learned to encode the handwriting image and the attention-based LSTM model learned to decode the encoded representation into a sequence of characters. The authors compared different attention mechanisms such as content-based, location-based and penalized attention, and further combined the CTC loss with the cross entropy loss to increase the model capabilities. Kass et al. [24] integrated a spatial transformer network at the front of their sequence-to-sequence model to reduce handwriting variations before feeding images to the rest of the architecture. Apart from these, Kang et al. [22] incorporated a character-level language model into the training phase, feeding the attention-based LSTM decoder with the concatenation of the encoder and language model outputs.

More recently, transformer [44] based models have been used, due to their substantial achievements on sequence-related tasks [7, 27]. Diaz et al. compared CTC and sequence-to-sequence approaches, experimented with the transformer decoder, and found the best performing model to be a self-attention encoder with CTC decoding. Li et al., on the other hand, employed the Vision Transformer (ViT) [8] as an encoder and a vanilla transformer decoder, taking advantage of a pretrained ViT and transformer decoder. They obtained state-of-the-art results and further showed the effectiveness of their model and pretraining scheme, without any post processing or external language model.

Data Augmentation & Synthetic Data Generation. In addition to model-related developments, most studies proposed solutions for dealing with the data sparseness issue. There are two common approaches: (1) applying data augmentation techniques at train time to introduce a broad range of handwriting styles, and (2) generating synthetic handwriting images to provide enough diverse handwriting styles to deep learning models.

Affine transformations such as rotation, scaling, and shearing are heavily applied and have been shown to be effective methods for mimicking handwriting styles [9, 36, 47]. More complex augmentation techniques have also been proposed. Wigington et al. [46] developed a distortion method along with a profile normalization technique to vary letter shapes, which in turn generates more discriminative letter styles. Further, Luo et al. [28] proposed to learn augmentation as a joint task during the training of the networks.

The second approach to reducing data sparseness involves generating synthetic handwritten samples. Most studies have generated their own synthetic data, either at word or line level, and experimentally shown its effectiveness. While some have synthesized a few million handwriting images using words or sentences from known, large corpora [9, 22, 47], others used
synthetic data generated for the scene-text recognition problem [24], or cropped lines from PDF files containing handwritten text [27]. However, though the above methods report performance increases, none of them have published their datasets, which leaves the problem unsolved for other researchers.

As for test time augmentation, there are only a few studies [9, 36, 46], which follow similar approaches. The idea is to augment the input image at the test phase and then generate a new transcription based on the outputs of the original image and of the augmented ones. Poznanski et al. [36] applied 36 different transformations to an image at test time, retrieved model outputs for these augmented images and the original one, and afterwards took the mean of these outputs as the final outcome. While Dutta et al. [9] followed the same approach as [36], Wigington et al. [46] employed 20 transformations, obtained the corresponding transcriptions, and picked the final one with respect to the lowest CTC loss value.

3 CNN-BiLSTM Model

Inspired by [39] and its later successors [4, 9, 37], we followed a similar deep learning architecture; more details can be found in [26]. The architecture consists of three fundamental blocks: feature extraction from the input image using convolutional layers (Section 3.1); processing the extracted deep features as a sequence using bidirectional LSTMs (Section 3.2); and producing the sequence of output characters with the help of CTC decoding (Section 3.3). These modules are described in detail below and the network structure is depicted in Figure 2.

Fig. 2 Our proposed deep learning network, consisting of CNN-BiLSTM models as the encoder and CTC as the decoder.
3.1 Feature Extraction

The network uses 12 convolutional layers with 3×3 kernels for feature extraction. Max pooling, with a 2×2 kernel, is applied twice in total: after the first two and after the following four convolution layer blocks. The convolution layers in these blocks have 32, 64 and 128 filters, respectively. A ReLU [32] activation function and batch normalization [19] are applied after each convolution for faster convergence and easier training. The CNN network produces a feature map tensor of the form D × H × W, where D corresponds to the number of output filters, or the depth of the feature maps, and H and W stand for the height and width of the feature maps, respectively. Then, we apply a max-pooling function over the height dimension and interchange the dimensions, resulting in a tensor of size W × D.

This output represents D-dimensional feature vectors for a sequence of length W. More formally, W could be written as W = [w1, w2, ..., wt], indicating a sequence of t time frames where wi ∈ R^D for i ∈ {1, 2, ..., t}.

We intentionally avoided using deep CNN networks in order not to narrow down the feature maps too much; otherwise, the CTC algorithm would only allow producing shorter sequences due to its constraints. Nonetheless, we conducted experiments with different numbers of convolution and max-pooling layers, and even tried well-known image feature extractors like the ResNet18 [17] architecture, to decide on the baseline model. We explain the outcomes of these experiments in Section 6.2.
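For concreteness, the sketch below gives a PyTorch backbone in the spirit of the description above; it is our own simplified reconstruction, not the authors' exact configuration. The split of the 12 convolutions into blocks of 2, 4 and 6 layers is an assumption (the text only states where the two pooling steps occur and the 32/64/128 filter counts), and the 100 × 960 input size follows Section 6.1.

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch, n_convs):
        # n_convs 3x3 convolutions, each followed by BatchNorm and ReLU
        layers = []
        for i in range(n_convs):
            layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, kernel_size=3, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
        return layers

    class Backbone(nn.Module):
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                *conv_block(1, 32, 2),  nn.MaxPool2d(2),   # first block: 2 convs, 32 filters
                *conv_block(32, 64, 4), nn.MaxPool2d(2),   # second block: 4 convs, 64 filters
                *conv_block(64, 128, 6),                   # remaining 6 convs, 128 filters (assumed split)
            )

        def forward(self, x):                  # x: (N, 1, H, W) grayscale line image
            f = self.features(x)               # (N, D=128, H', W')
            f = f.max(dim=2).values            # max-pool over the height dimension -> (N, D, W')
            return f.permute(0, 2, 1)          # (N, W', D): a width-indexed sequence of D-dim vectors

    seq = Backbone()(torch.randn(1, 1, 100, 960))
    print(seq.shape)   # torch.Size([1, 240, 128]) with the sizes assumed here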
3.2 Sequence Encoding

This phase processes the sequence of features produced by the feature extraction step. The idea is to incorporate the sequence learning capabilities of Recurrent Neural Networks to obtain sequences of features that represent the input handwritten text better.

Following [4, 9, 37, 39], we utilised a bi-directional LSTM model as well, leveraging learned contexts from both directions. We used two BiLSTM layers with 256 hidden nodes, without any dropout [42] applied. We fed the BiLSTM layers with the delivered sequence of features, W × D, and the BiLSTM produces a new tensor of size W × (2 × K), where K stands for the hidden dimension size of the BiLSTM layer, equal to 256. The produced tensor, a sequence of feature vectors, holds an encoded representation of the input handwriting image in a feature space.

Additionally, we explored bi-directional Gated Recurrent Unit (GRU) [6] based models and conducted experiments with different numbers of recurrent layers and hidden dimension sizes. Subsection 6.2 illustrates the effect of these experiments.
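A minimal PyTorch sketch of this sequence encoder is given below (two BiLSTM layers, 256 hidden units per direction, no dropout, as stated above); batch-first layout and the feature dimension of 128 are our assumptions for illustration.

    import torch
    import torch.nn as nn

    class SequenceEncoder(nn.Module):
        def __init__(self, feat_dim=128, hidden=256, layers=2):
            super().__init__()
            # bidirectional=True doubles the per-frame output size to 2*K = 512
            self.rnn = nn.LSTM(input_size=feat_dim, hidden_size=hidden,
                               num_layers=layers, bidirectional=True, batch_first=True)

        def forward(self, seq):          # seq: (N, W, D) from the CNN backbone
            out, _ = self.rnn(seq)       # out: (N, W, 2*K)
            return out

    encoded = SequenceEncoder()(torch.randn(1, 240, 128))
    print(encoded.shape)                 # torch.Size([1, 240, 512])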
3.3 CTC Decoding

After passing the input image through the first two phases, feature extraction and sequence encoding, the encoded representation of the input is obtained in the form W × (2 × K); formally W = [w1, w2, ..., wt] where wi ∈ R^H for i ∈ {1, 2, ..., t}. Here, H stands for 2 × K, a value of 512 in our case of K being set to 256.

Then, the encoded sequence representation, W × (2 × K), is mapped to a sequence of probability distributions over the alphabet of recognizable characters, resulting in W × C. To do so, we utilised two fully connected layers; the first one is followed by a ReLU activation and the second one outputs through C cells. Further, the SoftMax function is applied to produce a sequence of probability distributions defined over the alphabet. The CTC function then processes this output to label the sequence, allowing the whole network to be end-to-end trainable.

3.4 Word Level Decoding

We employed three different decoding methods at the test phase to obtain the final transcription from the model output, which is a sequence of probability distributions: greedy, beam search [12] and word beam search [38]. Greedy decoding is the simplest one, picking the most likely character at each time frame. Although it is intuitive to select the character with the highest probability, this is not guaranteed to generate the most likely transcription sequence.

Instead of selecting one character at a time, beam search keeps the k most likely transcriptions at each time frame and produces the transcription with the highest score among these k alternatives [12]. Word Beam Search (WBS), on the other hand, is an extension of beam search which constrains words to be in a given lexicon while still allowing non-word characters to occur between words [38]. Moreover, the algorithm incorporates a 2-gram word-level language model for scoring words during decoding. We used the WBS algorithm to output the final transcription; we set the beam width to 150 and configured word beam search to utilize the 2-gram language model.
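For reference, greedy (best-path) decoding of the per-frame distributions can be sketched as below, assuming the blank symbol sits at index 0; beam search and word beam search are more involved, and for WBS we rely on the implementation accompanying [38].

    import torch

    def greedy_decode(log_probs, charset, blank=0):
        # log_probs: (T, C) per-frame log-probabilities from the softmax layer
        best = log_probs.argmax(dim=1).tolist()      # most likely symbol per time frame
        chars, prev = [], blank
        for idx in best:
            if idx != blank and idx != prev:         # collapse repeated symbols, then drop blanks
                chars.append(charset[idx])
            prev = idx
        return "".join(chars)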
4 Data Augmentation & Synthetic Data Generation

4.1 Public-Domain Datasets

The handwriting recognition domain lacks high-quality, open-access datasets with a high number of samples, even though the problem has been studied for decades. There are a few common datasets that contain full sentences, namely IAM, Rimes, Bentham and Washington [10, 16, 29, 43], where the last two consist of only historical handwriting images, which are not suitable for modern cases.

The IAM dataset was first published in 1999 [29] and is still the most commonly used English handwriting dataset. It contains handwritten text in English written by 657 different people.
There are 1,539 scanned pages, partitioned into 10,373 labelled text lines, with 79 different characters. To the best of our knowledge, there is no other public dataset in English with non-historical, line-level handwriting images.

The Rimes dataset [16] consists of French handwriting and has long been used for evaluating the generalization abilities of models. However, we have not been able to access the dataset through official channels and websites, as also indicated at https://github.com/jpuigcerver/Laia/issues/44. Therefore, our experiments do not include results on this dataset.

Due to this data sparseness, current methods usually exploit data augmentation and synthetic data generation techniques. The fundamental idea is to alleviate the lack-of-data problem and present various handwriting styles for better generalization. Yet, these works have not published the synthetic datasets they have crafted, which in turn prevents a common evaluation protocol.

4.2 Data Augmentation

Although data augmentation for visual tasks has been explored extensively [40], only a couple of studies have proposed novel approaches to augment handwriting images [28, 41, 46], while others utilized common methods such as affine transformations [9, 36, 46, 47].

Following [9, 36, 47], we employed affine conversions including shear and rotation to generate natural looking handwriting images. Moreover, we applied elastic distortion [46] and geometric transformations [28] to further increase the handwriting variability. Augmented examples with these conversions are displayed in Figure 3.

Fig. 3 Original image along with the augmented samples. (a) Original image. (b) Images augmented by affine transforms: the first two lines are sheared, the last one is rotated. (c) Images with elastic distortion and perspective transform from the geometric conversions.

Shear is a linear affine transformation that takes a point (x, y) ∈ R^2 and maps it to (x + ky, y) ∈ R^2 in the horizontal case. Here k denotes the scaling factor determining how many pixels to move in the image. Rotation is another linear affine transformation, which rotates a point (x, y) in the counter-clockwise direction by θ degrees. Using these shear and rotation transformations makes it possible to mimic slanted handwriting styles, as illustrated in Figure 3 (b).

Elastic distortion, on the other hand, is a non-linear transformation that elastically changes the shape of a letter. We used the method in [41], where displacement fields ∆x and ∆y are sampled from a probability distribution. First, the x and y positions of the field are sampled from a uniform distribution between −1 and +1; then, these fields are convolved separately with a 1-dimensional Gaussian kernel of standard deviation σ; finally, the fields are multiplied by a value α that determines the distortion intensity. Overall, the method is able to distort handwriting letters using the σ and α parameters.

We also use geometric transformations, as proposed in [28], which transform an image by moving a set of predefined points to new positions. The movement of the points is determined via an agent network, and this transformation method is optimized jointly with the text recognition module to craft readable yet hard samples. We used the closed-form versions of their augmentations (publicly shared at https://github.com/Canjie-Luo/Text-Image-Augmentation), namely distortion, stretch and perspective, and grouped them under the name of geometric transformations.
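As a sketch of the shear and rotation augmentations, using the parameter ranges reported later in Section 6.3, one possible PIL/torchvision implementation is given below; this is our own illustration (elastic distortion and the geometric transformations follow [41] and the implementation shared for [28] and are omitted here). Note that torchvision's affine function expects a shear angle, so the shear factor k is converted.

    import random, math
    import torchvision.transforms.functional as TF

    def augment(img):
        # Apply at most one transformation per image, with probability 0.5 (see Section 6.3).
        if random.random() < 0.5:
            return img
        if random.random() < 0.5:
            k = random.uniform(-0.6, 0.6)                  # horizontal shear factor: (x + k*y, y)
            shear_angle = math.degrees(math.atan(k))       # torchvision expects an angle, not a factor
            return TF.affine(img, angle=0.0, translate=[0, 0], scale=1.0,
                             shear=[shear_angle, 0.0], fill=255)
        theta = random.uniform(-2.5, 2.5)                  # small rotations keep letters inside the image
        return TF.rotate(img, angle=theta, fill=255)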
These transformations were applied to the input image more than once, with different parameters, to generate distinct handwriting styles. We decided on the parameter ranges of the augmentation methods by visual inspection; we deformed the letter shapes to some degree while considering the readability of the handwriting.
During training, an input image undergoes a transformation with probability p = 0.5, which makes it practically impossible to introduce the same image twice.

We trained our base model with and without these augmentations in order to evaluate their effects. The outcomes of these experiments are given in detail in Section 6.3.

4.3 Synthetic Data Generation

Many state-of-the-art systems craft their own synthetic datasets following similar steps, such as employing texts from widely used large corpora [1, 11] or crawling text from the internet [22] to synthesize artificial handwriting images.

We combined WikiText-2 [30], Brown [11], and the training text of the IAM dataset (which corresponds to the LOB corpus [1]) as the corpus to generate images from. The texts were then preprocessed into a unified form with correctly tokenized words; sentences were tokenized into words and punctuation was removed. Sentences of at most 8 words are picked to generate lines synthetically. More than 200 handwriting TrueType fonts were collected from the internet and utilized to synthesize images. We took advantage of the trdg tool (https://github.com/Belval/TextRecognitionDataGenerator) to generate synthetic images using these corpora and fonts. Overall, we have generated almost 2.5 million synthetic handwriting line images; example images are illustrated in Figure 4.

Fig. 4 Examples of synthetically generated handwriting line images.
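The trdg tool provides this pipeline end to end; the sketch below is only a schematic PIL-based illustration of the rendering step and does not reproduce the trdg API. The corpus text, font paths, image height and font size are hypothetical.

    import random
    from PIL import Image, ImageDraw, ImageFont

    def render_line(words, font_path, height=100):
        # Render up to 8 corpus words as one synthetic handwriting-style line image.
        text = " ".join(words[:8])
        font = ImageFont.truetype(font_path, size=64)
        width = int(font.getlength(text)) + 20
        img = Image.new("L", (width, height), color=255)   # white background, grayscale
        ImageDraw.Draw(img).text((10, 10), text, fill=0, font=font)
        return img

    # Hypothetical usage: pick a random handwriting-style TrueType font per line.
    # fonts = ["fonts/handwriting_001.ttf", "fonts/handwriting_002.ttf"]
    # line = render_line("the quick brown fox jumps over the lazy dog".split(), random.choice(fonts))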
5 Test Time Augmentation

Based on our initial observation that a large majority of the character and word errors originate from a small portion of the text lines, coupled with the fact that text lines may be recognized better if they are not slanted or rotated, we hypothesized that test time augmentation would help.

For the offline handwriting recognition problem, the idea of test time augmentation is to apply transformations to an input handwriting image at the test phase, obtain transcriptions of the corresponding augmented images, and then decide on the best decoding. The idea was used before in [9, 36, 46]; however, we evaluated different methods for choosing or constructing the final recognition output, given the recognition outputs of the different augmentations. While it is a simple and effective idea, test time augmentation is not widely used or evaluated due to the added decoding time; however, the method could be preferable in the case of batch processing, where time is not too critical.

For test augmentation, we applied 16 different shear and rotation augmentations in total to the original image at test time and retrieved the transcriptions of both the augmented images and the original image. We then obtained the overall score by combining the optical score of the deep learning model, OPs, and a language model score, LMs, over the decoded transcription:

    score = (λ × OPs) + (ω × LMs)    (1)

where OPs is obtained from the WBS decoding algorithm and LMs denotes the language model
score obtained from the 4-gram language model defined below. The transcription with the highest score was then picked as the final outcome. Through the experiments, we fit the values of λ and ω.

We trained the language model as a 4-gram model on the WikiText-103 [30] corpus with Kneser-Ney smoothing [33], using the KenLM tool (https://github.com/kpu/kenlm). Before training the language model, we applied preprocessing steps including tokenization of the words, lowercasing the letters and removing punctuation.

Figure 5 illustrates the proposed method in a simplified fashion. For each image (original or transformed), the optical score produced by the model for the image and the language model score computed for the corresponding transcription are combined using the above-defined formula. The transcription with the highest final score (−7.98 for the example in the figure) is picked as the final output, which in this case is the correct decoding, obtained from the image rotated by 2.5 degrees.

Fig. 5 An example application of the transcription scoring. The output of each image (the original and two transformed versions) is scored using Equation 1 with λ = 0.4 and ω = 0.6; the optical score is obtained from the deep learning network and the language model score is computed using the 4-gram language model. The output with the highest final score is picked as the final output, which is the correct transcription of the handwriting.

Image                   Prediction                    Optical score   LM score   Final score
Original                oat it 's a good start .      -5.55           -17.77     -12.88
Sheared (-0.3)          bat it's a good start         -12.13          -15.87     -14.37
Rotated (2.5 degrees)   but it's a good start .       -4.50           -10.31     -7.98

In addition to the simplicity of this method, it is applicable to any network generating character sequences from handwriting images. Further, the method achieves superior results once the best decoding is selected with respect to the lowest error, as if an oracle were telling the truth (see the Oracle rows in Tables 7 and 8). This suggests that a better scoring function would yield greater success.
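A minimal sketch of this scoring step is given below, using the Python bindings of the KenLM toolkit; the candidate list, model path and normalization are hypothetical, and the weights λ = 0.4, ω = 0.6 are taken from the example in Figure 5.

    import kenlm

    # Hypothetical WBS transcriptions and their optical (log) scores for the original
    # image and its sheared/rotated variants, as in Figure 5.
    candidates = [
        ("oat it 's a good start .", -5.55),   # original image
        ("bat it's a good start",    -12.13),  # sheared
        ("but it's a good start .",  -4.50),   # rotated by 2.5 degrees
    ]

    lm = kenlm.Model("wikitext103_4gram.arpa")   # assumed path to the Kneser-Ney 4-gram model
    LAMBDA, OMEGA = 0.4, 0.6

    def normalize(text):
        # same preprocessing used for LM training: lowercase, drop pure punctuation tokens
        return " ".join(w for w in text.lower().split() if any(c.isalnum() for c in w))

    def final_score(transcription, optical_score):
        lm_score = lm.score(normalize(transcription), bos=True, eos=True)  # log10 probability
        return LAMBDA * optical_score + OMEGA * lm_score                   # Equation (1)

    best, _ = max(candidates, key=lambda c: final_score(*c))
    print(best)   # the transcription with the highest combined score is kept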

6 Experiments

We have run comprehensive evaluations covering the input scaling (6.1), the baseline architecture (6.2), data augmentation (6.3), pretraining with synthetic data (6.4), the effect of the lexicon, letter case and punctuation (6.5), and test time augmentation (6.6).

Throughout the experiments, the models were trained and evaluated on the corresponding Aachen splits of the IAM dataset. Training was set to last at most 200 epochs and was stopped after 10 epochs of non-decreasing validation loss. The initial learning rate was 1e-3 for most of the experiments, though it was tailored in some cases. The RMSprop optimizer was used with the weight decay set to 1e-5. The batch size was kept fixed at 16.

We assessed the performance of the models using the character error rate (CER) and word error rate (WER), which correspond to normalized Levenshtein distances between the predicted and ground truth character sequences. Though we utilized the three decoding methods, we share the CER and WER scores from the word beam search algorithm in the tables.
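For completeness, a minimal reference implementation of these normalized Levenshtein-distance metrics is sketched below (our own illustration, not the authors' evaluation script).

    def edit_distance(ref, hyp):
        # standard Levenshtein distance via dynamic programming over two sequences
        dp = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, dp[0] = dp[0], i
            for j, h in enumerate(hyp, 1):
                prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (r != h))
        return dp[-1]

    def cer(reference, hypothesis):
        # character error rate: edits over reference length
        return edit_distance(list(reference), list(hypothesis)) / len(reference)

    def wer(reference, hypothesis):
        # word error rate: same distance computed over whitespace-separated tokens
        return edit_distance(reference.split(), hypothesis.split()) / len(reference.split())

    print(cer("but it's a good start .", "oat it 's a good start ."))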

6.1 Input Scaling

Input images were resized to 100 × 960 before being fed to the recognition models. It is intuitive to keep the image height fixed and resize the width, preserving the aspect ratio; however, in our preliminary experiments we obtained worse performance with this approach and with other fixed image sizes. We also tried keeping the images as is and feeding them to the models one by one with a batch size of 1, yet this approach was neither effective nor efficient. Thus, we resized images to 100 × 960 and fed them through the deep learning network.

6.2 Model Experiments

Our preliminary experiments were designed to decide on the baseline model that would be used in the upcoming experiments. Inspired by [4, 9, 37, 39], we followed a similar CNN-RNN-CTC architecture, as explained in Section 3. We explored combinations of CNN and RNN models with different hyperparameters through the experiments.

Backbone. We first experimented with the number of convolution layers in the CNN backbone network. Keeping the rest of the layers fixed, we employed separate CNN networks with 8, 10, 12, and 14 convolution layers while applying 2 or 3 max pooling operations. In addition, a smaller ResNet [17] variant, ResNet18, was utilized to experiment with a deeper model compared to the previous ones. For the sequence encoding, a two-layered bi-directional LSTM network was used with a hidden dimension of 256. During these backbone and sequence encoding experiments, no data augmentation was applied, in order to measure the direct effect of these model choices.

Table 1 Impact of the backbone configuration on the IAM validation split.

Backbone          # Max Pooling   CER %   WER %
8 conv. layers    2               5.49    16.34
10 conv. layers   2               5.95    17.17
12 conv. layers   2               5.01    15.14
14 conv. layers   3               5.37    15.51
ResNet18          default         6.08    18.27

As displayed in Table 1, a 12-layer CNN network with two max pooling operations surpassed the other ones. Though there is no significant performance gap, models that were shallower or deeper scored slightly worse; therefore, we preferred continuing with the best-scoring model with the fewest parameters and built the baseline model on top of the 12-layer CNN network.

Sequence Encoding. After deciding on the backbone network, our next experiment was to select the sequence-encoder model. To this end, we evaluated the performance of bi-directional LSTM and GRU networks, varying the number of recurrent layers and the hidden layer sizes. Table 2 illustrates the model configurations and the corresponding error rates on the validation split of the IAM dataset.

Two recurrent layers fit this problem better, as seen in Table 2. Models competed with different hidden dimension sizes, yet there was no clear winner in this experiment. However, we intended to keep the model simple while preserving its capabilities.
Thus, we completed the baseline model with a 2-layered BiLSTM having a hidden dimension of 256, which outperformed the other networks.

The rest of the experiments were built upon this baseline model, consisting of a 12-layer CNN with two max pooling functions and a two-layered BiLSTM with 256 hidden dimensions, followed by a CTC layer.

Table 2 Performance of the sequence encoding models, BiLSTM and BiGRU, with different configurations on the IAM validation split. Here the backbone network is the one with 12 convolutional layers and 2 max-pooling operations, as displayed in Table 1.

RNN Model   # Layers   Hidden Dim. Size   CER %   WER %
BiLSTM      1          256                4.95    15.42
BiLSTM      1          512                5.26    15.99
BiLSTM      2          256                4.62    14.62
BiLSTM      2          512                4.93    14.93
BiLSTM      3          256                5.07    15.64
BiLSTM      3          512                5.24    16.14
BiGRU       1          256                5.27    16.54
BiGRU       1          512                4.90    15.48
BiGRU       2          256                5.01    15.14
BiGRU       2          512                4.85    14.94
BiGRU       3          256                5.05    15.53
BiGRU       3          512                5.70    16.33

6.3 Data Augmentation

After establishing the baseline model, we explored the effect of the data augmentation techniques described in Section 4. The transformations were applied either separately or in combination, as shown in Table 3, to assess the effect of each method. We performed only one conversion on an image, so as not to deform the handwriting too much, and it was applied with a probability p = 0.5.

We determined the parameters of the transformations by visually inspecting augmented images. For the shear transformation, k is sampled from a uniform distribution between −0.6 and +0.6, with a fixed random seed to apply the same conversions throughout all the experiments. The rotation angle was kept small, between −2.5 and +2.5 degrees, to mimic slanted handwriting while keeping letters inside the image borders. The elastic distortion variables were picked from the sets σ ∈ {3, 4} and α ∈ {15, 20}. Once the parameters were set, we trained the models using these augmentations and evaluated them without any transformation applied.

Table 3 Effect of train-time data augmentations, applied separately and in combination, using the baseline network defined above. For the combined augmentations, only one of the transformations is used on an image, to keep the readability of the handwriting while deforming the handwriting style. An augmentation is applied with a probability p = 0.5.

Augmentation Method                           CER %   WER %
Baseline from Table 2 (No Augmentation)       4.62    14.62
Shear                                         4.28    13.44
Rotate                                        4.67    14.06
Elastic                                       4.48    14.72
Geometric                                     4.52    14.29
Shear + Elastic + Geometric                   4.20    13.18
Shear + Rotate + Elastic                      4.17    13.21
Shear + Rotate + Elastic + Geometric          4.06    13.02

Shear conversion is the most convenient way to mimic handwriting; thus the model with only the shear augmentation obtained better scores compared to the other single-transformation experiments, as revealed in Table 3. Even though the other methods were not as effective on their own, their combination achieved better performance, due to generating more diverse handwriting styles, as observed over the validation set. Hence, we performed the next set of experiments using these augmentations at train time.

6.4 Pretraining with Synthetic Data

Generating synthetic handwriting images is another commonly used strategy to reduce the data sparseness and increase the generalization capacity of models. Section 4 describes in detail how we produced images from the utilized corpora using a publicly available image generation tool.

We first trained the baseline deep learning network on the generated data, consisting of almost 2.5 million synthetic handwriting images. We applied the data augmentation methods explained in Section 4 to the synthetic images in order to push the diversity further.
The synthetic dataset was not split into validation or test partitions; the model was directly trained on the whole dataset for 5 epochs.

Table 4 Results of our pretrained deep learning model on the IAM validation and test splits. Linear layer: the parameters of the output linear layers are updated while fine-tuning. All layers: all the parameters are updated.

                                          Validation          Test
Pretraining                               CER %   WER %       CER %   WER %
Baseline from Table 3 (No Pretraining)    4.06    13.02       5.20    14.86
Pretrained: Linear layer                  4.21    14.16       5.44    15.58
Pretrained: All layers                    3.88    12.71       5.05    14.46

We reduced the character errors by around 0.15% and the word errors by around 0.4% (from 5.20% to 5.05% CER and from 14.86% to 14.46% WER) by only pretraining the model on more than 2 million synthetic handwriting images. However, there is room for improvement once more synthetic and natural handwriting data is available for pretraining and fine-tuning. We performed the next set of experiments using this pretrained model.

Afterwards, the pretrained network was fine-tuned on the IAM train partition in two ways: (1) freezing all the parameters of the model except the output linear layers, and (2) updating all the parameters. The fine-tuned model was then evaluated on the IAM validation and test partitions, and the corresponding scores are given in Table 4.

Updating only the parameters of the final linear layers did not yield better scores, as expected, due to the difference in handwriting styles between the synthetic and natural images. Hence, we adopted pretraining of all layers in the remainder of the experiments. The results of this experiment are shown in Table 4.

6.5 Effect of Lexicon, Letter Case, and Punctuation

We decoded the model outputs using three different methods (greedy, beam search, and word beam search) and report the best results, obtained by the word beam search algorithm, in all result tables. We used the greedy and beam search methods as is; in other words, without adding an external language model or considering a lexicon during decoding. Word beam search, however, builds a prefix tree from a corpus, i.e. a lexicon, to keep decoded words in the lexicon. Therefore, we explored the effect of the lexicon size and the out-of-vocabulary rate on decoding performance.

We experimented with three different English corpora and their combinations for building the lexicon. The base lexicon consists of the texts from the IAM training partition, which is part of the LOB corpus [1]. On top of it, we appended the Brown corpus, a million-word English text corpus consisting of texts from 15 different categories. We also added the WikiText-2 corpus, comprising over two million English words. Further, we assessed their combination as well as the validation partition of the IAM dataset. The lexicon built upon the IAM validation split has a zero out-of-vocabulary rate, which is not possible in real scenarios, yet it underlines the effect of the lexicon.

Table 5 Reflection of the lexicon size, with the number of unique words (after tokenization, lowercasing and discarding punctuation) and out-of-vocabulary (OOV) rates, on the errors. B stands for the Brown corpus and W2 for the WikiText-2 corpus.

Lexicon                              # Words   OOV Rate   CER (%)   WER (%)
Baseline from Table 4 (IAM Train)    7.845     0.54       3.88      12.71
+ B                                  57.606    0.15       3.49      11.77
+ W2                                 60.244    0.21       3.56      11.92
+ B + W2                             82.449    0.13       3.46      11.70
+ IAM Validation                     2.251     0.00       2.2       6.35

The error rate depends on the coverage of the words in the lexicon. As the rate of out-of-vocabulary (OOV) words decreased, the error rates were reduced as well, as illustrated in Table 5. In a normal test scenario, it is not possible to contain all likely words in a lexicon, which is the IAM Test case in the table. Nonetheless, most of the time we obtained lower error rates when using larger lexicons. However, one needs to account for the trade-off between the running time with a bigger lexicon and the obtained performance. We preferred going with the lexicon consisting of the IAM train partition, Brown and WikiText-2, since there was no significant difference in running time compared to the smaller lexicons in Table 5.
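As an illustration of how the lexicon statistics in Table 5 can be derived, the sketch below counts the unique lexicon words and the out-of-vocabulary rate of a set of reference transcriptions after the same normalization (tokenization, lowercasing, punctuation removal). It is our own illustration; the file handling and token pattern are assumptions.

    import re

    def normalize(text):
        # tokenization, lowercasing and punctuation removal, as in the Table 5 caption
        return re.findall(r"[a-z0-9']+", text.lower())

    def lexicon_stats(corpus_texts, reference_lines):
        lexicon = set()
        for text in corpus_texts:
            lexicon.update(normalize(text))
        ref_words = [w for line in reference_lines for w in normalize(line)]
        oov = sum(w not in lexicon for w in ref_words)
        return len(lexicon), oov / len(ref_words)   # (# unique words, OOV rate)

    # Hypothetical usage, e.g. IAM training text plus the Brown corpus against the test labels:
    # n_words, oov_rate = lexicon_stats([iam_train_text, brown_text], iam_test_lines)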

We decoded and compared the transcriptions with the ground truth text in a case-sensitive manner, also counting punctuation errors. However, some works in the literature report that they generate case-insensitive text without any punctuation [9, 46], while others do not specify this. As a matter of fact, the IAM dataset contains sentences with several quotes and punctuation marks around them. As this is an important distinction, we performed experiments considering both strategies, with results given in Table 6. However, unless stated otherwise explicitly, we report all the errors measured with case-sensitive letters and punctuation throughout the tables.

Table 6 Effect of the letter case and the punctuation when decoding over the validation split of the IAM dataset. Decoding up to this experiment involved case-sensitive letters with punctuation.

Case & Punctuation                          CER (%)   WER (%)
Baseline (Case Sensitive + Punctuation)     3.46      11.70
Case Insensitive + Punctuation              3.16      10.58
Case Insensitive + No Punctuation           2.87      8.27

6.6 Test Time Augmentation

Test time augmentation is a simple and effective technique which offers a feasible time complexity for batch processes. We described our method of using test time augmentation to obtain better transcriptions in Section 5.

Once the final transcription is obtained, it is compared with the ground truth character sequence, and error scores in terms of the CER and WER metrics are measured. To gain further insight, we also assessed the proposed test time augmentation method while ignoring errors involving case or punctuation mistakes.

Table 7 Validation and test scores with and without test time augmentation, with case-sensitive recognition and punctuation counted. Oracle indicates the case where the final transcription was selected so as to minimize the character error, among all transcriptions obtained with test augmentation.

Test Time        Validation            Test
Augmentation     CER (%)   WER (%)     CER (%)   WER (%)
-                3.46      11.70       4.80      13.85
✓                3.22      10.63       4.37      12.03
Oracle           2.28      8.54        3.02      9.35

Table 7 shows that applying test time augmentation reduced the character errors by almost 0.5% and the word errors by around 2% when decoding with case-sensitive letters and punctuation. The error rate decreases further (almost 1% for character errors and 2.5% for word errors) when decoding with case-insensitive letters and without punctuation, as displayed in Table 8.

Table 8 Validation and test scores with and without test time augmentation, considering case-insensitive letters and ignoring punctuation errors.

Test Time        Validation            Test
Augmentation     CER (%)   WER (%)     CER (%)   WER (%)
-                3.38      11.29       4.38      12.00
✓                2.84      9.05        3.59      9.44
Oracle           1.61      6.42        2.63      7.74

In addition to our scoring function, we also performed test time augmentation and selected the final transcription with respect to the lowest CER score among all transcriptions. We named this method Oracle since it knows the error in advance, through the comparison with the ground truth sequence, which is not possible in real life. However, our aim was to show how much gain is possible with this Oracle method. As Tables 7 and 8 indicate, there is room for improvement once a better scoring function is employed. Moreover, this strategy is applicable to any model generating character sequences from handwriting images, which makes it worth considering for other models as well.
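The Oracle rows in Tables 7 and 8 correspond to a selection rule that can be sketched in a couple of lines (an upper bound only, since it uses the ground truth and is therefore not applicable at deployment); the cer function is the one defined in Section 6.

    def oracle_pick(transcriptions, ground_truth, cer):
        # among the transcriptions of the original and augmented images,
        # keep the one with the lowest character error rate against the reference
        return min(transcriptions, key=lambda hyp: cer(ground_truth, hyp))

    # e.g. oracle_pick(["oat it 's a good start .", "but it's a good start ."],
    #                  "but it's a good start .", cer)  ->  "but it's a good start ."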


We also analyzed the running times together with the decoding of the original images. We ran our proposed methods on two different machines with the following configurations: (1) AMD Ryzen 3970X, 64 GB RAM, with a GeForce RTX 3090; (2) Intel i7 10700K, 32 GB RAM, with a GeForce RTX 3080. The time taken for each image was measured, and its average and standard deviation over all the samples in the IAM test partition were computed for both of our methods, as shared in Table 9. Even though the test time augmentation method slows down obtaining the final transcription, it could be preferred for batch processing of documents when there are no real-time requirements. Additionally, a better scoring function would reveal greater success without needing more model parameters or more data.

Table 9 Running times of the test time augmentation method on two different machines, with CPU and GPU options. Mean and standard deviation of the elapsed time over the IAM test partition.

                               Original Time (s)       With Test Time Aug. Time (s)
Machine   Processing Device    Mean    Std.            Mean    Std.
PC 1      CPU                  2.58    1.03            21.46   4.08
PC 1      GPU                  2.27    0.98            19.33   3.84
PC 2      CPU                  3.64    1.12            25.54   4.56
PC 2      GPU                  3.28    1.02            24.76   4.21

7 Error Analysis

Up to this point, we measured system performance in terms of the CER and WER metrics and computed their mean over the IAM test partition. Once we examined the erroneous cases, we found that a lot of the errors are contributed by a small number of text lines.

In this section, we provide an analysis of the error distribution; explore difficult samples along with possible ways to overcome these challenging handwriting samples; and share mislabelled example lines from the IAM dataset. We analyzed the outputs of the best models without the test time augmentation, for simplicity.

One can see in Figure 6 that in 42% of all the lines, all the characters were recognized correctly, and in 80.89% of all the images in the IAM test set there were at most 3 character mistakes. Note that the dataset has an average of 35.2 characters per line image, with a standard deviation of 8.72. This is a rather surprising statistic, which we believe can indicate certain future research directions; based on this observation, we focused on test time augmentation, for example. Similarly, the model produced at most one incorrect word in 68.75% of all the samples in the test split, where the mean number of words per line is 8.90.

Some of the most challenging writing samples are given in Figure 7. The figure shows examples from the IAM test dataset where our model made at least five character errors, which corresponds to 7% of the test set samples. Note that these handwritings are difficult for people as well.

Furthermore, the IAM dataset contains partially or totally wrong ground truth labels, which in fact affects both the training and testing phases. Therefore, we found and corrected the erroneous cases on the IAM test split in order to evaluate better. Example erroneous lines along with their issues are presented in Appendix A. We will share the new, fixed metadata of the test split at https://github.com/firatkizilirmakk/handwriting-recognition.

8 Comparison with the State of the Art

Our approach is compared to state-of-the-art methods with varying characteristics, as shown in Table 10. However, it should be noted that a direct comparison of the models may not be very meaningful, due to differences or omissions in training or testing criteria.
Table 10 Comparison of our approach with the state-of-the-art approaches. The architecture column lists the encoder and decoder. LM & Lexicon denotes the language model and the lexicon used (if any) during decoding. Decoding indicates which method is used to decode network outputs into a character sequence; WBS denotes the word beam search algorithm, bw the beam width, and TTA test time augmentation. #P is the number of trainable parameters (in millions) of the deep learning networks. CaseInsens. stands for case-insensitive decoding and NoPunct. for decoding without considering punctuation.

Authors        Encoder / Decoder        Train Set           LM & Lexicon      Decoding                              #P      CER %   WER %
Bluche [4]     GCRNN / CTC              IAM + Multi.        7-gram LM + 50K   -                                     0.75    3.2     10.5
Xiao [47]      CNN + LSTM / CTC         IAM + Synth.        Brown + LOB       WBS bw/150                            -       3.03    8.66
Michael [31]   CNN + LSTM / LSTM w. Att.  IAM               -                 Beam bw/16                            -       4.87    -
Kang [21]      Transformer              IAM + Synth.        -                 Greedy                                100     4.67    15.45
Li [27]        Transformer              IAM + Synth.        -                 Beam bw/10                            334     3.42    -
Li [27]        Transformer              IAM + Synth.        -                 Beam bw/10                            558     2.89    -
Diaz [7]       Self Att. + CTC          Public              9-gram LM         Greedy                                ~12.5   3.15    -
Diaz [7]       Self Att. + CTC          Internal + Public   9-gram LM         Greedy                                ~12.5   2.75    -
Ours           CNN + LSTM / CTC         IAM + Synth.        Brown + W102      WBS bw/150                            ~3.6    4.80    13.85
Ours           CNN + LSTM / CTC         IAM + Synth.        Brown + W102      WBS + TTA                             ~3.6    4.37    12.03
Ours           CNN + LSTM / CTC         IAM + Synth.        Brown + W102      WBS + TTA + CaseInsens. + NoPunct.    ~3.6    3.59    9.44


Fig. 6 Cumulative character and word error rates obtained with the baseline model on the IAM dataset. (a) Cumulative character errors; for example, all the characters are predicted correctly in 42% of all the lines, which corresponds to 1,226 line images, and one or at most two characters are predicted incorrectly in 560 line images, corresponding to 17% of all the lines. (b) Cumulative word error rates, presented in the same manner.

Our model is able to compete with the similar approaches proposed in [4, 47]. However, we were not able to fully duplicate their results due to differences including (1) the extra data they used in training, (2) the language model they employed to support the deep learning network output, and (3) the lexicon their methods select decoded words from. Yet, our method achieves a close performance to these works, with an open and comprehensive evaluation.

Others exploited attention-based approaches to make use of their advantages for sequence learning. In addition to the deep learning architectures, they generated millions of synthetic handwriting images and/or collected real handwriting images from the web, which is another factor in their success. As summarized in Table 10, Li et al. [27] pretrained a Transformer model on their synthetic dataset, fine-tuned it on IAM, and achieved superior results without applying any post processing method, e.g. a language model or a lexicon. Diaz et al. [7] obtained state-of-the-art performance with a simpler model with a considerably smaller number of parameters, trained on datasets they collected, and supported the outputs with a 9-gram language model.

These works suggest that the crucial part of a successful handwriting recognition system is to utilize a large amount of high quality data, either synthetic or real. Once the data requirement is satisfied, it is better to employ state-of-the-art deep learning networks such as the Transformer, the Vision Transformer or other models with attention mechanisms. Next, the final output can be decided with the help of an n-gram language model to further increase the success.

9 Summary and Conclusions

We proposed a CNN-BiLSTM model for offline English handwriting recognition and comprehensively evaluated it with regard to model architecture, data augmentation, synthetic data generation, and data representation. We have also suggested a simple yet effective post processing method utilizing test time augmentation. Even though the method takes more time compared to decoding a single image, it could be preferable in batch processes that do not have time constraints, and it is applicable to any model producing a sequence of characters from images.

In addition to the usual metrics for assessing handwriting recognition models, we analyzed the error distributions to gain insights regarding the challenges. We have shown that the majority of the errors stem from a relatively small portion of the test cases, which can be instructive for improvements.

We will share our code openly upon publication, to allow for other researchers, as well as for reproducibility.

Fig. 7 Difficult samples from the IAM test partition. Samples for which the model made more than 5 character errors (CER > 5) are not easy for people to read either.

True: with Sir John ? " she enquired cuttingly .
Prediction: wits Sir Son ? " She enquired cutling ly .    (CER: 6.0, WER: 5.0)

True: got hotter as the day wore on and we rested
Prediction: got hater as the day we on and he rastled    (CER: 7.0, WER: 4.0)

True: carrying it into effect , and a subordinate
Prediction: crying ie into offeree , anda puler!inte    (CER: 13.0, WER: 6.0)

True: became great sighs of ecstacy
Prediction: become a.cat sials a ecras    (CER: 14.0, WER: 6.0)

True: waving , unkissed , from the window . And Dai , on the pavement , knowing in his
Prediction: morsing , hired , from the window . rd dean , on the panime"t . knowing in his    (CER: 18.0, WER: 6.0)

References

[1] Lancaster-Oslo-Bergen corpus of modern English (LOB): [tagged, horizontal format] / Stig Johansson, http://hdl.handle.net/20.500.12024/0167, Oxford Text Archive

[2] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate (2014). https://doi.org/10.48550/ARXIV.1409.0473, https://arxiv.org/abs/1409.0473

[3] Bahl, L., Brown, P., de Souza, P., Mercer, R.: Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: ICASSP '86, IEEE International Conference on Acoustics, Speech, and Signal Processing. vol. 11, pp. 49-52 (1986). https://doi.org/10.1109/ICASSP.1986.1169179

[4] Bluche, T., Messina, R.: Gated convolutional recurrent neural networks for multilingual handwriting recognition. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). vol. 01, pp. 646-651 (2017). https://doi.org/10.1109/ICDAR.2017.111

[5] Bunke, H., Bengio, S., Vinciarelli, A.: Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(6), 709-720 (2004). https://doi.org/10.1109/TPAMI.2004.14


[6] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014). https://doi.org/10.48550/ARXIV.1406.1078, https://arxiv.org/abs/1406.1078

[7] Diaz, D.H., Qin, S., Ingle, R., Fujii, Y., Bissacco, A.: Rethinking text line recognition models (2021). https://doi.org/10.48550/ARXIV.2104.07787, https://arxiv.org/abs/2104.07787

[8] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale (2020). https://doi.org/10.48550/ARXIV.2010.11929, https://arxiv.org/abs/2010.11929

[9] Dutta, K., Krishnan, P., Mathew, M., Jawahar, C.: Improving CNN-RNN hybrid networks for handwriting recognition. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR). pp. 80-85 (2018). https://doi.org/10.1109/ICFHR-2018.2018.00023

[10] Fischer, A., Keller, A., Frinken, V., Bunke, H.: Lexicon-free handwritten word spotting using character HMMs. Pattern Recognition Letters 33(7), 934-942 (2012), https://www.sciencedirect.com/science/article/pii/S0167865511002820, special issue on Awards from ICPR 2010

[11] Francis, W.N., Kucera, H.: Brown corpus manual. Tech. rep., Department of Linguistics, Brown University, Providence, Rhode Island, US (1979), http://icame.uib.no/brown/bcm.html

[12] Graves, A.: Sequence transduction with recurrent neural networks (2012). https://doi.org/10.48550/ARXIV.1211.3711, https://arxiv.org/abs/1211.3711

[13] Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5), 855-868 (2009). https://doi.org/10.1109/TPAMI.2008.137

[14] Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5), 855-868 (2009). https://doi.org/10.1109/TPAMI.2008.137

[15] Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimensional recurrent neural networks. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems, vol. 21. Curran Associates, Inc. (2008), https://proceedings.neurips.cc/paper/2008/file/66368270ffd51418ec58bd793f2d9b1b-Paper.pdf

[16] Grosicki, E., Carré, M., Brodin, J.M., Geoffrois, E.: Results of the RIMES evaluation campaign for handwritten mail processing. In: 2009 10th International Conference on Document Analysis and Recognition. pp. 941-945 (2009). https://doi.org/10.1109/ICDAR.2009.224

[17] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. pp. 770-778 (2016). https://doi.org/10.1109/CVPR.2016.90

[18] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9, 1735-1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735

[19] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37. pp. 448-456. ICML'15, JMLR.org (2015)
Springer Nature 2021 LATEX template

[20] Jaderberg, M., Simonyan, K., Zisserman, A., https://doi.org/10.48550/ARXIV.2109.10282,


Kavukcuoglu, K.: Spatial transformer net- https://arxiv.org/abs/2109.10282
works. In: Proceedings of the 28th Inter-
national Conference on Neural Information [28] Luo, C., Zhu, Y., Jin, L., Wang, Y.: Learn
Processing Systems - Volume 2. p. 2017–2025. to augment: Joint data augmentation and
NIPS’15, MIT Press, Cambridge, MA, USA network optimization for text recognition.
(2015) 2020 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR) pp.
[21] Kang, L., Riba, P., Rusiñol, M., Fornés, A., 13743–13752 (2020)
Villegas, M.: Pay attention to what you read:
Non-recurrent handwritten text-line recogni- [29] Marti, U.V., Bunke, H.: The iam-
tion. Pattern Recognition 129, 108766 (2022) database: an english sentence database
for offline handwriting recognition. Inter-
[22] Kang, L., Riba, P., Villegas, M., Fornés, national Journal on Document Analysis
A., Rusiñol, M.: Candidate fusion: Integrat- and Recognition 5(1), 39–46 (Nov 2002).
ing language modelling into a sequence-to- https://doi.org/10.1007/s100320200071,
sequence handwritten word recognition archi- https://doi.org/10.1007/s100320200071
tecture. Pattern Recognition 112, 107790
(2021) [30] Merity, S., Xiong, C., Bradbury, J., Socher,
R.: Pointer sentinel mixture models (2016).
[23] Kang, L., Toledo, J., Riba, P., Villegas, M., https://doi.org/10.48550/ARXIV.1609.07843,
Fornés, A., Rusiñol, M.: Convolve, attend https://arxiv.org/abs/1609.07843
and spell: An attention-based sequence-to-
sequence model for handwritten word recog- [31] Michael, J., Labahn, R., Gruning, T.,
nition: 40th german conference, gcpr 2018, Zollner, J.: Evaluating sequence-to-
stuttgart, germany, october 9-12, 2018, pro- sequence models for handwritten text
ceedings. pp. 459–472 (01 2019) recognition. pp. 1286–1293 (09 2019).
https://doi.org/10.1109/ICDAR.2019.00208
[24] Kass, D., Vats, E.: Attentionhtr: Handwritten
text recognition based on attention encoder- [32] Nair, V., Hinton, G.E.: Rectified linear units
decoder networks. In: Uchida, S., Barney, E., improve restricted boltzmann machines. In:
Eglin, V. (eds.) Document Analysis Systems. Proceedings of the 27th International Confer-
pp. 507–522. Springer International Publish- ence on International Conference on Machine
ing, Cham (2022) Learning. p. 807–814. ICML’10, Omnipress,
Madison, WI, USA (2010)
[25] Krizhevsky, A., Sutskever, I., Hinton, G.E.:
Imagenet classification with deep convolu- [33] Ney, H., Essen, U., Kneser, R.: On struc-
tional neural networks. In: Proceedings of turing probabilistic dependences in stochas-
the 25th International Conference on Neu- tic language modelling. Computer Speech &
ral Information Processing Systems - Volume Language 8(1), 1–38 (1994)
1. p. 1097–1105. NIPS’12, Curran Associates
Inc., Red Hook, NY, USA (2012) [34] Pechwitz, M., Maergner, V.: Hmm based
approach for handwritten arabic word
[26] Kızılırmak, F.: Offline Handwriting Recogni- recognition using the ifn/enit - database.
tion using Deep Learning with Emphasis on In: Seventh International Conference on
Data Augmentation Effects. Master’s thesis, Document Analysis and Recognition,
Sabanci University (2022) 2003. Proceedings. pp. 890–894 (2003).
https://doi.org/10.1109/ICDAR.2003.1227788
[27] Li, M., Lv, T., Cui, L., Lu, Y., Floren-
cio, D., Zhang, C., Li, Z., Wei, F.: Trocr: [35] Ploetz, T., Fink, G.: Markov models
Transformer-based optical character recog- for offline handwriting recognition: A
nition with pre-trained models (2021). survey. IJDAR 12, 269–298 (12 2009).
Springer Nature 2021 LATEX template

19

https://doi.org/10.1007/s10032-009-0098-4 [43] Sánchez, J.A.: Bentham


dataset r0 (Jan 2016).
[36] Poznanski, A., Wolf, L.: Cnn-n-gram for https://doi.org/10.5281/zenodo.44519,
handwriting word recognition. In: 2016 IEEE https://doi.org/10.5281/zenodo.44519
Conference on Computer Vision and Pattern
Recognition (CVPR). pp. 2305–2314 (2016). [44] Vaswani, A., Shazeer, N., Parmar, N., Uszko-
https://doi.org/10.1109/CVPR.2016.253 reit, J., Jones, L., Gomez, A.N., Kaiser, L.,
Polosukhin, I.: Attention is all you need.
[37] Puigcerver, J.: Are multidimensional In: Proceedings of the 31st International
recurrent layers really necessary for Conference on Neural Information Process-
handwritten text recognition? In: 2017 ing Systems. p. 6000–6010. NIPS’17, Curran
14th IAPR International Conference on Associates Inc., Red Hook, NY, USA (2017)
Document Analysis and Recognition
(ICDAR). vol. 01, pp. 67–72 (2017). [45] Werbos, P.: Backpropagation through time:
https://doi.org/10.1109/ICDAR.2017.20 what it does and how to do it. Proceed-
ings of the IEEE 78(10), 1550–1560 (1990).
[38] Scheidl, H., Fiel, S., Sablatnig, R.: Word https://doi.org/10.1109/5.58337
beam search: A connectionist temporal clas-
sification decoding algorithm. In: 2018 16th [46] Wigington, C., Stewart, S., Davis, B., Bar-
International Conference on Frontiers in rett, B., Price, B., Cohen, S.: Data aug-
Handwriting Recognition (ICFHR). pp. 253– mentation for recognition of handwritten
258 (2018). https://doi.org/10.1109/ICFHR- words and lines using a cnn-lstm network.
2018.2018.00052 In: 2017 14th IAPR International Confer-
ence on Document Analysis and Recogni-
[39] Shi, B., Bai, X., Yao, C.: An end-to-end tion (ICDAR). vol. 01, pp. 639–645 (2017).
trainable neural network for image-based https://doi.org/10.1109/ICDAR.2017.110
sequence recognition and its application
to scene text recognition. IEEE Transac- [47] Xiao, S., Peng, L., Yan, R., Wang, S.:
tions on Pattern Analysis and Machine Deep network with pixel-level rectifica-
Intelligence 39(11), 2298–2304 (2017). tion and robust training for handwriting
https://doi.org/10.1109/TPAMI.2016.2646371 recognition. In: 2019 International Con-
ference on Document Analysis and
[40] Shorten, C., Khoshgoftaar, T.: A survey on Recognition (ICDAR). pp. 9–16 (2019).
image data augmentation for deep learn- https://doi.org/10.1109/ICDAR.2019.00012
ing. Journal of Big Data 6 (07 2019).
https://doi.org/10.1186/s40537-019-0197-0

[41] Simard, P., Steinkraus, D., Platt, J.:


Best practices for convolutional neural
networks applied to visual document anal-
ysis. In: Seventh International Conference
on Document Analysis and Recognition,
2003. Proceedings. pp. 958–963 (2003).
https://doi.org/10.1109/ICDAR.2003.1227801

[42] Srivastava, N., Hinton, G., Krizhevsky, A.,


Sutskever, I., Salakhutdinov, R.: Dropout:
A simple way to prevent neural networks
from overfitting. Journal of Machine Learn-
ing Research 15(56), 1929–1958 (2014), http:
//jmlr.org/papers/v15/srivastava14a.html

Appendix A Incorrect IAM Test Samples

Label: 12 o'clock at night . At that time pastries and buns

Issue: Crossed-out 'hours'; another word appears under the line

Label: interest to the Mos , doubtless because their habitat is

Issue: Crossed-out word before 'interest'

Label: Stepping out in their white shorts , they looked glad enough

Issue: 'glad' is not present

Label: was . I 'm Ian Bawley . Does that mean anything

Issue: Labelled as "I 'm" although the writer wrote "I'am"

Label: Sentence Database P02-109

Issue: Completely wrong except for the last few letters

Label: know how you newspaper people value that commodity !-

Issue: 'know' is not present

Label: -----------------------------------------------------

Issue: Completely wrong.

Fig. A1 Samples from lines in the IAM dataset test partition with wrong labels. Label denotes the ground-truth transcription provided for the image and Issue indicates the error in that label, e.g., a word in the label is not present in the corresponding image. The corrected labels of the IAM test split will be shared upon acceptance.
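Because a handful of test lines carry erroneous ground truth, corpus-level error rates computed on the raw split may slightly overstate the true error. As a hedged sketch of how such lines could be excluded when aggregating the character error rate, the snippet below reuses the levenshtein helper from the earlier sketch; the line identifiers and the labels/predictions dictionaries are placeholders, not actual IAM line ids or our evaluation code.

# Placeholder ids; the real identifiers would be those of the lines in Fig. A1.
bad_line_ids = {"line-id-1", "line-id-2"}

def corpus_cer(labels, predictions, exclude=frozenset()):
    """Aggregate CER (%) over {line_id: text} dicts, skipping excluded ids."""
    total_edits, total_chars = 0, 0
    for line_id, ref in labels.items():
        if line_id in exclude:
            continue                     # drop lines with wrong ground truth
        total_edits += levenshtein(ref, predictions[line_id])
        total_chars += len(ref)
    return 100.0 * total_edits / max(total_chars, 1)

Passing exclude=bad_line_ids (or, alternatively, substituting the corrected transcriptions for the wrong ones) gives a score that is not penalized by labelling errors in the test set.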
