CNN-BiLSTM Model for English Handwriting Recognition: Comprehensive Evaluation on the IAM Dataset
arXiv:2307.00664v1
1* Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Türkiye, 34956.
2* Center of Excellence in Data Analytics (VERIM), Istanbul, Türkiye, 34956.
Abstract

We present a CNN-BiLSTM system for the problem of offline English handwriting recognition, with extensive evaluations on the public IAM dataset, including the effects of model size, data augmentation and the lexicon. Our best model achieves 3.59% CER and 9.44% WER using a CNN-BiLSTM network with a CTC layer. Test time augmentation with rotation and shear transformations applied to the input image is proposed to improve recognition of difficult cases, and is found to reduce the word error rate by 2.5 percentage points. We also conduct an error analysis of our proposed method on the IAM dataset, show hard cases of handwriting images and explore samples with erroneous labels. We provide our source code as public domain, to foster further research and to encourage scientific reproducibility.
datasets to cover the wide variability in people's handwritings (some examples are given in Figure 1). This leads us to a data sparseness problem in terms of capturing distinct writing styles. In addition, most of the existing datasets comprise historical handwriting images, making it impractical to train and evaluate a model for modern handwriting cases.

Fig. 1 The same text written by different people, showing the possible dissimilarities of handwriting. Examples from the IAM dataset.

There are two common approaches to alleviate the data sparseness: augmenting images at train time and generating synthetic handwriting images to pretrain the models [9, 27, 28, 31, 46]. In the former, handwriting images are transformed to imitate the same text as if it were written in a different style, by changing letter shapes while preserving readability. In the latter case, synthetic handwritten images are created. This approach has the advantage of being able to push the diversity of handwriting styles further. Both strategies are widely used for introducing different writing styles to increase the generalization capacity of models.

In this paper, we have (1) conducted a comprehensive analysis using deep learning models on the line-level IAM [29] dataset, (2) proposed a simple yet effective test time augmentation method, (3) provided insightful error analysis regarding dataset-related issues, (4) explored the state-of-the-art approaches with their pros and cons and stated directions for future work, and (5) made our train, evaluation and benchmarking code public (https://github.com/firatkizilirmakk/handwriting-recognition) for reproducibility.

The rest of the paper is organized as follows. First, the related work regarding offline handwriting recognition is presented. Then our proposed deep learning model is described, which is followed by the data augmentation and synthetic data generation phases, test time augmentation, error analysis, a comparison with the state of the art, and conclusions.

2 Related Work

Hidden Markov Models. The fundamental approach to handwriting recognition prior to the deep learning era was with Hidden Markov Models (HMMs) [35]. HMMs are doubly stochastic models, where there is an underlying stochastic process governing the state transitions and another one generating observations during state transitions. Formally, the aim is to maximize the posterior probability P(S | X), taking these stochastic models of state transitions and output generation into account [3].

Preliminary studies explored offline handwriting recognition with a sliding window approach for feature extraction and Hidden Markov Models for generating character or word sequences [5, 34]. They further supported their models with external language models and investigated the effects of the lexicon during decoding.

CTC Based Methods. The Connectionist Temporal Classification (CTC) method was introduced for the speech recognition task [14], allowing RNN models to be trained end-to-end with the backpropagation [45] algorithm for sequence classification without any need for pre-segmented data. The method was later adopted for handwriting recognition [13]. In CTC based models, (1) a sequence of image features is extracted using a sliding window (1 pixel wide), (2) the extracted features are fed to a bi-directional LSTM (BiLSTM) [18], and (3) character sequences are produced via the CTC layer. The CTC algorithm takes a sequence of probability distributions and generates a character sequence consisting of recognizable characters. These models significantly outperformed HMM and HMM-RNN based approaches. Later, Graves et al. applied multi-dimensional LSTM (MDLSTM) layers instead of BiLSTMs to incorporate more context around letters and to obtain better transcriptions [15].

With the rise of deep learning and the great performance achieved by CNN models on image processing tasks [17, 25], researchers have considered deep learning methods for the handwriting recognition problem as well.
Instead of using hand-crafted image features [13, 15], Shi et al integrated a CNN network to produce more robust image features [39]. The method first processes the input handwriting image with the CNN, generates a sequence of image features, and passes them through BiLSTM-CTC layers to obtain the final transcription. Inspired by these, Bluche et al proposed a gated convolutional model for computing more generic image features [4]. Puigcerver [37] showed the effectiveness of single-dimensional LSTM layers over multi-dimensional ones. Further, Dutta et al. [9] made a comprehensive study demonstrating the effects of data augmentation, pretraining and the use of a Spatial Transformer Network (STN) [20].

Sequence-to-Sequence Approaches. The CTC method has the drawback that it prevents the generation of sequences longer than the input sequence, which is a sequence of feature maps [22, 31]. As feature maps get smaller, due to convolution and max pooling operations, the generated sequence becomes shorter, which in turn could result in missing transcriptions. Hence, attention based sequence-to-sequence methods have been developed in order to overcome the shortcomings of the CTC and to leverage their sequence learning capabilities for handwriting recognition [22-24, 31]. The fundamental idea with these methods is to use a CNN-RNN (usually a BiLSTM) to encode the input image as a sequence of features, and then an attention based RNN, usually an LSTM or a GRU, to decode the encoded sequence. The overall network is optimized with the cross entropy loss function, applied over each frame of the output sequence.

Among such approaches, Michael et al [31] utilized the CNN-BiLSTM approach, where the CNN model learned to encode the handwriting image and the attention-based LSTM model learned to decode the encoded representation into a sequence of characters. The authors compared different attention mechanisms such as content-based, location-based and penalized attention, and further combined the CTC loss with the cross entropy loss to increase model capabilities. Kass et al [24] integrated a spatial transformer network at the forefront of their sequence-to-sequence model to reduce handwriting variations before feeding images to the rest of the architecture. Apart from these, Kang et al [22] incorporated a character level language model into the training phase, where they feed the attention based LSTM decoder with the concatenation of the encoder and language model outputs.

More recently, transformer [44] based models have been used, due to their substantial achievements on sequence related tasks [7, 27]. Diaz et al compared CTC and sequence-to-sequence approaches, experimented with the transformer decoder, and found the best performing model to be a self-attention encoder with CTC decoding. Li et al, on the other hand, employed the Vision Transformer (ViT) [8] as an encoder and a vanilla transformer decoder, taking advantage of a pretrained ViT and transformer decoder. They obtained state-of-the-art results and further showed the effectiveness of their model and pretraining scheme, without any post processing or an external language model.

Data Augmentation & Synthetic Data Generation. In addition to model related developments, most of the studies proposed solutions for dealing with the data sparseness issue. There are two common approaches: (1) applying data augmentation techniques at train time to introduce a broad range of handwriting styles, and (2) generating synthetic handwriting images to provide enough diverse handwriting styles to deep learning models.

Affine transformations such as rotation, scaling, and shearing are heavily applied and are shown to be effective methods for mimicking handwriting styles [9, 36, 47]. More complex augmentation techniques have also been proposed. Wigington et al [46] developed a distortion method along with a profile normalization technique to vary letter shapes, which in turn generates more discriminative letter styles. Further, Luo et al [28] proposed to learn augmentation as a joint task during training of the networks.

The second approach to reducing data sparseness involves generating synthetic handwritten samples. Most of the studies have generated their own synthetic data, either at word or line level, and experimentally shown its effectiveness. While some have synthesized a few million handwriting images using words or sentences from known, large corpora [9, 22, 47], others used
synthetic data generated for the scene-text recognition problem [24], or cropped lines from PDF files containing handwriting text [27]. However, though the above methods report performance increases, none of them has published their dataset, which leaves the problem unsolved for other researchers.

As for test time augmentation, there are only a few studies [9, 36, 46], which follow similar approaches. The idea is to augment the input image at the test phase and then generate a new transcription based on the outputs of the original image and of the augmented ones. Poznanski et al [36] applied 36 different transformations to an image at test time, retrieved model outputs for these augmented images and the original one, and afterwards took the mean of these outputs as the final outcome. While Dutta et al [9] followed the same approach as [36], Wigington et al [46] employed 20 transformations, obtained the corresponding transcriptions and picked the final one with respect to the lowest CTC loss value.

3 CNN-BiLSTM Model

Inspired by [39] and its later, accomplished successors [4, 9, 37], we followed a similar deep learning architecture; more details can be found in [26]. The architecture consists of three fundamental blocks: feature extraction from the input image using convolutional layers (Section 3.1); processing the extracted deep features as a sequence using bidirectional LSTMs (Section 3.2); and producing the sequence of output characters with the help of CTC decoding (Section 3.3). These modules are described in detail below and the network structure is depicted in Figure 2.

Fig. 2 Our proposed deep learning network consisting of CNN-BiLSTM models as encoder and CTC as decoder.

3.1 Feature Extraction

The network uses 12 convolutional layers with 3×3 kernels for feature extraction. A max pooling operation with a 2×2 kernel is applied two times in total: after the first two and after the following four convolution layer blocks. The convolution layers in these blocks have 32, 64 and 128 filters. The ReLU [32] activation function and batch normalization [19] are applied after each convolution for faster convergence and easier training. The CNN network produces a feature map tensor of the form D × H × W, where D corresponds to the number of output filters, or the depth of the feature maps; H and W stand for the height and width of the feature maps, respectively. Then, we apply a max-pooling function over the height dimension and interchange the dimensions, resulting in a tensor of W × D.

This output represents D-dimensional feature vectors for a sequence of length W. More formally, W could be written as W = [w_1, w_2, ..., w_t], indicating a sequence of t time frames where w_i ∈ R^D for i ∈ {1, 2, ..., t}.
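For illustration, a minimal PyTorch sketch of such a feature extractor is given below. The filter counts (32, 64, 128), the two 2×2 max poolings, the per-convolution batch normalization and ReLU, and the column-wise max pooling follow the description above; the grouping of the 12 convolutions into blocks and the single-channel (grayscale) input are assumptions, not the authors' exact code.

```python
import torch
import torch.nn as nn

class ConvFeatureExtractor(nn.Module):
    """CNN backbone sketch: 3x3 convolutions with BatchNorm + ReLU,
    two 2x2 max poolings, then column-wise max pooling over height."""

    def __init__(self, in_channels: int = 1):  # grayscale line image assumed
        super().__init__()
        layers, c_in = [], in_channels
        # (num_convs, out_channels, pool_after); the grouping is illustrative
        for num_convs, c_out, pool in [(2, 32, True), (4, 64, True), (6, 128, False)]:
            for _ in range(num_convs):
                layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                           nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
                c_in = c_out
            if pool:
                layers.append(nn.MaxPool2d(kernel_size=2))
        self.cnn = nn.Sequential(*layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, H, W) -> feature maps (batch, D, H', W')
        f = self.cnn(x)
        # column-wise max pooling: collapse the height dimension
        f = f.max(dim=2).values          # (batch, D, W')
        return f.permute(0, 2, 1)        # (batch, W', D): one D-dim vector per frame
```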
We intentionally avoided using deep CNN networks in order not to narrow down the feature maps too much. Otherwise, the CTC algorithm would only allow producing shorter sequences due to its constraints. Nonetheless, we conducted experiments with different numbers of convolution and max-pooling layers, and even tried well-known image feature extractors like the ResNet18 [17] architecture, to decide on the baseline model. We explain the outcomes of these experiments in Section 6.2.

3.2 Sequence Encoding

This phase processes the sequence of features crafted by the feature extraction step. The idea is to incorporate the sequence learning capabilities of Recurrent Neural Networks to obtain sequences of features that better represent the input handwritten text.

Following [4, 9, 37, 39], we utilised a bi-directional LSTM model as well, leveraging learned contexts from both directions. We used two BiLSTM layers with 256 hidden nodes, without any dropout [42] applied. We fed the BiLSTM layers with the delivered sequence of features, W × D, and the BiLSTM produces a new tensor of size W × (2 × K), where K stands for the hidden dimension size of the BiLSTM layer, which is equal to 256. The produced tensor, a sequence of feature vectors, holds an encoded representation of the input handwriting image in a feature space.

Additionally, we explored bi-directional Gated Recurrent Unit (GRU) [6] based models and conducted experiments with different numbers of recurrent layers and hidden dimension sizes. Subsection 6.2 illustrates the effects of these experiments.
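As a sketch, the sequence encoder described above maps directly onto a stacked bidirectional LSTM; the input size D = 128 follows from the CNN above, while the batch-first layout is an assumption.

```python
import torch.nn as nn

# Two stacked BiLSTM layers with 256 hidden units each and no dropout,
# matching the description above; batch_first matches a (batch, W, D) layout.
sequence_encoder = nn.LSTM(
    input_size=128,      # D: depth of the CNN feature maps
    hidden_size=256,     # K
    num_layers=2,
    bidirectional=True,
    batch_first=True,
)
# sequence_encoder(features)[0] has shape (batch, W, 2*K) = (batch, W, 512)
```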
3.3 CTC Decoding

After passing the input image through the first two phases, feature extraction and sequence learning, the encoded representation of the input is obtained in the form W × (2 × K); formally, W = [w_1, w_2, ..., w_t] where w_i ∈ R^H for i ∈ {1, 2, ..., t}. Here, H stands for 2 × K, a value of 512 in our case of K being set to 256.

Then, the encoded sequence representation, W × (2 × K), is mapped to a sequence of probability distributions over the alphabet of recognizable characters, resulting in W × C. To do so, we utilised two fully connected layers; the first one is followed by a ReLU activation and the second one outputs through C cells. Further, the SoftMax function is applied to produce a sequence of probability distributions defined over the alphabet. Then, the CTC function processes the output for labelling the sequence, allowing the whole network to be end-to-end trainable.
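A sketch of this output head and the CTC training objective follows. The intermediate width of the first fully connected layer, the alphabet size C, and the blank index are assumptions not fixed by the text; note that PyTorch's nn.CTCLoss consumes log-probabilities, hence the log_softmax here in place of the SoftMax described above.

```python
import torch
import torch.nn as nn

C = 80   # assumed alphabet size including the CTC blank; not stated in this excerpt
head = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, C))
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)  # blank index 0 is an assumption

def training_step(encoded, targets, target_lengths):
    # encoded: (batch, W, 512) BiLSTM output; targets: concatenated label indices
    logits = head(encoded)                                   # (batch, W, C)
    log_probs = logits.log_softmax(dim=2).permute(1, 0, 2)   # CTCLoss wants (W, batch, C)
    input_lengths = torch.full((encoded.size(0),), encoded.size(1), dtype=torch.long)
    return ctc_loss(log_probs, targets, input_lengths, target_lengths)
```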
3.4 Word Level Decoding

We employed three different decoding methods at the test phase to obtain the final transcription from the model output, which is a sequence of probability distributions: greedy decoding, beam search [12] and word beam search [38]. Greedy decoding is the simplest one: it picks the most likely character at each time frame. Although it is intuitive to select the character with the highest probability, this is not guaranteed to generate the most likely transcription sequence.

Instead of selecting one character at a time, beam search keeps the k most likely transcriptions at each time frame and produces the transcription with the highest score among these k alternatives [12]. Word Beam Search (WBS), on the other hand, is an extension of beam search which constrains words to be in a given lexicon while still allowing non-word characters to occur between words [38]. Moreover, the algorithm incorporates a 2-gram word-level language model for scoring words during decoding. We used the WBS algorithm to output the final transcription; we set the beam width to 150 and configured word beam search to utilize the 2-gram language model.
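Greedy decoding is compact enough to state as code. A minimal sketch, assuming blank index 0 and an alphabet string whose characters start at index 1; it takes the argmax per frame, collapses repeats, then drops blanks:

```python
import torch

def greedy_decode(log_probs: torch.Tensor, alphabet: str, blank: int = 0) -> str:
    """log_probs: (W, C) frame-wise distributions for one image."""
    best = log_probs.argmax(dim=1).tolist()
    chars, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:     # collapse repeats, then drop blanks
            chars.append(alphabet[idx - 1])  # assumes index 0 is the blank symbol
        prev = idx
    return "".join(chars)
```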
4 Data Augmentation & Synthetic Data Generation

4.1 Public-Domain Datasets

The handwriting recognition domain lacks high-quality, open-access datasets with a high number of samples, even though the problem has been studied for decades. There are a few common datasets that contain full sentences, namely IAM, Rimes, Bentham and Washington [10, 16, 29, 43], where the last two consist only of historical handwriting images, which are not suitable for modern cases.

The IAM dataset was first published in 1999 [29] and is still the most commonly used English handwriting dataset. It contains handwritten text in English written by 657 different people. There are 1,539 scanned pages, partitioned into 10,373 text lines.
2 https://github.com/jpuigcerver/Laia/issues/44
3 Publicly shared at https://github.com/Canjie-Luo/Text-Image-Augmentation
Fig. 5 An example application of scoring transcriptions. (a) The output of the original image is scored using Equation 1; the optical score is obtained from the deep learning network, and the language model score is computed using the 4-gram language model. (b) and (c) show the scoring steps for the transformed images. The output with the highest final score is picked as the final output, which is the correct transcription of the handwriting.
score obtained from the below-defined 4-gram language model. Then the transcription with the highest score was picked as the final outcome. Through the experiments, we fit values for λ and ω.

We trained the language model as a 4-gram model on the WikiText-103 [30] corpus with Kneser-Ney smoothing [33], using the KenLM tool (https://github.com/kpu/kenlm). Before training the language model, we applied preprocessing steps including tokenization of the words, lowercasing the letters and removing punctuation.

Figure 5 illustrates the proposed method in a simplified fashion. For each image (original or transformed), the optical score produced by the model for the image and the language model score computed for the corresponding transcription are combined using the above-defined formula. The transcription with the highest combined score (−7.98 for the example in the figure) is picked as the final output, which in this case is the correct decoding, obtained from the image rotated by 2.5 degrees.

In addition to the simplicity of this method, it is applicable to any network generating character sequences from handwriting images. Further, the method achieves superior results (see Table 9) once the best decoding is selected with respect to the lowest error, as if an oracle were telling the truth. This suggests that a better scoring function would yield greater success.
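A sketch of this selection step is given below, using KenLM's Python bindings. Equation 1 is not reproduced in this excerpt, so the weighted-sum form and the example weights below are assumptions; the model path is hypothetical.

```python
import kenlm  # https://github.com/kpu/kenlm

lm = kenlm.Model("wikitext103.4gram.arpa")  # hypothetical path to the trained 4-gram LM

def pick_transcription(candidates, lam=0.4, omega=0.6):
    """candidates: list of (transcription, optical_score) pairs, one per
    original/augmented image; scores are log-domain, higher is better.
    lam and omega stand in for the fitted weights of Equation 1."""
    def combined(item):
        text, optical = item
        return lam * optical + omega * lm.score(text, bos=True, eos=True)
    return max(candidates, key=combined)[0]
```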
dataset was not split into validation or test partitions; the model was directly trained on the whole dataset for 5 epochs.

We reduced the character errors by around 0.15 points and the word errors by around 0.4 points (from 5.20% to 5.05% for character errors and from 14.86% to 14.46% for word errors) by only pretraining the model on more than 2 million synthetic handwriting images. However, there is room for improvement once more synthetic and natural handwriting data is available for pretraining and fine-tuning. We performed the next set of experiments using this pretrained model.

Afterwards, the pretrained network was fine-tuned on the IAM dataset train partitions in two ways: (1) freezing all the parameters of the model except the output linear layers, and (2) updating all the parameters. Then, the fine-tuned model was evaluated on the IAM validation and test partitions; the corresponding scores are given in Table 4.

Table 4 Results of our pretrained deep learning model on the IAM validation and test splits. Linear layer: parameters of the output linear layers are updated while fine-tuning. All layers: all the parameters are updated.

Pretraining                             | Validation CER (%) | Validation WER (%) | Test CER (%) | Test WER (%)
Baseline from Table 3 (No Pretraining)  | 4.06               | 13.02              | 5.20         | 14.86
Pretrained: Linear layer                | 4.21               | 14.16              | 5.44         | 15.58
Pretrained: All layers                  | 3.88               | 12.71              | 5.05         | 14.46

Updating only the parameters of the final linear layers did not reveal better scores, as expected, due to the difference in handwriting styles between synthetic and natural ones. Hence, we adopted pretraining all layers in the remainder of the experiments.
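In PyTorch terms, the two fine-tuning variants differ only in which parameters keep requires_grad; a sketch follows, where the name model.head for the final linear layers is assumed, not taken from the authors' code.

```python
import torch

def make_finetune_optimizer(model: torch.nn.Module, head_only: bool) -> torch.optim.Optimizer:
    """Variant (1): head_only=True freezes everything except the output
    linear layers. Variant (2): head_only=False updates all parameters."""
    for p in model.parameters():
        p.requires_grad = not head_only
    if head_only:
        for p in model.head.parameters():  # 'head' = the final linear layers (assumed name)
            p.requires_grad = True
    return torch.optim.Adam(p for p in model.parameters() if p.requires_grad)
```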
6.5 Effect of Lexicon, Letter Case, and Punctuation

We decoded the model outputs using three different methods (greedy, beam search, and word beam search) and report the best results, obtained by the word beam search algorithm, in all result tables. We used the greedy and beam search methods as is; in other words, without adding an external language model or considering a lexicon during decoding. Word beam search, however, builds a prefix tree from a corpus, i.e. a lexicon, to keep decoded words in the lexicon. Therefore, we explored the effect of lexicon size and out-of-vocabulary rate on decoding performance.

We experimented with three different English corpora and their combinations for building the lexicon. The base lexicon consists of the texts from the IAM training partition, which is part of the LOB corpus [1]. On top of it, we appended the Brown corpus [11], a million-word English text corpus consisting of texts from 15 different categories. We also added the WikiText-2 corpus, comprising over two million English words. Further, we assessed their combination as well as the validation partition of the IAM dataset. The lexicon built upon the IAM validation split has a zero out-of-vocabulary rate, which is not possible in real-time scenarios, yet it underlines the effect of the lexicon.

The error rate depends on the context of the words in the lexicon. As the rate of out-of-vocabulary (OOV) words decreased, the error rates were reduced as well, as illustrated in Table 5.

Table 5 Effect of lexicon size, given as the number of unique words (after tokenization, lowercasing and discarding punctuation), and the out-of-vocabulary (OOV) rate on errors. B stands for the Brown corpus and W2 for the WikiText-2 corpus.

Lexicon                            | # Words | OOV Rate | CER (%) | WER (%)
Baseline from Table 4 (IAM Train)  | 7,845   | 0.54     | 3.88    | 12.71
+ B                                | 57,606  | 0.15     | 3.49    | 11.77
+ W2                               | 60,244  | 0.21     | 3.56    | 11.92
+ B + W2                           | 82,449  | 0.13     | 3.46    | 11.70
+ IAM Validation                   | 2,251   | 0.00     | 2.20    | 6.35
In a normal test scenario, it is not possible to include all likely words in a lexicon, which is the IAM Test case in the table. Nonetheless, most of the time we obtained lower error rates when using larger lexicons. However, one needs to account for the trade-off between the running time with a bigger lexicon and the obtained performance. We preferred
As Tables 7 and 8 indicate, there is room for improvement once a better scoring function is employed. Moreover, this strategy is applicable to any model generating character sequences from handwriting images, which makes it worth considering for other models as well.

We also analyzed the running times, compared with decoding only the original images. We ran our proposed methods on two different machines with the following configurations: (1) AMD Ryzen 3970X, 64 GB RAM, with a GeForce RTX 3090; (2) Intel i7 10700K, 32 GB RAM, with a GeForce RTX 3080. The time taken for each image was measured, and its average and standard deviation over all the samples in the IAM test partition were computed for both of our methods, as shared in Table 9. Even though this test time augmentation method slows down obtaining the final transcription, it could be preferred for batch processing of documents when there are no real-time requirements. Additionally, a better scoring function would reveal greater success without needing more model parameters or more data.

Table 9 Running times of the test time augmentation method on two different machines with CPU & GPU options: mean and standard deviation of the elapsed time over the IAM test partition.

Machine | Processing Device | Original Time (s), Mean | Original Time (s), Std. | With Test Time Aug. Time (s), Mean | With Test Time Aug. Time (s), Std.
PC 1    | CPU               | 2.58                    | 1.03                    | 21.46                              | 4.08
PC 1    | GPU               | 2.27                    | 0.98                    | 19.33                              | 3.84
PC 2    | CPU               | 3.64                    | 1.12                    | 25.54                              | 4.56
PC 2    | GPU               | 3.28                    | 1.02                    | 24.76                              | 4.21

7 Error Analysis

Up to this point, we measured system performance in terms of the CER and WER metrics and computed their means over the IAM test partition. Once we examined the erroneous cases, we found that a lot of the errors are contributed by a small number of text lines.

In this section, we provide an analysis of the error distribution; explore difficult samples along with possible ways to overcome these challenging handwriting samples; and share mislabelled example lines from the IAM dataset. We analyzed the outputs of the best models without the test time augmentation, for simplicity.

One can see in Figure 6 that in 42% of all the lines, all the characters were recognized correctly, and in 80.89% of all the images in the IAM test set there were at most 3 character mistakes. Note that the dataset has an average of 35.2 characters per line image, with a standard deviation of 8.72. This is a rather surprising statistic, which we believe can indicate certain future research directions; based on this observation, we focused on test time augmentation, for example. Similarly, the model produced at most one incorrect word in 68.75% of all the samples in the test split, where the mean number of words per line is 8.90.

Some of the most challenging writing samples are given in Figure 7. The figure shows examples from the IAM test dataset where our model made at least five character errors, which corresponds to 7% of the test set samples. Note that these handwritings are difficult for people as well.

Furthermore, the IAM dataset contains partially or totally wrong ground truth labels, which in fact affects both the training and testing phases. Therefore, we found and corrected the erroneous cases on the IAM test split in order to evaluate better. Example erroneous lines along with their issues are presented in Appendix A. We will share the new, fixed metadata of the test split at https://github.com/firatkizilirmakk/handwriting-recognition.

8 Comparison with the State of the Art

Our approach is compared to state-of-the-art methods with varying characteristics, as shown in Table 10. However, it should be noted that a direct comparison of the models may not be very meaningful, due to differences or omissions in training or testing criteria.
Table 10 Comparison of our approach with the state-of-the-art approaches. LM & Lexicon denotes the language model and the lexicon being used (if any) during decoding. Decoding indicates which method is used to decode network outputs into a character sequence; here WBS denotes the word beam search algorithm and bw is the beam width. #P is the number of trainable parameters (in millions) of the deep learning networks. CaseInsens. stands for case-insensitive decoding and NoPunct. denotes decoding without considering punctuation.

Authors      | Encoder     | Decoder      | Train Set         | LM & Lexicon    | Decoding   | #P    | CER % | WER %
Bluche [4]   | GCRNN       | CTC          | IAM + Multi.      | 7-gram LM + 50K | -          | 0.75  | 3.2   | 10.5
Xiao [47]    | CNN + LSTM  | CTC          | IAM + Synth.      | Brown + LOB     | WBS bw/150 | -     | 3.03  | 8.66
Michael [31] | CNN + LSTM  | LSTM w. Att. | IAM               | -               | Beam bw/16 | -     | 4.87  | -
Kang [21]    | Transformer | Transformer  | IAM + Synth.      | -               | Greedy     | 100   | 4.67  | 15.45
Li [27]      | Transformer | Transformer  | IAM + Synth.      | -               | Beam bw/10 | 334   | 3.42  | -
Li [27]      | Transformer | Transformer  | IAM + Synth.      | -               | Beam bw/10 | 558   | 2.89  | -
Diaz [7]     | Self Att.   | CTC          | Public            | 9-gram LM       | Greedy     | ~12.5 | 3.15  | -
Diaz [7]     | Self Att.   | CTC          | Internal + Public | 9-gram LM       | Greedy     | ~12.5 | 2.75  | -
Fig. 6 Cumulative character and word error rates obtained with the baseline model on the IAM dataset. (a) illustrates cumulative character errors; for example, all characters are predicted correctly in 42% of all the lines, which corresponds to 1,226 line images, and one or at most two characters are predicted incorrectly in 560 line images, which corresponds to 17% of all the lines. (b) displays cumulative word error rates in the same manner.
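Such cumulative distributions can be computed from per-line predictions with an edit-distance routine; a sketch using the python-Levenshtein package follows, where the choice of package is an assumption.

```python
from Levenshtein import distance  # pip install python-Levenshtein

def cumulative_error_counts(pairs, max_errors=10):
    """pairs: list of (ground_truth, prediction) line transcriptions.
    Returns the fraction of lines with at most k character errors."""
    errs = [distance(gt, pred) for gt, pred in pairs]
    n = len(errs)
    return {k: sum(e <= k for e in errs) / n for k in range(max_errors + 1)}

# e.g. counts[0] would be ~0.42 and counts[3] ~0.81 on the IAM test set above
```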
Our model is able to compete with the similar approaches proposed in [4, 47]. However, we were not able to fully duplicate their results, due to differences including (1) the extra data they used in training, (2) the language model they employed to support the deep learning network output, and (3) the lexicon their methods select decoded words from. Yet, our method achieves performance close to these works, together with an open and comprehensive evaluation.

Others exploited attention based approaches to make use of their advantages in sequence learning. In addition to the deep learning architectures, they generated millions of synthetic handwriting images and/or crawled real handwriting images from the web, which is another factor in their success. As summarized in Table 10, Li et al. [27] pretrained a Transformer model on their synthetic dataset, fine-tuned it on the IAM and achieved superior results without applying any post processing method, e.g. a language model or a lexicon. Diaz et al. [7] obtained state-of-the-art performance, where they preferred a simpler model with a considerably smaller number of parameters, trained on datasets they collected, and supported the outputs with a 9-gram language model.

These works suggest that the crucial part of a successful handwriting recognition system is to utilize a large amount of high quality data, either synthetic or real. Once the data requirement is satisfied, it is better to employ state-of-the-art deep learning networks such as the Transformer, the Vision Transformer or other models with an attention mechanism. Next, the final output could be decided with the help of an n-gram language model to further increase the success.

9 Summary and Conclusions

We proposed a CNN-BiLSTM model for offline English handwriting recognition and comprehensively evaluated it in regards to model architecture, data augmentation, synthetic data generation, and data representation. We have also suggested a simple yet effective post processing method utilizing test time augmentation. Even though the method takes more time compared to decoding a single image, it could be preferable in batch processes that do not have time constraints, and it is applicable to any model producing a sequence of characters from images.

In addition to the usual metrics for assessing handwriting recognition models, we analyzed the error distributions to gain insights regarding the challenges. We have shown that the majority of the errors stem from a relatively small portion of the test cases, which can be instructive for improvements.

We will share our code openly upon publication, to support other researchers as well as reproducibility.

References

[1] Johansson, S.: Lancaster-Oslo/Bergen corpus of modern English (LOB): [tagged, horizontal format]. http://hdl.handle.net/20.500.12024/0167, Oxford Text Archive
True: waving , unkissed , from the window . And Dai , on the pavement , knowing in his
Prediction: morsing , hired , from the window . rd dean , on the panime"t . knowing in his
CER: 18.0 WER: 6.0
Fig. 7 Difficult samples from the IAM dataset test partition, showing that samples for which the model predicted with more than 5 character errors (i.e. CER > 5) are not easy for people to read either.
[2] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate (2014). https://doi.org/10.48550/ARXIV.1409.0473, https://arxiv.org/abs/1409.0473

[3] Bahl, L., Brown, P., de Souza, P., Mercer, R.: Maximum mutual information estimation of hidden Markov model parameters for speech recognition. In: ICASSP '86, IEEE International Conference on Acoustics, Speech, and Signal Processing. vol. 11, pp. 49-52 (1986). https://doi.org/10.1109/ICASSP.1986.1169179

[4] Bluche, T., Messina, R.: Gated convolutional recurrent neural networks for multilingual handwriting recognition. In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). vol. 01, pp. 646-651 (2017). https://doi.org/10.1109/ICDAR.2017.111

[5] Bunke, H., Bengio, S., Vinciarelli, A.: Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(6), 709-720 (2004). https://doi.org/10.1109/TPAMI.2004.14

[6] Cho, K., van Merrienboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using RNN encoder-decoder for statistical machine translation (2014). https://doi.org/10.48550/ARXIV.1406.1078, https://arxiv.org/abs/1406.1078

[7] Diaz, D.H., Qin, S., Ingle, R., Fujii, Y., Bissacco, A.: Rethinking text line recognition models (2021). https://doi.org/10.48550/ARXIV.2104.07787, https://arxiv.org/abs/2104.07787

[8] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale (2020). https://doi.org/10.48550/ARXIV.2010.11929, https://arxiv.org/abs/2010.11929

[9] Dutta, K., Krishnan, P., Mathew, M., Jawahar, C.: Improving CNN-RNN hybrid networks for handwriting recognition. In: 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR). pp. 80-85 (2018). https://doi.org/10.1109/ICFHR-2018.2018.00023

[10] Fischer, A., Keller, A., Frinken, V., Bunke, H.: Lexicon-free handwritten word spotting using character HMMs. Pattern Recognition Letters 33(7), 934-942 (2012), https://www.sciencedirect.com/science/article/pii/S0167865511002820, special issue on Awards from ICPR 2010

[11] Francis, W.N., Kucera, H.: Brown corpus manual. Tech. rep., Department of Linguistics, Brown University, Providence, Rhode Island, US (1979), http://icame.uib.no/brown/bcm.html

[12] Graves, A.: Sequence transduction with recurrent neural networks (2012). https://doi.org/10.48550/ARXIV.1211.3711, https://arxiv.org/abs/1211.3711

[13] Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., Schmidhuber, J.: A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31(5), 855-868 (2009). https://doi.org/10.1109/TPAMI.2008.137

[14] Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning. pp. 369-376. ICML '06 (2006). https://doi.org/10.1145/1143844.1143891

[15] Graves, A., Schmidhuber, J.: Offline handwriting recognition with multidimensional recurrent neural networks. In: Koller, D., Schuurmans, D., Bengio, Y., Bottou, L. (eds.) Advances in Neural Information Processing Systems. vol. 21. Curran Associates, Inc. (2008), https://proceedings.neurips.cc/paper/2008/file/66368270ffd51418ec58bd793f2d9b1b-Paper.pdf

[16] Grosicki, E., Carré, M., Brodin, J.M., Geoffrois, E.: Results of the RIMES evaluation campaign for handwritten mail processing. In: 2009 10th International Conference on Document Analysis and Recognition. pp. 941-945 (2009). https://doi.org/10.1109/ICDAR.2009.224

[17] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770-778 (2016). https://doi.org/10.1109/CVPR.2016.90

[18] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735-1780 (1997). https://doi.org/10.1162/neco.1997.9.8.1735

[19] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: Proceedings of the 32nd International Conference on Machine Learning - Volume 37. pp. 448-456. ICML'15, JMLR.org (2015)
Appendix A Example erroneous lines from the IAM test split:
Label: Stepping out in their white shorts , they looked glad enough
Label: -----------------------------------------------------