
Received May 5, 2021, accepted May 12, 2021, date of publication May 21, 2021, date of current version June 2, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3082689

Boosting Offline Handwritten Text Recognition in Historical Documents With Few Labeled Lines
JOSÉ CARLOS ARADILLAS 1, JUAN JOSÉ MURILLO-FUENTES 1 (Senior Member, IEEE), AND PABLO M. OLMOS 2,3 (Member, IEEE)
1 Departamento de Teoría de la Señal y Comunicaciones, Escuela Técnica Superior de Ingeniería, Universidad de Sevilla, 41004 Sevilla, Spain
2 Departamento de Teoría de la Señal y Comunicaciones, Universidad Carlos III de Madrid, 28903 Madrid, Spain
3 Gregorio Marañón Health Research Institute, 28007 Madrid, Spain
Corresponding author: Juan José Murillo-Fuentes ([email protected])
This work was supported in part by the Spanish Government Ministerio de Educación (MEC) under Grant FPU16/04190 and Project MINECO TEC2016-78434-C3-2-3-R; in part by the Comunidad de Madrid under Grant IND2017/TIC-7618, Grant IND2018/TIC-9649, Grant IND2020/TIC-17372, and Grant Y2018/TCS-4705; in part by the Banco Bilbao Vizcaya Argentaria (BBVA) Foundation through the Domain Alignment and Data Wrangling with Deep Generative Models (Deep-DARWiN) Project; and in part by the European Union through the Fondo Europeo de Desarrollo Regional (FEDER) and the European Research Council (ERC) through the European Union Horizon 2020 Research and Innovation Program under Grant 714161.
The associate editor coordinating the review of this manuscript and approving it for publication was Gang Mei.

ABSTRACT In this paper we address the problem of offline handwritten text recognition (HTR) in historical documents when few labeled samples are available and some of them contain errors in the train set. Our three main contributions are: first, we analyze how to perform transfer learning (TL) from a massive database to a smaller historical database, analyzing which layers of the model need fine-tuning; second, we analyze methods to efficiently combine TL and data augmentation (DA); finally, we propose an algorithm to mitigate the effects of incorrect labeling in the training set. The methods are analyzed over the ICFHR 2018 competition database and the Washington and Parzival databases. Combining all these techniques, we demonstrate a remarkable reduction of the CER (up to 6 percentage points in some cases) in the test set with little complexity overhead.

INDEX TERMS Connectionist temporal classification (CTC), convolutional neural networks (CNN), data
augmentation (DA), deep neural networks (DNN), historical documents, long-short-term-memory (LSTM),
offline handwriting text recognition (HTR), outlier detection, transfer learning.

I. INTRODUCTION

The transcription of historical manuscripts is paramount for a better understanding of our history, as it allows for direct access to the contents, greatly facilitating searches and studies. Also, the classification and indexing of transcript text can be easily automated. Handwritten text recognition (HTR) tasks in historical datasets have been studied by many authors in the last few years [1]–[10]. In HTR, transcribing each author can be considered a different task, since the distribution of both model input and output varies from writer to writer. At the input, we have variations not only in the calligraphy but also, depending on the digitization process, in the image resolution, contrast, color, or background. On the other hand, at the output, the labels usually correspond to different languages and historical periods, with differences in the character set, the semantics and the lexicon.

Usually, the process for automatic transcription of a document comprises 4 phases: 1) digitization of the document to obtain an image of every page in the document in electronic format; 2) segmentation of each page into corresponding regions with lines of text; 3) transcription of each line of text; and finally 4) application of a dictionary or language model to correct errors in the transcription of texts as well as in the composition of the complete texts from the lines obtained in step 3).

While segmentation is an important issue in HTR [2], [11]–[13], this paper focuses on the transcription phase. In recent years, there has been a trend towards models based on deep neural networks (DNNs) [8]. Once the DNN model to be used has been designed, an enormous number of training samples is required to minimize the number of transcription errors, measured in character error rate
(CER) or word error rate (WER), given by the Levenshtein distance [14] between the ground truth (GT) and the output of the model.
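For reference, a minimal sketch of the CER computation from the Levenshtein distance follows; WER is identical with the strings split into words. This helper is our own illustration, not code from the paper.

    def levenshtein(a, b):
        # Edit distance between sequences a and b (dynamic programming).
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1,                # deletion
                               cur[j - 1] + 1,             # insertion
                               prev[j - 1] + (ca != cb)))  # substitution
            prev = cur
        return prev[-1]

    def cer(truth, hyp):
        # Character error rate: edit distance normalized by the GT length.
        return levenshtein(truth, hyp) / len(truth)

    def wer(truth, hyp):
        return levenshtein(truth.split(), hyp.split()) / len(truth.split())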
However, in reality, most of the time we might only have a limited number of lines for a given author and document. Besides, transcription of part of the documents to get labeled samples is expensive either in time or money. Take [1] as an example, where the manual transcription process of a document by an expert in paleography took an average of 35 minutes per page. In this scenario, allowing for a reduction in the transcript needed would greatly improve the viability and cost of the process.
The goal of the methods proposed in this paper is to significantly reduce the number of annotated lines, thus reducing the monetary cost of the process. Contributions are as follows:
• A DNN model with transfer learning (TL), studying in detail which layers should be retained and which ones should be retrained. We found this model to provide the best performance within the models using long-short-term-memory (LSTM).
• A thoughtful analysis of HTR when little labeled data is available, studying the evolution with the number of transcript lines, highlighting how sensitive the models are to the number of lines in the train set when this number is low.
• A model combining TL and data augmentation (DA). We show that for a reduced number of lines, using DA can be counterproductive.
• An algorithm to detect and avoid mislabeled lines, proposed and tested with good results in small training datasets. Furthermore, in some scenarios, wrong labels are fixed.
The remainder of the paper is organized as follows: in Section II previous works reporting solutions to the problem of HTR over small datasets are summarized; in Section III the architecture and models used in this paper are presented; in Section IV the application of TL and DA for HTR is analyzed; in Section V an algorithm is proposed to detect and prune mislabeled lines in the training set; the paper ends with Section VI, drawing main conclusions.
II. RELATED WORK AND CONTRIBUTIONS
A. DNN MODEL
State-of-the-art architectures combine a convolutional neural network (CNN) [15] with a recurrent neural network (RNN) with LSTM cells [16]. This type of network models the conditioned probability, p(l|x), of a character sequence of arbitrary length, l, given an image, x, of fixed height and arbitrary width. Note that varying the number of characters yields a new design of the last layer of the DNN model. These models are configured to minimize the connectionist temporal classification (CTC) cost function proposed by Graves in [17].
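To make the CTC output concrete, the following sketch (an illustrative helper of ours, not code from the paper) performs best-path decoding of the per-timestep class scores: the argmax symbol is taken at every time step, repeats are collapsed, and blanks are removed.

    import numpy as np

    def ctc_best_path(logits, alphabet, blank=None):
        # Greedy CTC decoding; the blank is assumed to be the extra last class.
        blank = len(alphabet) if blank is None else blank
        best = np.argmax(logits, axis=-1)   # (T,) best class per time step
        decoded, prev = [], None
        for k in best:
            if k != blank and k != prev:
                decoded.append(alphabet[k])
            prev = k
        return ''.join(decoded)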
In some works, 2D-LSTM [16] networks are used [18]–[21]. This RNN has two main drawbacks. On the one hand, it has an extremely large number of parameters that makes learning difficult. On the other hand, it is not parallelizable [22]. For these reasons, it has been discarded here.
Regarding the state-of-the-art DNN models for HTR, some recent works avoid recurrence in the models, developing models based on fully-convolutional networks such as the (Gated) Fully Convolutional Networks, (G)FCN [23]–[26]. This kind of model reduces the number of parameters in the architecture.
B. TRANSFER LEARNING
In the HTR problem with a reduced training set, TL was applied by Soullard et al. in [7]. The main idea behind TL is to initialize the parameters of a model with those learned beforehand from a huge dataset, denoted as the source. Then, the available labeled set of samples of the dataset of interest, the target, is used to refine the parameters of the model, usually just a subset of them [27]–[30]. In [30], they analyze how to reduce the dataset shift and enhance the feature transferability in task-specific layers of deep networks. Hence, with TL we start learning a different task to avoid learning the whole set of parameters from scratch, preventing overfitting and favoring convergence. In [7], they proposed a method that applies TL in both the optical and the language model. In this and other similar previous proposals on TL, the authors applied DA in both the training and test steps.

C. DATA AUGMENTATION
DA consists of augmenting the training set with synthetically generated samples. Similar to TL, it reduces the tendency to overfit when training models with a large number of parameters and limited labeled data. In DA for image classification problems, the training set is increased by modifying the original images through transformations such as scaling, rotation, or flipping, among others [31]. Several authors have proposed specific DA techniques for HTR: in [32] the authors apply methods for augmentation and normalization to improve HTR by allowing the network to be more tolerant of variations in handwriting through profile normalization. In [33] they show some affine transformation methods for data augmentation in HTR. In [34] and [35] they synthesize new line images by concatenating characters from different datasets; [34] does it from cursive characters while in [35] they do it from a database of handwritten Chinese characters. Similar to [32], in [36] they also apply some elastic distortions to the original images. In [4] the authors improve the performance by augmenting the training set with specially crafted multi-scale data. They also propose a model-based normalization scheme that considers the variability in the writing scale at the recognition phase. In these works, they apply DA to relatively large, well-known datasets, but here we show that the regularization effect of any DA technique has no impact when doing the fine-tuning adaptation to a singular writer in small databases. Accordingly, we conclude that the combination of TL and DA applied to small datasets has to be done carefully, to reduce the final error.

FIGURE 1. The adapted CRNN architecture from [39] used as baseline. The number of channels of each CNN
layer is shown in this scheme. Pooling layers after the first, second and third CNN layer are also depicted.

D. MISLABELED SAMPLES
Mislabel detection in HTR has seldom been addressed. In [37] they face a specific problem in the IAM database: crossed-out words that are labeled with the symbol "#". The authors propose a method to mitigate how this specific label affects the performance. That method is focused on the specific problem of crossed-out text and how it is annotated in the GT. The algorithm we propose in Section V is more general, addressing this and other possible problems. In further related work, in [38] the authors apply a method to align the output of a segmentation process with the available GT.
III. ARCHITECTURE AND DATABASES
In the HTR pipeline, there are several ways to improve the performance of a DNN model: preprocessing steps, the architecture used, regularization techniques, optimization, and the language model and dictionary, among others. The methods proposed in this paper are developed for an exemplary state-of-the-art DNN architecture but can be easily used in the pipeline of any other HTR system to reduce transcript errors. For a fair comparison, in this paper we use the same DNN model for all the experiments. Extra correction steps such as adding a language model (LM) are not included but could be applied to further improve the performance. In this section, we focus on the selection of the DNN model and the databases used.

A. ARCHITECTURE
In this work, we implement a network architecture based on the convolutional recurrent neural network (CRNN) presented in [39]. This approach avoids the use of two-dimensional LSTM (2D-LSTM) layers, applying convolutional layers as feature extractors and a stack of 1D bidirectional LSTM (BLSTM) layers to perform classification. Previous DNN architectures for HTR consisted of a combination of 2D-LSTM layers and convolutional layers, with a collapsing stage before the output layer in order to reshape the feature tensors from 2D to 1D [18], [19]. The use of 2D-LSTM layers at the first stages has several drawbacks, such as the need for more memory in the allocation of activations and buffers during back-propagation, and the longer runtime required to train the networks, since parallel computation cannot be implemented, in contrast to a CNN [22]. Recently, it has been shown that a CNN in the lower layers of an HTR system obtains features similar to those of an RNN containing 2D-LSTM units [22].
The CRNN architecture proposed in [39] is comprised of seven CNN layers with a max-pooling step at the output of four of them, followed by a stack of two BLSTM layers at the top of the network. In [6] we have shown that the CRNN in Fig. 1, the one used in this work (implementation publicly available at https://github.com/josarajar/HTRTF), achieves better performance than the original one proposed in [39]. It uses a CNN with 5 layers at the bottom, with 3 × 3 kernels and 1 × 1 stride; the numbers of filters are 16, 32, 48, 64 and 80, respectively. We use LeakyReLU as the activation function. A 2 × 2 max-pooling is also applied at the output of the first 3 layers, to reduce the size of the input sequence. At the output of the CNN, a column-wise concatenation is carried out with the purpose of transforming the 3D tensors of size w × h × d (width × height × depth) into 2D tensors of size w × (h × d), where w and h are the width and height of the input image divided by 8, i.e., after 3 stages of 2 × 2 max-pooling. The depth, d = 80, is the number of features of the last CNN layer. Therefore, at the output of the CNN, we have sequences of length w and depth h × 80 features.
After the CNN stage, five 1D BLSTM recurrent layers of 256 units, with hyperbolic tangent activations and without peephole connections, are stacked.

FIGURE 2. IAM handwritten text sample: image of a line and its transcript.

FIGURE 3. Washington handwritten text sample: image of a line and its transcript.

Since at the output of each BLSTM layer we have 256 features in each direction, we perform a depth-wise concatenation to adapt the input of the next layer to the overall size of 512. Dropout regularization [18], [40] is applied at the output of every layer, except for the first convolutional one, with rates 0.2 for the CNN layers and 0.5 for the BLSTM layers.
Finally, each column of features after the 5th BLSTM layer, with depth 512, is mapped into the L + 1 output labels with a fully connected layer, where L is the number of characters in the alphabet of each database, e.g., 79, 83, 96 or 102 in the IAM, Washington, Parzival or International Conference on Frontiers in Handwriting Recognition (ICFHR) 2018 Competition databases, respectively. The additional dimension is needed for the blank symbol of the CTC [17], which concludes this architecture. Overall, this CNN-BLSTM-CTC model has roughly 9.58 × 10^6 parameters, depending on the number of characters in each database.
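A minimal TensorFlow/Keras sketch of a CRNN with this shape is given below. The layer sizes follow the description above, but this is our own illustration rather than the authors' released implementation (see their repository for that); names, the default input height (any multiple of 8) and minor details are assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers

    def build_crnn(height=64, n_labels=79, lstm_units=256):
        # Grayscale line image of fixed height and arbitrary width.
        x_in = layers.Input(shape=(height, None, 1))
        x = x_in
        # Five 3x3, stride-1 CNN layers with 16, 32, 48, 64 and 80 filters;
        # 2x2 max-pooling after the first three divides height and width by 8.
        for i, filters in enumerate([16, 32, 48, 64, 80]):
            x = layers.Conv2D(filters, 3, padding='same')(x)
            x = layers.LeakyReLU()(x)
            if i < 3:
                x = layers.MaxPooling2D(2)(x)
            if i > 0:
                x = layers.Dropout(0.2)(x)   # no dropout after the first CNN layer
        # Column-wise collapse: (h/8, w/8, 80) -> sequence of w/8 vectors
        # of size (h/8) * 80.
        x = layers.Permute((2, 1, 3))(x)
        x = layers.Reshape((-1, (height // 8) * 80))(x)
        # Five BLSTM layers of 256 units per direction, dropout 0.5.
        for _ in range(5):
            x = layers.Bidirectional(
                layers.LSTM(lstm_units, return_sequences=True))(x)
            x = layers.Dropout(0.5)(x)
        # Linear map to the L + 1 classes (alphabet plus the CTC blank).
        logits = layers.Dense(n_labels + 1)(x)
        return tf.keras.Model(x_in, logits)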
The architecture is implemented in the open-source framework TensorFlow in Python, using the GPU-enabled version. We use the Adam algorithm, with a learning rate of 0.003, β1 = 0.9 and β2 = 0.999. The parameters are updated using the gradients of the CTC loss on each batch of 16 text lines. We apply an early stopping criterion of 10 epochs without average improvement.
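The corresponding training configuration can be sketched as follows, under the stated hyperparameters; it is an illustration of ours, not the authors' code. tf.nn.ctc_loss implements the CTC cost, with the blank as the last class, as in the model sketch above.

    optimizer = tf.keras.optimizers.Adam(learning_rate=0.003,
                                         beta_1=0.9, beta_2=0.999)

    def ctc_loss(labels, logits, label_len, logit_len):
        # Batch-major logits of shape (batch, time, classes); blank last.
        return tf.reduce_mean(tf.nn.ctc_loss(
            labels=labels, logits=logits,
            label_length=label_len, logit_length=logit_len,
            logits_time_major=False, blank_index=-1))

    early_stop = tf.keras.callbacks.EarlyStopping(patience=10,
                                                  restore_best_weights=True)
    # Training then updates the parameters on batches of 16 text lines, e.g.:
    # model.fit(train_data, batch_size=16, callbacks=[early_stop], ...)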
The selected model was the one with the best performance out of the 7 + 3, 8 + 0, 4 + 4, 5 + 5, and 6 + 6 configurations, where A + B corresponds to A convolutional layers followed by B BLSTM layers. On the other hand, for the CTC we tried best-path decoding and beam search decoding, with no significant improvement from the latter, despite its computational complexity.
B. DATABASES
In this paper we focus on HTR over eight databases: IAM [41], Washington [42], Parzival [42], and the five provided at the ICFHR 2018 Competition [5]. The RIMES database, described below, is also used as a reference.
1) THE IAM DATABASE
The IAM database [41] contains 13353 labeled text lines of modern English handwritten by 657 different writers. The images were scanned at a resolution of 300 dpi and saved as PNG images with 256 gray levels. An image of this database is included in Fig. 2 alongside the GT transcript. The database is partitioned into training, validation, and test sets of 6161, 900, and 2801 lines, respectively (the names of the images in each set are provided in the Large Writer Independent Text Line Recognition Task). Here, the validation and test sets provided are merged into a unique test set. There are 79 different characters in this database, including capital and small letters, numbers, some punctuation symbols, and white space.

2) THE RIMES DATABASE
The RIMES database is a collection of French letters handwritten by 1,300 volunteers who participated in the RIMES database creation by writing up to 5 mails. The RIMES database thus contains 12,723 pages corresponding to 5605 mails of two to three pages. In our experiments, we take a set of 12111 lines extracted from the International Conference on Document Analysis and Recognition (ICDAR) 2011 line-level competition. There are 100 different characters in this database.

3) THE WASHINGTON DATABASE
The Washington database contains 565 text lines of the George Washington letters, handwritten by two writers in the 18th century. Although the language is also English, the text is written in longhand script and the images are binarized, as illustrated in Fig. 3; see [3] for a description of the differences between binarized and binarization-free images when applying HTR tasks. In this database, four possible partitions are provided to train and validate. In this work, we have randomly chosen one of them. The train, validation, and test sets contain 325, 168 and 163 handwritten lines, respectively. There are 83 different characters in the database.

4) THE PARZIVAL DATABASE
The Parzival database contains 4477 text lines handwritten by three writers in the 13th century. In this case, the lines are binarized like in the Washington database, but the text is written in gothic script. A sample is included in Fig. 4. There are 96 different characters in this database. Note that the Parzival database has a large number of text lines in comparison to the Washington one. We have randomly chosen a training set of approximately the same size as in the Washington training to emulate learning with a small dataset, the main goal of this work.


TABLE 1. Number of lines available for training and test in each dataset.

FIGURE 4. Parzival handwritten text sample: image of a line and its transcript.

5) THE ICFHR 2018 COMPETITION OVER READ DATASET
The set of documents of the ICFHR 2018 Competition on Automated Text Recognition on a READ Dataset (https://readcoop.eu/) was proposed to compare the performance of approaches learning with few labeled pages. The dataset provided for the competition consists of 22 documents segmented at line level [5]. They are written in Italian and in modern and medieval German. Each of them was written by only one writer, but in different periods and various languages. The training data is divided into a general set (of 17 documents) and a document-specific set (of 5 documents), called Konzilsprotokolle_C, Schiller, Ricordi, Patzig, and Schwerin, with the same script as the test set. Hereafter, general is used to denote available source labeled databases different from the one of interest, while document-specific denotes particular target documents. Also, the Konzilsprotokolle_C dataset, of the University of Greifswald, will be abbreviated as Konzil. The general database comprises roughly 25 pages per document (the precise number of pages varies such that the number of contained characters is almost equal per document). It will be denoted hereafter by ICFHR18-G. For the 5 document-specific databases the authors provide 16 labeled pages plus 15 unlabeled pages. One can check the error in the transcription of these databases by sending the authors the transcription of these 15 pages; the results are then published on the web of the contest. In Fig. 5, samples from the five specific target documents are displayed. The standard Unicode Normalization Form Compatibility Decomposition (NFKD) is applied to the GT to provide a common character set over such different documents, with 102 characters. The goal of the competition is to fit a model to transcribe each of the 5 specific target documents with the lowest CER possible, using the 17 source documents available for training. For each document-specific target dataset, four experiments are conducted, simulating that 0, 1, 4, or 16 annotated pages are available for training.
In Table 1 we include the number of training and test lines available for every database. While all lines in the test sets will be used, the number of lines of the training set varies through the experiments and will be indicated in every case.

IV. ON THE DATA AUGMENTATION AND TRANSFER LEARNING TRADEOFF
As our first contribution, in this section we analyze the joint performance of TL and DA methods when applied to HTR.

A. TRANSFER LEARNING
To cope with a reduced set of labeled inputs, we could first train the DNN model using available large labeled datasets as the source. Then, we could apply TL or domain adaptation strategies [43] to tune the learned model to transcribe a target document. As discussed in Section I, we usually deal with different tasks, where TL has proven useful to share the results of the learning between tasks.
Formally, in HTR, deep learning algorithms have usually been used to solve problems over a domain D = {X, P(x)}, where P(x) is the marginal probability. Typically, x is the image of a segmented line in the text. The task consists of two components: a label space Y and an objective predictive function f(·) (denoted by T = {Y, f(·)}), which can be learned from the training data. The data consists of pairs {x_i, y_i}, where x_i ∈ X and y_i ∈ Y [44], and f(x) = Q(y|x) can be interpreted as the conditional probability distribution. Given a source domain D_S and a learning task T_S, transfer learning aims to help improve the learning of another target predictive function f_T(·) in D_T using the knowledge in D_S and T_S. In this work, we are interested in inductive transfer learning, in which the target task is different from the source task, as the domains are different (D_S ≠ D_T). Here we perform TL by retraining a DNN model where 1) all weights are initialized to the ones of the DNN learned for D_S and T_S, and 2) the parameters of lower layers can be frozen to the values obtained after training with other available source datasets, used as off-the-shelf feature extractors [45].
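In code, this TL scheme reduces to loading the source-trained weights and optionally freezing the lower layers before fine-tuning on the target. A minimal sketch follows, reusing the hypothetical build_crnn constructor from Section III-A; the checkpoint name is an assumption.

    # Sketch only: when the source and target alphabets differ, the last
    # dense layer must be rebuilt before loading/fine-tuning.
    model = build_crnn(n_labels=102)
    model.load_weights('source_model.h5')    # 1) initialize with the source weights
    model.layers[1].trainable = False        # 2) optionally freeze the first CNN layer
    # ... then compile with the CTC loss and fine-tune on the small target set.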
In [6] we analyzed preliminary TL results over the Washington and Parzival databases, using the IAM database as the source, and we investigated which layers should be kept fixed to then apply a fine-tuning process to the others. We concluded that the best choice is to unfreeze all the layers, where the first one can eventually be fixed. In most cases, fixing only the first CNN layer leads to the best performance.
In Table 2 we extend the analysis in [6] to the five specific documents in the ICFHR 2018 Competition dataset, where the 17 documents of the general set of the database, ICFHR18-G, are used as the source. Results are included when fixing layers 1 to 3 of the CNN, as fixing other layers provided larger errors in all cases. The lowest achieved errors are highlighted in boldface. Training set size is given in number of lines. It can be observed that, among all databases, the best performance is achieved when unfreezing all layers or, at most, keeping only the first layer frozen. Hereafter, TL is applied by freezing just the first layer of the DNN model. The results shown in all tables hereafter indicate mean values of CER or WER.

FIGURE 5. From top to bottom: Konzil, Schiller, Ricordi, Patzig and Schwerin handwritten text
samples with their transcripts.

To get the statistics, the model in Fig. 1 is trained 10 times, where the parameters to initialize are independently and randomly set. In Table 2, a non-parametric bootstrapped confidence interval at 95% [46] is also included. For the remaining tables, the confidence intervals can be found in the supplementary material.

TABLE 2. TL performance: Mean CER (%) and bootstrapped confidence interval at 95%, in brackets, of the model in Fig. 1 using TL for the Washington, Parzival, Konzil, Schiller, Ricordi, Patzig and Schwerin datasets (see Section III) as target domains.

In Table 2 we analyze different strategies of applying TL, with no DA, for the Washington and Parzival target datasets with the IAM database as the source, and for the Konzil, Schiller, Ricordi, Patzig, and Schwerin datasets (see Section III) as target domains with ICFHR18-G as the source. In each of the ICFHR 2018 document-specific datasets, 12 pages are used for training; see the corresponding number of lines in Table 1. We conclude that the good choice is to freeze the first convolutional layer of the model (column "CNN1"). This solution will be used later in Subsection IV-C.

B. DATA AUGMENTATION
In [32] the authors compare various DA approaches using both the RIMES [42] and IAM [41] databases as benchmarks, where transcription is made at the word level. Note that these databases have a considerably large number of labeled lines. When not applying any augmentation technique, they get a CER of 5.35 % (IAM) and 3.69 % (RIMES). The best CER values reported in [32] by using various DA techniques are 3.93 % and 1.36 %, respectively, which is equivalent to an improvement of approximately 2 percentage points in both databases.
Let us now extend the same analysis to scenarios with small training datasets: the Washington, Parzival, Konzil, Schiller, Ricordi, Patzig, and Schwerin databases. As throughout the paper, the transcriptions are made at line level. Results for the IAM, RIMES, and ICFHR18-G, i.e., the 17 documents of the general dataset in the ICFHR 2018 database, are also analyzed as references. In Table 3 we include the CER of our DNN model with no DA and with two different DA techniques, affine transformation [33] and random warp grid distortion (RWGD) [32], for all databases in Subsection III-B. We augment the training set by generating ten copies of every line in the training set. One of these copies is the original line without distortions.
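As a sketch of this step (our illustration: a small random affine distortion standing in for the transformations of [33]; the RWGD of [32] is not reproduced here), every line image yields ten copies, the first one undistorted.

    import numpy as np
    from scipy.ndimage import affine_transform

    def random_affine(img, max_shear=0.15, max_scale=0.05, rng=np.random):
        # Small random shear and scale of a line image, white background fill.
        shear = rng.uniform(-max_shear, max_shear)
        scale = 1.0 + rng.uniform(-max_scale, max_scale)
        m = np.array([[scale, shear],
                      [0.0,   scale]])
        return affine_transform(img, m, cval=255)

    def augment_tenfold(img):
        # Ten copies per training line; the first is the original image.
        return [img] + [random_affine(img) for _ in range(9)]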
In Table 3, for the largest databases, the DA improvement is around 2 percentage points (2 percentage points in IAM, 1.9 percentage points in RIMES, and 2.5 percentage points in ICFHR18-G). However, in the small databases, the CER reduction is remarkable, in the range of 5 percentage points to 23.6 percentage points; see the CERs highlighted in boldface. Note that the results in [32] differ from the ones in Table 3 because in [32] transcription is done at the word level while here whole lines are processed. This explains that in IAM without DA we get a CER of 7.2% while in [32] 5.35% is reported.
TABLE 3. DA performance: Mean CER and WER (%) with affine transformations [33] and RWGD [32] DA approaches evaluated for all datasets in Subsection III-B. The DNN is trained from scratch using the number of lines indicated by 'Train size'. Largest DA CER reductions are highlighted in boldface.

TABLE 4. TL and DA combined performance: Mean CER (%) evaluated for Washington and Parzival datasets using TL and DA with the IAM database as the source. The number of annotated lines used in training is included as 'Train size'.

TABLE 5. TL and DA combined performance: Mean CER (%) evaluated in the ICFHR 2018 Competition specific datasets as targets using TL and DA with ICFHR18-G as source. The number of annotated pages used in the training is included as 'Train size'.

In any case, it can be concluded that, since DA acts as a regularization technique to avoid overfitting, the CER reduction is greater as the size of the training set is reduced. At this point, it is most interesting to compare the results of the TL and DA techniques when applied independently. It can be observed that TL exhibits, by far, a much larger CER reduction. Next, we face the design and analysis of both techniques combined, where RWGD will be used as the DA approach.

C. COMBINING DATA AUGMENTATION AND TRANSFER LEARNING
When comparing DA with TL, the large databases are excluded from the comparison. They play the role of source databases in the TL approach: specifically, the IAM is the source dataset when Washington and Parzival are targets, and ICFHR18-G in the Konzil, Schiller, Ricordi, Patzig, and Schwerin case. The RIMES database is only used to enhance the comparisons in this section.
In the combination of TL and DA techniques, there are several possible designs. Here we propose the following two schemes. In a first approach, we perform DA both when learning from the source dataset and when retraining the model with the target one:
1) Train the model from scratch with a source dataset, applying DA.
2) Retrain the model with the target dataset, applying DA.
We name this proposal DA-TL-DA. In a second proposal, denoted by DA-TL, no DA is applied to the target:
1) Train the model from scratch with a source dataset, applying DA.
2) Retrain the model with the target dataset, without applying DA.
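In code, the two schedules differ only in whether the target set is augmented. A minimal sketch follows, where build_model, augment and the dataset objects are hypothetical placeholders rather than the authors' implementation:

    def da_tl(source_ds, target_ds, build_model, augment):
        # DA-TL: DA on the source training only.
        model = build_model()
        model.fit(augment(source_ds))   # 1) train from scratch on the source with DA
        model.fit(target_ds)            # 2) fine-tune on the target without DA
        return model

    def da_tl_da(source_ds, target_ds, build_model, augment):
        # DA-TL-DA: DA on both the source and the target training.
        model = build_model()
        model.fit(augment(source_ds))   # 1) train from scratch on the source with DA
        model.fit(augment(target_ds))   # 2) fine-tune on the target with DA
        return model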
We perform the same experiments as in Subsection IV-A, obtaining the results included in Table 4. For the sake of completeness, we also report the results of the two previous subsections for TL with the first layer frozen and for DA with RWGD. In the first step of the DA-TL and DA-TL-DA methods, the model has been trained from scratch with the IAM database. After that, we fine-tune the model using data from the Parzival and Washington databases. In Table 5 the results are shown when training the model in Fig. 1 from scratch with ICFHR18-G and fine-tuning it on the 5 specific target datasets provided. For the sake of completeness, we include in Table 5 the results for 0 pages in the target dataset, i.e., when no labeled samples from the target are used. Note that in this case DA-TL-DA cannot be applied.
In light of Table 4 and Table 5, it can be concluded that applying DA over the target training set once TL is applied, i.e., DA-TL-DA, either does not reduce the CER or even slightly increases it, compared to the result of the TL approach alone or the DA-TL method, except for Schwerin, in which DA-TL-DA slightly improves DA-TL. In other words, in general it is harmful to apply DA to the target dataset if TL has been applied, when just a reduced number of labeled lines are available in the target. On the other hand, DA-TL achieves improvements of up to 5 % in the ICFHR 2018 target documents, usually increasing with the reduction of the training set.
From the discussion above, and bearing Table 4 and Table 5 in mind, it can be concluded that DA-TL is a robust approach. When fine-tuning a DNN that has been previously trained with a similar task (a huge database of HTR samples), the starting point is reasonably good, as we can observe in Table 5 for the training set sizes of 0 pages in all datasets. We show a good generalization ability of the model for TL and DA-TL without further training with the target dataset. Afterwards, the DNN model is trained with the target database. Only a few samples are available in the target set, which represent just a limited part of the support of its marginal distribution, P_T(x).

FIGURE 6. (a) CER (%) versus the number of annotated lines, l, used, and (b) decrement of the CER (%) divided by the number of new labeled lines added to obtain it, Δl, in the training of the DNN model with the DA-TL approach, using the ICFHR18-G dataset as the source and the Konzil dataset as the target, with no artificial errors (×) and corrupted with artificial errors (□).

After TL, the parameters of the DNN encode information from both the source and the target training sets. At this point, we conjecture that by using DA in the target dataset and further refining the parameters, the DNN model overfits to the augmented versions of the target samples, forgetting the knowledge learned from the source one, which very much helps to transcribe inputs out of the support generated by augmenting the target set. This leads to an increase of the final CER.

V. THE CORRUPTED LABEL PURGING (CLP) ALGORITHM
In this section, we focus on the impact of the number of lines and their quality in the target dataset on the learning process of the DNN model. We first analyze the impact on the performance of the number of healthy lines, i.e., lines with no transcription errors, in the training dataset. Then we study how this performance degrades with label errors. Finally, we propose an algorithm to detect and remove potential label errors in the dataset.

A. PERFORMANCE VARIATION WITH THE NUMBER OF LABELED SAMPLES
When a small number of lines is available in the target training set, deep learning models are quite sensitive to small variations in the number of labeled lines. In this subsection, this sensitivity is evaluated on a specific dataset from the ICFHR 2018 Competition [5]. The chosen training dataset consists of 16 pages from the Konzil, segmented at line level. Similar results were obtained for the other datasets.
The ICFHR 2018 target datasets have 16 labeled pages each. Unless otherwise indicated, 4 of them will be used for testing purposes while up to 12 pages will be used for training. Usually, 10 % of the used training set is devoted to validation. The ICFHR18-G dataset is used as source database in the DA-TL approach.
In Fig. 6(a), the blue curve with × markers represents the DA-TL CER versus the available number of lines, l, of the target training set in the range 29-350 lines, corresponding to 1 and 12 pages, respectively. In the left part of the figure, the CER decreases at a rate of 1 percentage point every 4 new lines added to the training set. After approximately 50 lines, the decreasing rate of the CER changes to approximately 1 percentage point every 100 lines. This is evidenced in Fig. 6(b), where the absolute value of the variation of the CER in percentage points, ΔCER, is depicted against the increment of the number of annotated lines used in the target to achieve it, Δl. It can be concluded that the sensitivity to the number of samples in the training set is significantly larger for small training sets.
In Fig. 6 we also include the "Training set with errors" curve (□), which corresponds to the analysis above but where labeling errors have been artificially introduced, as follows. The annotation of a line is modified with probability L. Then, within a modified labeling, a character is changed with probability R, in both cases following a Bernoulli distribution. Every changed character is replaced by an independently and randomly selected character, following a discrete uniform probability.


FIGURE 7. Sample of a completely mislabeled text at Ricordi dataset.

FIGURE 8. Sample of special annotations in the GT at the Ricordi dataset.

In Fig. 6, where L = 0.2 and R = 0.3, it is interesting to note that the impact of labeling errors on the CER value is more dramatic for small training sets, while the rate at which the CER decreases with the number of lines added remains roughly unaltered.
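A minimal sketch of this corruption model (our illustration): each line is hit with probability L and, within a hit line, each character is replaced with probability R by one drawn uniformly from the alphabet.

    import random

    def corrupt_labels(lines, alphabet, L=0.2, R=0.3, rng=random):
        # Bernoulli(L) per line; within a selected line, Bernoulli(R) per character.
        out = []
        for text in lines:
            if rng.random() < L:
                text = ''.join(rng.choice(alphabet) if rng.random() < R else c
                               for c in text)
            out.append(text)
        return out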
B. TYPES OF TRANSCRIPTION ERRORS
Before proposing approaches to detect mislabels in the training set, we discuss three typical types and causes of errors in the datasets, as follows.
1) Mislabeled characters. When labeling a training set, the most common mistake is to confuse a character with another, usually a similar one. This can be seen in the well-known IAM database [41], where it is indicated that some lines could have some annotation errors in the labels. This type of error is the one simulated in Fig. 6.
2) Label misalignment. The second kind of detected error happens due to a misalignment in the labels. This could be caused by, e.g., a mistake in the name given to some images in the database. This error is encountered several times in the Ricordi dataset from the ICFHR 2018 Competition [5], as illustrated in Fig. 7. It can be observed in this example that the transcript does not correspond to the handwritten text in the image above. On the contrary, it is quite close to the model output, after being trained with several lines of the dataset.
3) Special annotations in the ground truth. Perhaps the most common source of error is due to special annotations that some transcribers or database managers introduce in some datasets to include notes inline. In [37] they found this problem in the IAM database: crossed-out words that are labeled with the symbol "#" followed by the word behind the blot. Training the model with this labeling might lead to unpredictable behavior, since the model could place "#" at different parts of the text. The model will either be able to recognize the text behind the blot or replace the word with the symbol "#", or both. Another special annotation is included in Fig. 8, where they write in brackets extra characters that are not in the handwritten text. The output of a model trained with samples of the same dataset is shown below the GT. Although the CER of this line is about 35%, it can be observed that the model output is quite similar to the handwritten text.
Manually annotating historical documents remains a challenging task that is prone to errors, even for experts in the field. As discussed in the previous section, when a huge set of annotated samples is available, deep learning models do not suffer from a few mislabeled samples, as they generalize better. However, when a limited set of annotated lines of a specific writer is available to train, mislabeled lines induce overfitting to transcripts with errors, which is quite hard to tackle via regularization. In the example shown in Fig. 6 we illustrate this problem when just a few mislabeled lines are introduced.

Algorithm 1 Corrupted Labels Purging (CLP)
Given inputs: source set x_S ∈ X_S and y_S ∈ Y_S, target set x_T ∈ X_T and y_T ∈ Y_T, and threshold ε.
1) Fit the prediction function f_S(y_S|x_S) with the source training set {x_S, y_S}.
2) Split the target training set into N subsets {x_T1, x_T2, ..., x_TN}, {y_T1, y_T2, ..., y_TN}.
for n = 1, ..., N do
3) Initialize the prediction function f_Tn(·) = f_S(·).
4) Fine-tune the prediction function with the whole target set except for the nth subset, {x_Ti, i ≠ n}, {y_Ti, i ≠ n}.
5) Include in the new target set, {x_T', y_T'}, all pairs {x_Tn^(i), y_Tn^(i)} whose predictions f_Tn(y_Tn^(i)|x_Tn^(i)) have errors below a CER threshold, ε.
end for
6) Initialize the prediction function f_T(·) = f_S(·).
7) Fine-tune the prediction function, f_T'(y_T'|x_T'), on the modified target set {x_T', y_T'}.
Output: Function f_T(y_T|x_T) over the target domain D_T.

C. MISLABEL DETECTION ALGORITHM
As one of our main contributions, we propose an algorithm to detect and remove mislabeled lines from the training set, detailed in Algorithm 1.

FIGURE 9. Corrupted labels purging algorithm. The procedure applied over target subset n is depicted; the same procedure is applied to all the subsets to build the modified target dataset.

A block diagram of the algorithm is also depicted in Fig. 9. It divides the target training dataset into N subsets. For every subset n, the method performs DA-TL using the rest of the subsets, k = 1, ..., N, k ≠ n, as training sets, and it evaluates the CER metric over subset n. Lines with CER above a threshold, ε, in the nth subset are detected as wrongly transcribed and discarded. Hence, we are implementing some sort of k-fold validation, in which the size of each validation partition is reduced after removing problematic lines. Finally, DA-TL is applied to the resulting target database. Note that the algorithm performs N + 1 different training steps. However, the N steps concerning the target subsets can be run in parallel, since they are independent of each other. Hence, the run time of applying this algorithm is approximately double the run time of regular training.
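The sketch below restates the CLP loop in code; clone_source_model (returning a fresh copy of the source-trained model with a finetune method), transcribe and the cer helper sketched earlier are hypothetical placeholders, so this is an outline of Algorithm 1 rather than the authors' implementation.

    def clp(target_set, clone_source_model, transcribe, cer, N=2, eps=0.5):
        # target_set: list of (image, ground-truth transcript) pairs.
        folds = [target_set[n::N] for n in range(N)]     # split into N subsets
        kept = []
        for n in range(N):
            train = [s for k, fold in enumerate(folds) if k != n for s in fold]
            model = clone_source_model()                 # init with source weights
            model.finetune(train)                        # DA-TL on the other folds
            kept += [(img, gt) for img, gt in folds[n]
                     if cer(gt, transcribe(model, img)) < eps]   # purge above eps
        final = clone_source_model()
        final.finetune(kept)                             # retrain on the purged set
        return final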
In Fig. 10 we include the histogram of the CER per line for the 5 ICFHR 2018 document-specific datasets using the CLP algorithm with N = 2. The ICFHR18-G was used as the source. The histograms were estimated with the CER of the outputs of the n = 1, 2 stages computed with the lines not used during training; see the output of the "Target subset n" blocks in Fig. 9. In the left column, models have been trained with 4 pages, while in the right column they have been trained with 12 pages. Lines are corrupted with artificial errors with probability L = 0.1, while every character in the label of a corrupted line is changed with probability R = 0.3 to a random value. Conservatively, we believe that a 10% average number of corrupted lines represents a label error rate similar to the one we encounter in real databases. It is interesting to observe that the results for the Schwerin dataset are remarkably better than for the others because it has a significantly larger number of lines per page. Besides, in the Ricordi dataset, the histogram for 12 pages exhibits large values around 0.8. This dataset is known to have label misalignments.

D. CLP THRESHOLD ANALYSIS
The selection of the threshold is central to the algorithm performance. In Fig. 10, the CER of the healthy lines is mainly distributed around a mode value, to the left of each histogram, while outliers exhibit larger values. As representative values to be studied, after extensive simulations, we restrict our analysis to the thresholds ε = 0.5 and ε = 0.7, for an average rate L = 0.1 of artificially modified lines, and R = 0.3.
In Fig. 10 we indicate the percentage of lines with CER equal to or lower than 0.5 and 0.7, left and right red dashed lines in the subfigures, respectively. We conclude that almost 10% of lines have a CER above ε = 0.7 when 4 pages are available for training, and the same occurs in the case of 12 pages when ε = 0.5.
The selection of ε should not lead to the deletion of healthy lines; otherwise, the overall CER would rise. On the other hand, the threshold must ensure sensitivity when corrupted lines are encountered.
In the following, we study the CLP algorithm in two different scenarios. The first experiment we perform consists of applying the CLP algorithm to the ICFHR 2018 target datasets, with 4 and 12 pages as target training set size. Then we evaluate the CLP for the Washington and Parzival databases, with 150 and 325 lines as target training set size.

FIGURE 10. Histogram of CER with DA-TL and ICFHR18-G as source dataset for the 5 document-specific datasets using 4 pages (left) and 12 pages (right) of the target dataset. Lines and characters were corrupted with probabilities L = 0.1 and R = 0.3, respectively. The histograms were evaluated with the outputs of the N = 2 target subsets. Red dashed lines indicate the percentage of lines with CER ≤ ε, with ε = 50% and ε = 70%, left and right lines, respectively.

The same procedure is followed through all the scenarios:
1) Fit the model to the source set.
2) Run DA-TL plus CLP with N = 2.

1) ICFHR 2018 COMPETITION RESULTS
We test the CLP algorithm over real databases where we do not have any prior knowledge about the pattern of labeling errors. We also include artificial errors to evaluate the CLP robustness.


TABLE 6. Mean CER (%) evaluated in the Konzil, Schiller, Ricordi, Patzig and Schwerin target documents in the ICFHR 2018 Competition datasets for DA-TL and DA-TL+CLP with ε = 50% and ε = 70%. DA-TL was applied with both a training set of 4 pages and of 12 pages. The annotation of a line is corrupted with probability L = 0.1 and a character within it is randomly replaced with probability R. R = 0 indicates no error introduced in the labelings. The number of lines removed by the CLP algorithm is included in parentheses in the last two columns. The best-achieved value in every row is in boldface.

The results of these analyses are reported in Table 6 and Table 7. Their three last columns include the results for DA-TL with no CLP as 'Baseline', for DA-TL+CLP with ε = 50%, and for DA-TL+CLP with ε = 70%. For every target dataset and training set size, three rows are used to report the CER (%) when no artificial errors are introduced, R = 0, and for R = 30% and R = 50%.
In this first case, the ICFHR18-G dataset is used as the source. The 17 documents of this corpus have a total number of 11424 lines. The DA-TL plus CLP was applied to the five target documents in the competition: Konzil, Schiller, Ricordi, Patzig, and Schwerin. The results are included in Table 6, which includes the average value of the CER and the number of lines removed by the CLP algorithm.
In view of the results, we highlight the following aspects. First note that, when errors are induced, the threshold ε = 70% performs better in most of the cases when the training set is of 4 pages, while the threshold ε = 50% is the best choice for 12 pages. Exceptions can be observed in the Patzig and Schwerin corpora. For the Patzig dataset, we conclude that ε = 70% is the best choice in every case. This is due to the distribution of the errors in this dataset, which has a larger variance, and therefore more lines are above the ε = 50% CER, as can be seen in Fig. 10. In the Schwerin corpus, the threshold 50% has the best CER in all cases, the opposite of the Patzig dataset. This is due to the distribution of the errors in this dataset which, due to the larger number of lines used, has lower variance, and most lines are below 10% CER (see Fig. 10).
It is also interesting to remark that in the Ricordi case, the algorithm improves the CER on the original dataset, i.e., without synthetic errors. This is explained by the fact that in this dataset, as already discussed, there are some mislabeled lines like the case illustrated in Fig. 7. Additionally, note that for R = 0 and ε = 70% a large number of removed lines is quite an indicator of the dataset containing errors in the annotated lines.
For the sake of completeness, we include in Fig. 11 the evolution of the CER versus the number of lines, l, of Fig. 6, now including the CER for the proposed CLP algorithm (•). The introduction of the CLP improves the DA-TL approach when the dataset has corrupted lines. In the range l = [40, 50], DA-TL with CLP and l = 40 lines achieves the same CER as DA-TL with l = 50 lines in the training set.

2) WASHINGTON AND PARZIVAL RESULTS
In this second analysis, the model is pre-trained with the IAM database as the source dataset, and then trained with DA-TL for the Washington and Parzival targets. There are two main differences to the previous study of the ICFHR 2018 datasets: 1) the number and set of characters are different between the source and target datasets, and 2) we compare the CER of both targets in terms of the number of lines instead of the number of pages, where we consider two cases, 150 lines and 325 lines, similar to the numbers of lines used in the previous scenario.
The first rows in Table 7 include the results obtained after fine-tuning the model to the Washington dataset. In this study, the threshold ε = 70% is the best option when the number of lines is 150, while ε = 50% exhibits the lowest CER when the number of lines is 325. This is equivalent to the 4 and 12 pages in the Konzil, Schiller, and Ricordi cases, in which the number of lines is similar. For these thresholds, we get an improvement of 0.8 and 0.63 in the case of 150 lines and no deterioration over the original dataset. In the case of 325 lines, we get a boost of 0.4 and 0.5 and no deterioration over the original dataset.
Results obtained after fine-tuning the model to the Parzival dataset are also included in Table 7; see the lower rows. Similar conclusions can be drawn, except for R = 30% and 150 lines, where the 50% threshold exhibits the best CER. If we choose the threshold as in the previous cases, 70% and 50%, we still get a slight improvement or, at least, no deterioration.

E. CORRECTING LABEL MISALIGNMENT
In Section V-B we summarized the different types of transcription errors. One of these errors is due to the misalignment of the annotations with the images. When a high number of lines are classified as mislabels, this type of error can be addressed by searching, within the outputs of the DNN model for the whole target dataset, the transcript best fitting every annotation in the GT, hence aligning annotations and images in the dataset.

FIGURE 11. CER (%) versus the number of annotated lines, l, with the DA-TL approach using the ICFHR18-G dataset as source and the Konzil dataset as target, with no artificial errors (×), with artificial errors (□) and with artificial errors and CLP used (•).

TABLE 7. Mean CER (%) evaluated in the Washington and Parzival documents for DA-TL, CLP with threshold ε = 50% and CLP with threshold ε = 70%. DA-TL was applied with the IAM dataset as the source and using 150 and 325 lines from the target. The annotation of a line is corrupted with probability L = 0.1 and a character within it is randomly replaced with probability R. R = 0 indicates no error introduced in the labelings. The number of lines removed by the CLP algorithm is included in parentheses in the last two columns.

TABLE 8. Comparison between the CLP algorithm with line removal and the CLP plus alignment of the GT after detection. The mean CER (%) is evaluated for the Ricordi document with a training set of size 4 pages (88 lines) and 12 pages (295 lines).

TABLE 9. CER (%) in the ICFHR 2018 Competition for LSTM-based models: in the upper part, other previous approaches and, in the lower part, the results for the approaches in this work. Lowest mean values in both parts are highlighted in boldface.

This approach is quite similar to the one proposed in [38].
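A sketch of this realignment step (a greedy nearest-transcript match; our simplification of the idea, not the authors' code): every GT annotation is reassigned to the line image whose model output fits it best in terms of CER.

    def realign(images, annotations, transcribe, cer):
        # `transcribe` and `cer` as in the previous sketches (hypothetical helpers).
        preds = [transcribe(img) for img in images]
        pairs = []
        for gt in annotations:
            scores = [cer(gt, p) for p in preds]
            best = min(range(len(preds)), key=scores.__getitem__)
            pairs.append((images[best], gt))   # greedy: an image may be matched twice
        return pairs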
In the case of the Ricordi dataset in the ICFHR 2018 Competition, we realized that the CLP detected a high number of mislabeled lines in the dataset; note the large numbers of removed lines in Table 6 for ε = 70% and this dataset with R = 0. By simple visual inspection we confirmed that the error was of the misalignment type, between images and annotations. Here, we apply the CLP plus the simple automatic alignment approach described above.
The comparison between simply removing the mislabeled lines and correcting the alignment of the database is shown in Table 8. In this table, one can observe a significant drop in the CER when correcting these misalignments of the lines. In the training with 4 pages, the overall decrease is 3.7 percentage points. In the 12-page analysis, the CER drops 0.3 percentage points when removing the lines, while it further decreases 0.8 percentage points when correcting them. Note also that the gain is higher when a lower number of annotated lines is used.

F. COMPARISON TO THE STATE-OF-THE-ART
Using the proposed 5+5 DNN model with CNN and BLSTM layers followed by a CTC, we conclude by analyzing the results of the novel DA-TL approach over the ICFHR 2018 Competition (the results are publicly available on the ICFHR competition website: https://scriptnet.iit.demokritos.gr/competitions/10/viewresults/). The results included in Table 9 were reported by the organizers of the competition. The contestants provide the transcript of the 15 test pages for every document in the target set: Konzil, Schiller, Ricordi, Patzig, and Schwerin. Then, the organizers evaluate the CER, publicly publishing the results. In this table our results are compared against the 5 original contestants in the competition: OSU [32], ParisTech [4], LITIS [47], PRHLT and RPPDI. These approaches use DNN models based on CNN, LSTM, and CTC, where some variant of the LSTM is used. Some of them use DA in the target and an LM. The recent work published by Yousef et al. [9], using a DNN model based on a fully gated convolutional network (GCN), outperformed the LSTM-based approaches, with a mean value of 13.02 % and a 23.35 % CER for a 0-page training size.
The results of the proposal in this work are included in the lowest rows of Table 9 where, following the conclusions in Subsection V-D, we used ε = 70% for the 1 and 4 pages training and ε = 50% for the 16 pages. Also, the CLP includes an alignment stage. Results are presented in three groups of columns. First, the average CER (%) for the 5 target datasets is included when 0, 1, 4 and 16 pages of the target datasets are used. The second group of 5 columns reports the average CER (%) for the learning with 0, 1, 4, and 16 pages in the dataset for every document. The mean value per row is included in the last column.

average CER (%) for the learning with 0, 1, 4, and 16 pages [3] M. R. Yousefi, M. R. Soheili, T. M. Breuel, E. Kabir, and D. Stricker,
in the dataset for every document. The mean value per row is ‘‘Binarization-free OCR for historical documents using LSTM networks,’’
in Proc. 13th Int. Conf. Document Anal. Recognit. (ICDAR), Aug. 2015,
included in the last column. pp. 1121–1125.
VI. CONCLUSION
In this paper, we analyze, for small training sets and in the framework of historical HTR, two techniques that are well known in almost every deep learning application: TL and DA. We show that TL improves the CER by 10–40 percentage points when applied to small training sets, of the order of 300 text lines. DA also drops the CER by 2–20 percentage points when a network is trained from scratch. Before performing TL, applying DA in the source dataset does reduce the CER. However, applying DA to the target datasets jointly with TL exhibits worse results than using TL alone. Hence, we propose the DA-TL approach, where the DA is applied to the source dataset in the TL process.
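Schematically, the DA-TL recipe can be summarized as below; `fit` and `augment` are hypothetical placeholders for any training routine and any line-level distortion, standing in for the augmentations described earlier in the paper.

```python
def train_da_tl(model, source_lines, target_lines, augment, fit):
    """DA-TL schedule (sketch): apply DA on the large source set only,
    then transfer to the small target set without augmentation, since
    augmenting the target jointly with TL performed worse than TL alone."""
    augmented = [(augment(img), label) for img, label in source_lines]
    fit(model, list(source_lines) + augmented)  # pretraining: source + DA
    fit(model, list(target_lines))              # fine-tuning: raw target lines
    return model
```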
Besides, we highlight that the DNN models are very sensitive to the number of lines in the train set when this number is low. Therefore, errors in annotated lines of small target datasets have a greater impact than in large datasets, for the same proportion of mislabels. To avoid that, we propose a method that can detect the mislabeled lines and remove them from the training set. Furthermore, we fix errors of the misalignment type by searching for the true labels in the datasets.

Compared to the state of the art in the ICFHR 2018 Competition, it can be observed that the DA-TL and CLP outperform all approaches within the CNN+LSTM+CTC class, hence underlining the importance of the issues discussed: DA is important, but in the source dataset; TL is to be considered; and mislabeling detection and correction is important if the dataset exhibits errors. Besides, the CLP introduces a residual 0.01 percentage points of loss if the datasets have no errors in the labels, while the reduction is important if they do; see the results for the Ricordi corpus, where a reduction of 6.58 percentage points is achieved. The presence of errors in this database was detected by checking the number of lines removed by the CLP.
At this point, it is interesting to mention that other variations of the algorithm have been tried to further improve the performance. In this sense, we tried to evaluate the CTC loss [17] to select a threshold ε. We found it complex to deal with because it depends on several factors, such as the number of epochs in the training or whether batch normalization has been applied. In future work, we expect to improve the algorithm in this direction. Another promising research line could be introducing DA-TL and CLP in other DNN models, such as the one based on GCN [9], which has a quite low CER for 0 pages, to further improve the CER. Besides, introducing an LM in the proposed DA-TL and CLP approaches could also be investigated.
REFERENCES
[1] N. Serrano, F. Castro, and A. Juan-Císcar, ‘‘The RODRIGO database,’’ in Proc. LREC, 2010, pp. 19–21.
[2] R. Saabni, A. Asi, and J. El-Sana, ‘‘Text line extraction for historical document images,’’ Pattern Recognit. Lett., vol. 35, no. 1, pp. 23–33, Jan. 2014.
[3] M. R. Yousefi, M. R. Soheili, T. M. Breuel, E. Kabir, and D. Stricker, ‘‘Binarization-free OCR for historical documents using LSTM networks,’’ in Proc. 13th Int. Conf. Document Anal. Recognit. (ICDAR), Aug. 2015, pp. 1121–1125.
[4] E. Chammas, C. Mokbel, and L. Likforman-Sulem, ‘‘Handwriting recognition of historical documents with few labeled data,’’ in Proc. 13th IAPR Int. Workshop Document Anal. Syst. (DAS), Apr. 2018, pp. 43–48.
[5] T. Strauß, G. Leifert, R. Labahn, T. Hodel, and G. Mühlberger, ‘‘ICFHR2018 competition on automated text recognition on a READ dataset,’’ in Proc. 16th Int. Conf. Frontiers Handwriting Recognit. (ICFHR), Aug. 2018, pp. 477–482.
[6] J. C. A. Jaramillo, J. J. Murillo-Fuentes, and P. M. Olmos, ‘‘Boosting handwriting text recognition in small databases with transfer learning,’’ in Proc. 16th Int. Conf. Frontiers Handwriting Recognit. (ICFHR), Aug. 2018, pp. 429–434.
[7] Y. Soullard, W. Swaileh, P. Tranouez, T. Paquet, and C. Chatelain, ‘‘Improving text recognition using optical and language model writer adaptation,’’ in Proc. Int. Conf. Document Anal. Recognit. (ICDAR), Sep. 2019, pp. 1175–1180.
[8] J. A. Sánchez, V. Romero, A. H. Toselli, M. Villegas, and E. Vidal, ‘‘A set of benchmarks for handwritten text recognition on historical documents,’’ Pattern Recognit., vol. 94, pp. 122–134, Oct. 2019.
[9] M. Yousef, K. F. Hussain, and U. S. Mohammed, ‘‘Accurate, data-efficient, unconstrained text recognition with convolutional neural networks,’’ Pattern Recognit., vol. 108, Dec. 2020, Art. no. 107482.
[10] J. C. Aradillas, J. J. Murillo-Fuentes, and P. M. Olmos, ‘‘Improving offline HTR in small datasets by purging unreliable labels,’’ in Proc. 17th Int. Conf. Frontiers Handwriting Recognit. (ICFHR), Dortmund, Germany, Sep. 2020, pp. 25–30.
[11] G. M. Binmakhashen and S. A. Mahmoud, ‘‘Document layout analysis: A comprehensive survey,’’ ACM Comput. Surveys, vol. 52, no. 6, pp. 1–36, Jan. 2020.
[12] S. He and L. Schomaker, ‘‘DeepOtsu: Document enhancement and binarization using iterative deep learning,’’ Pattern Recognit., vol. 91, pp. 379–390, Jul. 2019.
[13] S. Ares Oliveira, B. Seguin, and F. Kaplan, ‘‘DhSegment: A generic deep-learning approach for document segmentation,’’ in Proc. 16th Int. Conf. Frontiers Handwriting Recognit. (ICFHR), Aug. 2018, pp. 7–12.
[14] J. Oncina and M. Sebban, ‘‘Learning stochastic edit distance: Application in handwritten character recognition,’’ Pattern Recognit., vol. 39, no. 9, pp. 1575–1587, Sep. 2006.
[15] Y.-C. Wu, F. Yin, and C.-L. Liu, ‘‘Improving handwritten Chinese text recognition using neural network language models and convolutional neural network shape models,’’ Pattern Recognit., vol. 65, pp. 251–264, May 2017.
[16] S. Hochreiter and J. Schmidhuber, ‘‘Long short-term memory,’’ Neural Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[17] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, ‘‘Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks,’’ in Proc. 23rd Int. Conf. Mach. Learn. (ICML), Jun. 2006, pp. 369–376.
[18] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, ‘‘Dropout improves recurrent neural networks for handwriting recognition,’’ in Proc. 14th Int. Conf. Frontiers Handwriting Recognit., Sep. 2014, pp. 285–290.
[19] P. Voigtlaender, P. Doetsch, and H. Ney, ‘‘Handwriting recognition with large multidimensional long short-term memory recurrent neural networks,’’ in Proc. 15th Int. Conf. Frontiers Handwriting Recognit. (ICFHR), Oct. 2016, pp. 228–233.
[20] D. Castro, B. Bezerra, and M. Valenca, ‘‘Boosting the deep multidimensional long-short-term memory network for handwritten recognition systems,’’ in Proc. 16th Int. Conf. Frontiers Handwriting Recognit. (ICFHR), Aug. 2018, pp. 127–132.
[21] B. Moysset and R. Messina, ‘‘Are 2D-LSTM really dead for offline text recognition?’’ Int. J. Document Anal. Recognit., vol. 22, no. 3, pp. 193–208, Sep. 2019.
[22] J. Puigcerver, ‘‘Are multidimensional recurrent layers really necessary for handwritten text recognition?’’ in Proc. 14th IAPR Int. Conf. Document Anal. Recognit. (ICDAR), Nov. 2017, pp. 67–72.
[23] D. Coquenet, C. Chatelain, and T. Paquet, ‘‘Recurrence-free unconstrained handwritten text recognition using gated fully convolutional network,’’ in Proc. 17th Int. Conf. Frontiers Handwriting Recognit. (ICFHR), Sep. 2020, pp. 19–24.


[24] A. F. de Sousa Neto, B. L. D. Bezerra, A. H. Toselli, and E. B. Lima, ‘‘HTR-Flor++: A handwritten text recognition system based on a pipeline of optical and language models,’’ in Proc. ACM Symp. Document Eng., New York, NY, USA, Sep. 2020, pp. 1–4.
[25] F. Naiemi, V. Ghods, and H. Khalesi, ‘‘A novel pipeline framework for multi oriented scene text image detection and recognition,’’ Expert Syst. Appl., vol. 170, May 2021, Art. no. 114549.
[26] N. Daniyar, B. Kairat, K. Maksat, and A. Anel, ‘‘Classification of handwritten names of cities using various deep learning models,’’ in Proc. 15th Int. Conf. Electron., Comput. Comput. (ICECCO), Dec. 2019, pp. 1–4.
[27] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, ‘‘How transferable are features in deep neural networks?’’ in Proc. Adv. Neural Inf. Process. Syst., vol. 27, 2014, pp. 3320–3328.
[28] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, ‘‘DeCAF: A deep convolutional activation feature for generic visual recognition,’’ in Proc. Int. Conf. Mach. Learn. (ICML), Jun. 2014, vol. 32, no. 1, pp. 647–655.
[29] L. Zhu, Z. Huang, Z. Li, L. Xie, and H. T. Shen, ‘‘Exploring auxiliary context: Discrete semantic transfer hashing for scalable image retrieval,’’ IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 11, pp. 5264–5276, Nov. 2018.
[30] M. Long, Y. Cao, Z. Cao, J. Wang, and M. I. Jordan, ‘‘Transferable representation learning with deep adaptation networks,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 12, pp. 3071–3085, Dec. 2019.
[31] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, ‘‘Return of the devil in the details: Delving deep into convolutional nets,’’ in Proc. Brit. Mach. Vis. Conf., 2014. [Online]. Available: http://dx.doi.org/10.5244/C.28.6
[32] C. Wigington, S. Stewart, B. Davis, B. Barrett, B. Price, and S. Cohen, ‘‘Data augmentation for recognition of handwritten words and lines using a CNN-LSTM network,’’ in Proc. 14th IAPR Int. Conf. Document Anal. Recognit. (ICDAR), vol. 1, Nov. 2017, pp. 639–645.
[33] A. Poznanski and L. Wolf, ‘‘CNN-N-Gram for handwriting word recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 2305–2314.
[34] P. Krishnan and C. V. Jawahar, ‘‘Matching handwritten document images,’’ in Computer Vision—ECCV, B. Leibe, J. Matas, N. Sebe, and M. Welling, Eds. Cham, Switzerland: Springer, 2016, pp. 766–782.
[35] X. Shen and R. Messina, ‘‘A method of synthesizing handwritten Chinese images for data augmentation,’’ in Proc. 15th Int. Conf. Frontiers Handwriting Recognit. (ICFHR), Oct. 2016, pp. 114–119.
[36] P. Y. Simard, D. Steinkraus, and J. C. Platt, ‘‘Best practices for convolutional neural networks applied to visual document analysis,’’ in Proc. 7th Int. Conf. Document Anal. Recognit., Aug. 2003, pp. 958–963.
[37] H. Nisa, J. A. Thom, V. Ciesielski, and R. Tennakoon, ‘‘A deep learning approach to handwritten text recognition in the presence of struck-out text,’’ in Proc. Int. Conf. Image Vis. Comput. New Zealand (IVCNZ), Dec. 2019, pp. 1–6.
[38] A. B. Salah, J. P. Moreux, N. Ragot, and T. Paquet, ‘‘OCR performance prediction using cross-OCR alignment,’’ in Proc. 13th Int. Conf. Document Anal. Recognit. (ICDAR), Aug. 2015, pp. 556–560.
[39] B. Shi, X. Bai, and C. Yao, ‘‘An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition,’’ IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11, pp. 2298–2304, Nov. 2017.
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, ‘‘Dropout: A simple way to prevent neural networks from overfitting,’’ J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958, 2014.
[41] U.-V. Marti and H. Bunke, ‘‘The IAM-database: An English sentence database for offline handwriting recognition,’’ Int. J. Document Anal. Recognit., vol. 5, no. 1, pp. 39–46, Nov. 2002.
[42] A. Fischer, A. Keller, V. Frinken, and H. Bunke, ‘‘Lexicon-free handwritten word spotting using character HMMs,’’ Pattern Recognit. Lett., vol. 33, no. 7, pp. 934–942, May 2012.
[43] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
[44] S. J. Pan and Q. Yang, ‘‘A survey on transfer learning,’’ IEEE Trans. Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[45] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, ‘‘CNN features off-the-shelf: An astounding baseline for recognition,’’ in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2014, pp. 512–519.
[46] B. Efron, ‘‘Better bootstrap confidence intervals,’’ J. Amer. Stat. Assoc., vol. 82, no. 397, pp. 171–185, Mar. 1987.
[47] W. Swaileh, T. Paquet, Y. Soullard, and P. Tranouez, ‘‘Handwriting recognition with multigrams,’’ in Proc. 14th IAPR Int. Conf. Document Anal. Recognit. (ICDAR), vol. 1, Nov. 2017, pp. 137–142.

JOSÉ CARLOS ARADILLAS received the B.E. and M.Sc. degrees in telecommunication engineering from the Universidad de Sevilla, Spain, in 2015 and 2017, respectively, where he is currently pursuing the Ph.D. degree with the Department of Signal Theory and Communications. He is also an Assistant Professor with the Universidad de Sevilla. He has been a Visiting Researcher with the Max Planck Institute, Tübingen, Germany, and the Universidad Carlos III de Madrid, Spain. His current research interests include machine learning for communications and deep learning applied to cultural heritage.

JUAN JOSÉ MURILLO-FUENTES (Senior Member, IEEE) received the degree in telecommunication engineering from the Universidad de Sevilla, in 1996, and the Ph.D. degree in telecommunication engineering from the Universidad Carlos III de Madrid (UC3M), Spain, in 2001. He has been a Visiting Researcher with the University of Cambridge and UCL. Since 2016, he has been a Full Professor with the Universidad de Sevilla, teaching several bachelor's and master's courses (M.Sc. and Ph.D.) related to digital communications and machine learning. He is also a member of the Signal Processing and Learning Group, UC3M. He has published extensively, particularly in interdisciplinary fields, with more than 80 journal articles and conference papers in his active record. His research interests include algorithm development for signal processing and machine learning, and their applications to digital communication systems and cultural heritage.

PABLO M. OLMOS (Member, IEEE) received the M.Sc. and Ph.D. degrees in telecommunication engineering from the Universidad de Sevilla, in 2008 and 2011, respectively. He is currently an Assistant Professor with the Universidad Carlos III de Madrid. He has held appointments as a Visiting Researcher with Princeton University, EPFL, Notre Dame University, ENSEA, and Bell Labs. His research interests range from approximate inference methods for Bayesian machine learning to information theory and digital communications.
