Boosting Offline Handwritten Text Recognition in
June 2, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3082689
ABSTRACT In this paper we address the problem of offline handwritten text recognition (HTR) in historical
documents when few labeled samples are available and some of them contain errors in the train set. Our three
main contributions are: first, we analyze how to perform transfer learning (TL) from a massive database to
a smaller historical database, analyzing which layers of the model need fine-tuning. Second, we analyze
methods to efficiently combine TL and data augmentation (DA). Finally, we propose an algorithm to
mitigate the effects of incorrect labeling in the training set. The methods are analyzed over the ICFHR
2018 competition database, Washington and Parzival. Combining all these techniques, we demonstrate a
remarkable reduction of CER (up to 6 percentage points in some cases) in the test set with little complexity
overhead.
INDEX TERMS Connectionist temporal classification (CTC), convolutional neural networks (CNN), data
augmentation (DA), deep neural networks (DNN), historical documents, long-short-term-memory (LSTM),
offline handwriting text recognition (HTR), outlier detection, transfer learning.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
76674 VOLUME 9, 2021
J. C. Aradillas et al.: Boosting Offline HTR in Historical Documents With Few Labeled Lines
(CER) or word error rate (WER), given by the Levenshtein distance [14] between the ground truth (GT)
and the output of the model.
However, in reality, most of the time we might only have a limited number of lines for a given author
and document. Besides, transcribing part of the documents to get labeled samples is expensive in
either time or money. Take [1] as an example, where the manual transcription of a document by an
expert in paleography took an average of 35 minutes per page. In this scenario, reducing the amount
of transcription needed would greatly improve the viability and cost of the process.
The goal of the methods proposed in this paper is to significantly reduce the number of annotated
lines, thus reducing the monetary cost of the process. Contributions are as follows:
• A DNN model with transfer learning (TL), obtained by studying in detail which layers should be
retained and which ones should be retrained. We found this model to provide the best performance
among the models using long-short-term-memory (LSTM).
• A thorough analysis of HTR when little labeled data is available, studying the evolution with the
number of transcribed lines and highlighting how sensitive the models are to the number of lines
in the train set when this number is low.
• A model combining TL and data augmentation (DA). We show that for a reduced number of lines,
using DA can be counterproductive.
• An algorithm to detect and avoid mislabeled lines is proposed and tested, with good results in
small training datasets. Furthermore, in some scenarios, wrong labels are fixed.
The remainder of the paper is organized as follows: in Section II previous works reporting solutions
to the problem of HTR over small datasets are summarized; in Section III the architecture and models
used in this paper are presented; in Section IV the application of TL and DA for HTR is analyzed;
in Section V an algorithm is proposed to detect and prune mislabeled lines in the training set; the
paper ends with Section VI, drawing main conclusions.
II. RELATED WORK AND CONTRIBUTIONS
A. DNN MODEL
State-of-the-art architectures combine a convolutional neural network (CNN) [15] with a recurrent
neural network (RNN) with LSTM cells [16]. This type of network models the conditional probability,
p(l|x), of a character sequence of arbitrary length, l, given an image, x, of fixed height and
arbitrary width. Note that varying the number of characters yields a new design of the last layer of
the DNN model. These models are configured to minimize the connectionist temporal classification
(CTC) cost function proposed by Graves in [17]. In some works 2D-LSTM [16] networks are used
[18]–[21]. This RNN has two main drawbacks. On the one hand, it has an extremely large number of
parameters, which makes learning difficult. On the other hand, it is not parallelizable [22]. For
these reasons, it has been discarded here.
Regarding the state-of-the-art DNN models for HTR, some recent works aim at avoiding recurrence in
the models, developing models based on fully-convolutional networks such as the (Gated) Fully
Convolutional Networks (G)FCN [23]–[26]. This kind of model reduces the number of parameters in the
architecture.
B. TRANSFER LEARNING
In the HTR problem with a reduced training set, TL was applied by Soullard et al. in [7]. The main
idea behind TL is to initialize the parameters of a model with those learned beforehand from a huge
dataset, denoted as source. Then, the available labeled set of samples of the dataset of interest,
the target, is used to refine the parameters of the model, usually just a subset of them [27]–[30].
In [30], they analyze how to reduce the dataset shift and enhance the feature transferability in
task-specific layers of deep networks. Hence, with TL we start from parameters learned on a different
task instead of learning the whole set of parameters from scratch, which prevents overfitting and
favors convergence. In [7], they proposed a method that applies TL in both the optical and the
language model. In this and other similar previous proposals on TL, the authors applied DA in both
training and test steps.
C. DATA AUGMENTATION
DA consists in augmenting the training set with synthetically generated samples. Similar to TL, it
reduces the tendency to overfit when training models with a large number of parameters and limited
labeled data. In DA for image classification problems, the training set is increased by modifying the
original images through transformations such as scaling, rotation, or flipping images, among others
[31]. Several authors have proposed specific DA techniques for HTR: in [32] the authors apply methods
for augmentation and normalization to improve HTR by allowing the network to be more tolerant of
variations in handwriting by profile normalization. In [33] they show some affine transformation
methods for data augmentation in HTR. In [34] and [35] they synthesize new line images by
concatenating characters from different datasets. [34] does it from cursive characters while in [35]
they do it from a database of handwritten Chinese characters. Similar to [32], in [36] they also
apply some elastic distortions to the original images. In [4] the authors improve the performance by
augmenting the training set with specially crafted multiscale data. They also propose a model-based
normalization scheme that considers the variability in the writing scale at the recognition phase.
In these works, they apply DA in relatively large well-known datasets, but here we show that the
regularization effect of any DA technique has no impact when doing the fine-tuning adaptation to a
single writer in small databases. Accordingly, we conclude that the combination of TL and DA applied
to small datasets has to be done carefully, to reduce the final error.
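As an illustration of the affine-transformation family of DA mentioned above, the following minimal sketch warps a grayscale line image with a small random rotation, shear, and scaling. It is not the implementation of [33] or of this paper: the function names, the parameter ranges, the white fill value, and the nearest-neighbor resampling are all assumptions chosen for brevity. The default of ten copies per line, one of them undistorted, follows the setting used in the experiments of this paper.

```python
import numpy as np

def random_affine(rng, max_rot_deg=1.5, max_shear=0.3, max_scale=0.1):
    """Draw a random 2x2 matrix acting on (row, col) pixel coordinates (ranges are assumptions)."""
    theta = np.deg2rad(rng.uniform(-max_rot_deg, max_rot_deg))
    rotation = np.array([[np.cos(theta), -np.sin(theta)],
                         [np.sin(theta), np.cos(theta)]])
    # The column offset grows with the row index, which slants the strokes.
    shear = np.array([[1.0, 0.0],
                      [rng.uniform(-max_shear, max_shear), 1.0]])
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    return scale * (rotation @ shear)

def warp(image, matrix, fill=255):
    """Apply `matrix` about the image center with inverse mapping and nearest-neighbor sampling."""
    h, w = image.shape
    center = np.array([(h - 1) / 2.0, (w - 1) / 2.0])
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    out_coords = np.stack([rows, cols], axis=-1) - center
    # For every output pixel, find the source pixel it comes from.
    src = np.rint(out_coords @ np.linalg.inv(matrix).T + center).astype(int)
    valid = ((src[..., 0] >= 0) & (src[..., 0] < h) &
             (src[..., 1] >= 0) & (src[..., 1] < w))
    out = np.full_like(image, fill)  # pixels mapped from outside the image stay at `fill`
    picked = src[valid]
    out[valid] = image[picked[:, 0], picked[:, 1]]
    return out

def augment_line(image, n_copies=10, seed=0):
    """Return `n_copies` versions of a line image; the first one is the undistorted original."""
    rng = np.random.default_rng(seed)
    copies = [image.copy()]
    copies += [warp(image, random_affine(rng)) for _ in range(n_copies - 1)]
    return copies
```

Every copy keeps the transcript of the original line, so the labels need no modification.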
FIGURE 1. The adapted CRNN architecture from [39] used as baseline. The number of channels of each CNN
layer is shown in this scheme. Pooling layers after the first, second and third CNN layer are also depicted.
FIGURE 2. IAM handwritten text sample: image of a line and its transcript.
FIGURE 3. Washington handwritten text sample: image of a line and its transcript.
have 256 features in each direction, we perform a depth-wise concatenation to adapt the input of the
next layer to the overall size of 512. Dropout regularization [18], [40] is applied at the output of
every layer, except for the first convolutional one, with rates 0.2 for the CNN layers and 0.5 for
the BLSTM layers.
Finally, each column of features after the 5th BLSTM layer, with depth 512, is mapped into the L + 1
output labels with a fully connected layer, where L is the number of characters in the alphabet of
each database, e.g., 79, 83, 96 or 102 in the IAM, Washington, Parzival or International Conference
on Frontiers in Handwriting Recognition (ICFHR) 2018 Competition databases, respectively. The
additional dimension is needed for the blank symbol of the CTC [17], which concludes this
architecture. Overall, this CNN-BLSTM-CTC model has roughly 9.58 × 10^6 parameters, depending on the
number of characters in each database.
The architecture is implemented in the open-source framework TensorFlow in Python, using the
GPU-enabled version. We use the Adam algorithm, with a learning rate of 0.003, β1 = 0.9 and
β2 = 0.999. The parameters are updated using the gradients of the CTC loss on each batch of 16 text
lines. We apply an early stopping criterion of 10 epochs without average improvement.
The selected model was the one with the best performance out of the 7 + 3, 8 + 0, 4 + 4, 5 + 5, and
6 + 6 configurations, where A + B corresponds to A convolutional layers followed by B BLSTM layers.
On the other hand, for the CTC we tried best path decoding and beam search decoding, with no
significant improvement from the latter, despite its computational complexity.
B. DATABASES
In this paper we focus on HTR over eight databases: IAM [41], Washington [42], Parzival [42], and
the five provided at the ICFHR 2018 Competition [5].
1) THE IAM DATABASE
The IAM database [41] contains 13353 labeled text lines of modern English handwritten by 657
different writers. The images were scanned at a resolution of 300 dpi and saved as PNG images with
256 gray levels. An image of this database is included in Fig. 2 alongside the GT transcript. The
database is partitioned into training, validation, and test sets of 6161, 900, and 2801 lines,
respectively (see footnote 2). Here, the validation and test sets provided are merged into a single
test set. There are 79 different characters in this database, including capital and small letters,
numbers, some punctuation symbols, and white space.
2) THE RIMES DATABASE
The RIMES database is a collection of French letters handwritten by 1,300 volunteers who
participated in the RIMES database creation by writing up to 5 mails each. The RIMES database thus
contains 12,723 pages corresponding to 5605 mails of two to three pages. In our experiments, we take
a set of 12111 lines extracted from the International Conference on Document Analysis and
Recognition (ICDAR) 2011 line-level competition. There are 100 different characters in this database.
3) THE WASHINGTON DATABASE
The Washington database contains 656 text lines of the George Washington letters, handwritten by two
writers in the 18th century. Although the language is also English, the text is written in longhand
script and the images are binarized as illustrated in Fig. 3; see [3] for a description of the
differences between binarized and binarization-free images when applying HTR tasks. In this
database, four possible partitions are provided to train and validate. In this work, we have
randomly chosen one of them. The train, validation, and test sets contain 325, 168 and 163
handwritten lines, respectively. There are 83 different characters in the database.
4) THE PARZIVAL DATABASE
The Parzival database contains 4477 text lines handwritten by three writers in the 13th century. In
this case, the lines are binarized as in the Washington database, but the text is written in gothic
script. A sample is included in Fig. 4. There are 96 different characters in this database. Note
that the Parzival database has a large number of text lines in comparison to the Washington one. We
have randomly chosen a training set of approximately the same size as the Washington training set to
emulate learning with a small dataset, the main goal of this work.
2 The names of the images of each set are provided in the Large Writer Independent Text Line
Recognition Task.
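Returning to the decoder choice discussed in Subsection III-A, best path decoding can be sketched in a few lines: take the most likely label at every column of the output, collapse consecutive repeats, and drop the blank symbol. The snippet below is a minimal illustration rather than the code used in this work. The function name is hypothetical, and placing the blank at the last index (position L of the L + 1 outputs) is an assumption.

```python
import numpy as np

def best_path_decode(scores, alphabet, blank=None):
    """Greedy CTC decoding of a (time_steps, L + 1) score matrix into a string."""
    scores = np.asarray(scores)
    if blank is None:
        blank = scores.shape[1] - 1  # assumption: the blank is the last output label
    path = scores.argmax(axis=1)     # most likely label at every column of features
    decoded = []
    previous = None
    for label in path:
        # Emit a character only when the label changes and is not the blank.
        if label != previous and label != blank:
            decoded.append(alphabet[label])
        previous = label
    return "".join(decoded)
```

A repeated character survives only when a blank separates its two occurrences, which is why the additional output label is required.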
TABLE 1. Number of lines available for training and test in each dataset.
FIGURE 5. From top to bottom: Konzil, Schiller, Ricordi, Patzig and Schwerin handwritten text
samples with their transcripts.
TABLE 2. TL performance: Mean CER (%) and bootstrapped confidence interval at 95%, in brackets, of
the model in Fig. 1 using TL for the Washington, Parzival, Konzil, Schiller, Ricordi, Patzig and
Schwerin datasets (see Section III) as target domains.
of CER or WER. To get the statistics, the model in Fig. 1 is trained 10 times, with the initial
parameters set independently and randomly each time. In Table 2, a non-parametric bootstrapped
confidence interval at 95% [46] is also included. For the remaining tables, the confidence intervals
can be found in the supplementary material.
In Table 2 we analyze different strategies of applying TL, with no DA, for the Washington and
Parzival target datasets with the IAM database as the source, and for the Konzil, Schiller, Ricordi,
Patzig, and Schwerin datasets (see Section III) as target domains with ICFHR18-G as the source. In
each of the ICFHR 2018 document-specific datasets, 12 pages are used for training; see the
corresponding number of lines in Table 1. We conclude that a good choice is to freeze the first
convolutional layer of the model (column ‘‘CNN1’’). This solution will be used later in
Subsection IV-C.
B. DATA AUGMENTATION
In [32] the authors compare various DA approaches using both the RIMES [42] and IAM [41] databases
as benchmarks, where transcription is made at the word level. Note that these databases have a
considerably large number of labeled lines. When not applying any augmentation technique, they get a
CER of 5.35 % (IAM) and 3.69 % (RIMES). The best CER values reported in [32] by using various DA
techniques are 3.93 % and 1.36 %, respectively, which is equivalent to an improvement of
approximately 2 percentage points in both databases.
Let us now extend the same analysis to scenarios with small training datasets: the Washington,
Parzival, Konzil, Schiller, Ricordi, Patzig, and Schwerin databases. As throughout the paper, the
transcriptions are made at line level. Results for the IAM, RIMES, and the ICFHR18-G, i.e., the 17
documents of the general dataset in the ICFHR 2018 database, are also analyzed as references. In
Table 3 we include the CER of our DNN model with no DA and two different DA techniques, affine
transformation [33] and random warp grid distortion (RWGD) [32], for all databases in
Subsection III-B.
We augment the training set by generating ten copies of every line in the training set. One of these
copies is the original line without distortions.
In Table 3, for the largest databases, the DA improvement is around 2 percentage points (2
percentage points in IAM, 1.9 percentage points in RIMES, and 2.5 percentage points in ICFHR18-G).
However, in the small databases, the CER reduction is remarkable, in the range of 5 percentage
points to 23.6 percentage points; see CERs highlighted in boldface. Note that the results in [32]
are different from the ones in Table 3 because, while in [32] transcription is done at the word
level, here whole lines are processed. This explains why, in IAM without DA, we get a CER of 7.2%
while [32] reports 5.35%.
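The statistics reported in this section (a mean CER over 10 independently initialized training runs, with a non-parametric bootstrapped interval at 95%) can be computed along the following lines. This is a sketch under assumptions: the percentile bootstrap is used here, which may differ from the exact procedure of [46], and the function name, the number of resamples, and the seed are hypothetical.

```python
import numpy as np

def bootstrap_mean_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Return (mean, lower, upper) using a percentile bootstrap of the mean."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample the runs with replacement and record the mean of each resample.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    alpha = (1.0 - level) / 2.0
    lower, upper = np.quantile(means, [alpha, 1.0 - alpha])
    return values.mean(), lower, upper
```

With only 10 runs the interval is necessarily coarse, since every resampled mean is an average of the same 10 observed values.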
TABLE 3. DA performance: Mean CER and WER (%) with affine transformations [33] and RWGD [32] DA
approaches evaluated for all datasets in Subsection III-B. The DNN is trained from scratch using the
number of lines indicated by ‘Train size’. Largest DA CER reductions are highlighted in boldface.
TABLE 4. TL and DA combined performance: Mean CER (%) evaluated for Washington and Parzival datasets
using TL and DA with IAM database as the source. The number of annotated lines used in training is
included as ‘Train size’.
FIGURE 6. (a) CER (%) divided by the number of annotated lines, $l$, used and (b) decrement of
CER (%) divided by the number of new labeled lines added to obtain it, $\Delta l$, in the training of
the DNN model with DA-TL approach using the ICFHR18-G dataset as the source and the Konzil
dataset as the target, with no artificial errors (×) and corrupted with artificial errors.
of its marginal distribution, $P_T(x)$. After TL, the parameters of the DNN encode information from
both the source and the target training sets. At this point, we conjecture that by using DA in the
target dataset and further refining the parameters, the DNN model overfits to the augmented versions
of the target samples, forgetting the knowledge learned from the source one, which greatly helps to
transcribe inputs outside the support generated by augmenting the target set.
This leads to an increase of the final CER.
V. THE CORRUPTED LABEL PURGING (CLP) ALGORITHM
In this section, we focus on the impact of the number of lines and their quality in the target
dataset on the learning process of the DNN model. We first analyze the impact on the performance of
the number of healthy lines, i.e., lines with no transcription errors in the training dataset. Then
we study how this performance degrades with label errors. Finally, we propose an algorithm to detect
and remove potential label errors in the dataset.
A. PERFORMANCE VARIATION WITH THE NUMBER OF LABELED SAMPLES
When a small number of lines is available in the target training set, deep learning models are quite
sensitive to small variations in the number of labeled lines. In this subsection, this sensitivity
is evaluated on a specific dataset from the ICFHR 2018 Competition [5]. The chosen training dataset
consists of 16 pages from the Konzil, which is segmented at line level. Similar results were
obtained for the other datasets.
The ICFHR 2018 target datasets have 16 labeled pages each. Unless otherwise indicated, 4 of them
will be used for testing purposes while up to 12 pages will be used for training. Usually, 10 % of
the training set used is devoted to validation. The ICFHR18-G dataset is used as source database in
the TL-DA approach.
In Fig. 6(a), the blue curve (×) represents the TL-DA CER versus the available number of lines, $l$,
of the target training set in the range 29-350 lines, corresponding to 1 and 12 pages, respectively.
In the left part of the figure, the CER decreases at a rate of 1 percentage point every 4 new lines
added to the training set. After approximately 50 lines, the decreasing rate of the CER changes to
approximately 1 percentage point every 100 lines. This is evidenced in Fig. 6(b), which depicts the
ratio between the absolute value of the variation of the CER in percentage points, $\Delta$CER, and
the increment of the number of annotated lines used in the target to achieve it, $\Delta l$. It can be
concluded that the sensitivity to the number of samples in the training set is significantly larger
for small training sets.
In Fig. 6 we also include the ‘‘Training set with errors’’ curve, which corresponds to the analysis
above but where labeling errors have been artificially introduced, as follows. The annotation of a
line is modified with probability L. Then, within a line selected for modification, each character
is changed with probability R. Both decisions follow a Bernoulli distribution. Every changed
character is replaced by an independently and
randomly selected character, following a discrete uniform probability. In Fig. 6, where L = 0.2 and
R = 0.3, it is interesting to note that the impact of labeling errors on the CER value is more
dramatic for small training sets while the rate at which the CER decreases with the number of lines
added remains roughly unaltered.
B. TYPES OF TRANSCRIPTION ERRORS
Before proposing approaches to detect mislabels in the training set, we discuss three typical types
and causes of errors in the datasets, as follows.
1) Mislabeled characters. When labeling a training set, the most common mistake is to confuse a
character with another, usually a similar one. This can be seen in the well-known IAM database [41],
where it is indicated that some lines could have some annotation errors in the labels. This type of
error is the one simulated in Fig. 6.
2) Label Misalignment. The second kind of detected error happens due to a misalignment in the
labels. This could be caused by, e.g., a mistake in the name given to some images in the database.
This error is encountered several times in the Ricordi dataset from the ICFHR 2018 Competition [5]
as illustrated in Fig. 7. It can be observed in this example that the transcript does not correspond
to the handwritten text in the image above. On the contrary, the handwritten text is quite close to
the output of the model after being trained with several lines of the dataset.
3) Special annotations in the ground truth. Perhaps the most common source of error is due to
special annotations that some transcribers or database managers introduce in some datasets to
include some notes inline. In [37] they found this problem in the IAM database: crossed out words
that are labeled with the symbol ‘‘#’’ followed by the word behind the blot. Training the model with
this labeling might lead to unpredictable behavior, since the model could replace the text using
‘‘#’’ at different parts of the text. The model will either be able to recognize the text behind the
blot or replace the word with the symbol ‘‘#’’, or both. Another special annotation is included in
Fig. 8, where they write in brackets extra characters that are not in the handwritten text. The
output of a model trained with samples of the same dataset is shown below the GT. Although the CER
in this line is about 35%, it can be observed that the model output is quite similar to the
handwritten text.
Manually annotating historical documents remains a challenging task that is prone to errors, even
for experts in the field. As discussed in the previous section, when a huge set of annotated samples
is available, deep learning models do not suffer from a few mislabeled samples, as they generalize
better. However, when only a limited set of annotated lines of a specific writer is available to
train, mislabeled lines induce overfitting to transcripts with errors, which is quite hard to tackle
via regularization. In the example shown in Fig. 6 we illustrate this problem when just a few
mislabeled lines are introduced.
Algorithm 1 Corrupted Labels Purging (CLP)
Given inputs: source set $x_S \in X_S$ and $y_S \in Y_S$, target set $x_T \in X_T$ and
$y_T \in Y_T$, and a CER threshold.
1) Fit the prediction function $f_S(y_S|x_S)$ with the source training set $\{x_S, y_S\}$.
2) Split the target training set into $N$ subsets $\{x_{T_1}, x_{T_2}, \ldots, x_{T_N}\}$,
$\{y_{T_1}, y_{T_2}, \ldots, y_{T_N}\}$.
for $n = 1, \ldots, N$ do
3) Initialize the prediction function $f_{T_n}(\cdot) = f_S(\cdot)$.
4) Fine tune the prediction function with the whole target set except for the $n$th subset,
$\{x_{T_i \neq T_n}\}$, $\{y_{T_i \neq T_n}\}$.
5) Include in the new target set, $\{x_{T'}, y_{T'}\}$, all pairs $\{x_{T_n}^{(i)}, y_{T_n}^{(i)}\}$
whose predictions $f_{T_n}(y_{T_n}^{(i)}|x_{T_n}^{(i)})$ have errors below the CER threshold.
end for
6) Initialize the prediction function $f_T(\cdot) = f_S(\cdot)$.
7) Fine tune the prediction function, $f_{T'}(y_{T'}|x_{T'})$, to the modified target set
$\{x_{T'}, y_{T'}\}$.
Output:
Function $f_T(y_T|x_T)$ over the target domain $D_T$.
C. MISLABEL DETECTION ALGORITHM
As one of our main contributions, we propose an algorithm to detect and remove mislabeled lines from
the training set,
FIGURE 9. Corrupted labels purging algorithm. The algorithm applied over target subset n is
depicted. The same procedure should be applied to all the subsets to build the modified target
dataset.
detailed in Algorithm 1. A block diagram of the algorithm is also depicted in Fig. 9. It divides the
target training dataset into N subsets. For every subset, n, the method performs DA-TL using the
rest of the subsets, $k = 1, \ldots, N$, $k \neq n$, as training sets and it evaluates the CER metric
over the subset n. Lines with CER above a given threshold in the nth subset are detected as wrongly
transcribed and discarded. Hence, we are implementing some sort of k-fold validation, in which the
size of each validation partition is reduced after removing problematic lines. Finally, the DA-TL is
applied to the resulting target database. Note the algorithm performs N + 1 different training
steps. However, the N steps concerning the target subsets could be run in parallel, since they are
independent of each other. Hence, the run time of applying this algorithm is approximately double
the run time of regular training.
In Fig. 10 we include the histogram of the CER per line for the 5 ICFHR 2018 document-specific
datasets using the CLP algorithm with N = 2. The ICFHR18-G was used as the source. The histograms
were estimated with the CER of the outputs of the n = 1, 2 stages computed with the lines not used
during training; see the output of the ‘‘Target subset n’’ blocks in Fig. 9. In the left column,
models have been trained with 4 pages while in the right column they have been trained with 12
pages. Lines are corrupted with artificial errors with probability L = 0.1, while every character in
the label of a corrupted line is changed with probability R = 0.3 to a random value. Conservatively,
we believe that a 10% average number of corrupted lines represents a label error rate similar to the
one we encounter in real databases. It is interesting to observe that the results for the Schwerin
dataset are remarkably better than for the others because it has a significantly larger number of
lines per page. Besides, in the Ricordi dataset, the histogram for 12 pages exhibits large values
around 0.8. This dataset is known to have label misalignments.
D. CLP THRESHOLD ANALYSIS
The selection of the threshold is central to the algorithm performance. In Fig. 10 the CER of the
healthy lines is mainly distributed around a mode value, to the left of each histogram, while
outliers exhibit larger values. As representative values to be studied, after extensive simulations,
we restrict our analysis to the threshold values 0.5 and 0.7, for an average rate L = 0.1 of
artificially modified lines, and R = 0.3. In Fig. 10 we indicate the percentage of lines with CER
equal to or lower than 0.5 and 0.7, left and right red dashed lines in the subfigures, respectively.
We conclude that almost 10% of lines have a CER above the threshold of 0.7 when 4 pages are
available for training, and the same occurs in the case of 12 pages with the threshold of 0.5.
The selected threshold should not lead to the deletion of healthy lines; otherwise, the overall CER
would rise. On the other hand, the threshold must remain sensitive enough to catch corrupted lines
when they are encountered.
In the following, we study the CLP algorithm in two different scenarios. The first experiment we
perform consists in applying the CLP algorithm to the ICFHR 2018 target datasets, with 4 and 12
pages as target training set size. Then we evaluate the CLP for the Washington and
FIGURE 10. Histogram of CER with DA-TL and ICFHR18-G as source dataset for the 5 document-specific
datasets using 4 pages (left) and 12 pages (right) of the target dataset. Lines and characters were corrupted
with probabilities L = 0.1 and R = 0.3 respectively. The histograms were evaluated with the outputs of the
N = 2 target subsets. Red dashed lines indicate the percentage of lines with CER at or below the
thresholds of 50% and 70%, left and right lines, respectively.
Parzival databases, with 150 and 325 lines as target training set size. The same procedure is
followed through all the scenarios:
1) Fit the model to the source set.
2) Run DA-TL plus CLP with N = 2.
1) ICFHR 2018 COMPETITION RESULTS
We test the CLP algorithm over real databases where we do not have any prior knowledge about the
pattern of labeling errors. We also include artificial errors to evaluate the CLP robustness.
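The purging step of the CLP algorithm can be summarized with the following minimal sketch. It is an illustration under stated assumptions, not the code used in the experiments: `fine_tune` is a hypothetical callable that stands for DA-TL starting from the source model and returns a prediction function, the interleaved split into subsets is an arbitrary choice, and the CER is computed as the Levenshtein distance divided by the length of the label.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, 1):
        current = [i]
        for j, char_b in enumerate(b, 1):
            current.append(min(previous[j] + 1,                       # deletion
                               current[j - 1] + 1,                    # insertion
                               previous[j - 1] + (char_a != char_b))) # substitution
        previous = current
    return previous[-1]

def cer(label, prediction):
    """Character error rate of a prediction with respect to its label."""
    return levenshtein(label, prediction) / max(len(label), 1)

def clp_purge(target, fine_tune, threshold=0.5, n_subsets=2):
    """Keep the (image, label) pairs whose held-out CER does not exceed the threshold.

    `fine_tune(train_pairs)` is a hypothetical stand-in for DA-TL initialized from the
    source model; it must return a function mapping an image to a predicted string.
    """
    subsets = [target[n::n_subsets] for n in range(n_subsets)]
    kept = []
    for n, held_out in enumerate(subsets):
        # Train on every subset except the n-th one, then score the n-th one.
        train = [pair for k, subset in enumerate(subsets) if k != n for pair in subset]
        predict = fine_tune(train)
        kept += [(image, label) for image, label in held_out
                 if cer(label, predict(image)) <= threshold]
    return kept
```

The final model is then obtained by running `fine_tune` once more on the list returned by `clp_purge`, which gives the N + 1 training steps mentioned in Subsection V-C.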
TABLE 6. Mean CER (%) evaluated in Konzil, Schiller, Ricordi, Patzig and Schwerin target documents
in the ICFHR 2018 Competition datasets for DA-TL and DA-TL+CLP with thresholds of 50% and 70%. DA-TL
was applied with both a training set of 4 pages and 12 pages. The annotation for a line is corrupted
with probability L = 0.1 and a character within it randomly replaced with probability R. R = 0
indicates no error introduced in the labelings. The number of removed lines by the CLP algorithm is
included in parentheses in the last two columns. The best-achieved value in every row is in
boldface.
used, has lower variance and most lines are below 10% CER (see Fig. 10).
It is also interesting to remark that in the Ricordi case, the algorithm improves the CER in the
original dataset, i.e., without synthetic errors. This is explained by the fact that in this
dataset, as already discussed, there are some mislabeled lines like in the case illustrated in
Fig. 7. Additionally, note that for R = 0 and a threshold of 70%, a large number of removed lines is
a strong indicator that the dataset contains errors in the annotated lines.
For the sake of completeness, we include in Fig. 11 the evolution of the CER versus the number of
lines, l, shown in Fig. 6, now including the CER for the proposed algorithm (CLP) (◦). The
introduction of the CLP improves the TL-DA approach when the dataset has corrupted lines. In the
range l = [40, 50], the TL-DA with CLP with l = 40 achieves the same CER as the DA-TL with l = 50
lines in the training set.
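The artificial label errors used in these experiments follow a simple two-stage procedure: a line is selected for corruption with probability L and, within a selected line, every character is replaced with probability R by a character drawn uniformly from the alphabet. A minimal sketch is given below. The function name, the seed handling, and the decision to allow a replacement to coincide with the original character are assumptions.

```python
import random

def corrupt_labels(labels, alphabet, line_prob=0.1, char_prob=0.3, seed=0):
    """Return a copy of `labels` with artificial transcription errors.

    line_prob plays the role of L and char_prob the role of R in the text.
    """
    rng = random.Random(seed)
    corrupted = []
    for text in labels:
        if rng.random() < line_prob:  # Bernoulli(L): is this line modified?
            # Bernoulli(R) per character; replacements are uniform over the alphabet
            # and, as an assumption, may coincide with the original character.
            text = "".join(rng.choice(alphabet) if rng.random() < char_prob else char
                           for char in text)
        corrupted.append(text)
    return corrupted
```

Setting `char_prob=0` reproduces the R = 0 rows of the tables, where no error is introduced in the labelings.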
FIGURE 11. CER (%) divided by the number of annotated lines, l, with the DA-TL approach using
the ICFHR18-G dataset as source and the Konzil dataset as target with no artificial errors (×), with
artificial errors, and with artificial errors and CLP used (•).
TABLE 7. Mean CER (%) evaluated in Washington and Parzival documents for DA-TL, CLP with a threshold
of 50% and CLP with a threshold of 70%. DA-TL was applied with the IAM dataset as the source and
using 150 and 325 lines from the target. The annotation for a line is corrupted with probability
L = 0.1 and a character within it randomly replaced with probability R. R = 0 indicates no error
introduced in the labelings. The number of removed lines by the CLP algorithm is included in
parentheses in the last two columns.
TABLE 8. Comparison between the CLP algorithm with line removal and the CLP plus alignment of the GT
after detection. The mean CER (%) is evaluated for the Ricordi document with a training set of size
4 pages (88 lines) and 12 pages (295 lines).
TABLE 9. CER ICFHR 2018 Competition results for LSTM based models:
upper part, other previous approaches and, in the lower part, the results
for the approaches in this work. Lowest mean values in both parts are
highlighted in boldface.
images in the dataset. This approach is quite similar to the
one proposed in [38].

In the case of the Ricordi dataset in the ICFHR 2018 competition,
we realized that the CLP detected a high number of mislabeled lines
in the dataset. Note the large numbers of removed lines in Table 6
for a threshold of 70% and this dataset with R = 0. By simple visual
inspection we confirmed that the errors were misalignments between
images and annotations. Here, we apply the CLP plus the simple
automatic alignment approach described above.

The comparison between simply removing the mislabeled lines and
correcting the alignment of the database is shown in Table 8. In this
table, one can observe a significant drop in the CER when correcting
these misalignments of the lines. In the training with 4 pages, the
overall decrease is 3.7 percentage points. In the 12-page analysis,
the CER drops 0.3 percentage points when removing the lines, while it
decreases a further 0.8 percentage points when correcting them. Note
also that the gain is higher when fewer annotated lines are used.

F. COMPARISON TO THE STATE-OF-THE-ART

By using the proposed 5+5 DNN model with CNN and BLSTM layers followed
by a CTC, we conclude by analyzing the results of the novel DA-TL
approach over the ICFHR 2018 Competition.3 The results included in
Table 9 were reported by the organizers of the competition. The
contestants provide the transcript of the 15 test pages for every
document in the target set: Konzil, Schiller, Ricordi, Patzig, and
Schwerin. Then, the organizers evaluate the CER and publish the
results. In this table our results are compared against the 5 original
contestants in the competition: OSU [32], ParisTech [4], LITIS [47],
PRHLT and RPPDI. These approaches use DNN models based on CNN, LSTM,
and CTC, where some variant of the LSTM is used. Some of them use DA
in the target and LM. The recent work published by Yousef et al. [9],
using a DNN model based on a fully gated convolutional network (GCN),
outperformed the LSTM-based approaches, with a mean value of 13.02%,
providing a 23.35% CER for a 0-page training size.

3 The results are publicly available in the ICFHR competition website:
https://scriptnet.iit.demokritos.gr/competitions/10/viewresults/

The results of the proposal in this work are included in the lowest
rows of Table 9 where, following the conclusions in Subsection V-D, we
used a threshold of 70% for the 1- and 4-page training and of 50% for
the 16 pages. Also, the CLP includes an alignment stage. Results are
presented in three groups of columns. First, the average CER (%) for
the 5 target datasets is included when 0, 1, 4 and 16 pages of the
target datasets are used. The second group of 5 columns reports the
average CER (%) for the learning with 0, 1, 4, and 16 pages
in the dataset for every document. The mean value per row is
included in the last column.

VI. CONCLUSION

In this paper, we analyze, for small training sets and in the
framework of historical HTR, two well-known techniques in almost
every deep learning application: TL and DA. We show that TL improves
the CER by 10 to 40 percentage points when applied to small training
sets, of the order of 300 text lines. DA also reduces the CER by 2 to
20 percentage points when a network is trained from scratch. Before
performing TL, applying DA in the source dataset does reduce the CER.
However, applying DA to the target datasets jointly with TL exhibits
worse results than using TL alone. Hence, we propose the DA-TL
approach where the DA is applied to the source dataset in the TL
process.

Besides, we highlight that the DNN models are very sensitive to the
number of lines in the train set when this number is low. Therefore,
errors in annotated lines of small target datasets have a greater
impact than in large datasets, for the same proportions of mislabels.
To avoid that, we propose a method that can detect the mislabeled
lines and remove them from the training set. Furthermore, we fix
errors of the misalignment type by searching for the true labels in
the datasets.

Compared with the state of the art in the ICFHR 2018 Competition, it
can be observed that the DA-TL and CLP outperform all approaches
within the CNN+LSTM+CTC class, hence underlining the importance of the
issues discussed: DA is important but in the source dataset, TL is to
be considered, and mislabeling detection and correction is important
if the dataset exhibits errors. Besides, the CLP introduces a residual
0.01 percentage points of loss if the datasets have no errors in the
labels, while the reduction is important if they have; see the results
for the Ricordi corpus, where a reduction of 6.58 percentage points is
achieved. The presence of errors in this database was detected by
checking the number of removed lines by the CLP.

At this point, it is interesting to mention that other variations of
the algorithm have been tried to further improve the performance. In
this sense, we tried to evaluate the CTC loss [17] to select the
threshold. We found it complex to deal with because it depends on
several factors like the number of epochs in the training or if batch
normalization has been applied. In future work, we expect to improve
the algorithm in this way. Another promising research line could be
introducing DA-TL and CLP in other DNN models, such as the one based
on GCN [9], which has a quite low value for 0 pages, to further
improve the CER. Besides, introducing LM in the proposed DA-TL and CLP
approaches could also be investigated.

REFERENCES
[1] N. Serrano, F. Castro, and A. Juan-Císcar, “The RODRIGO database,” in
Proc. LREC, 2010, pp. 19–21.
[2] R. Saabni, A. Asi, and J. El-Sana, “Text line extraction for historical
document images,” Pattern Recognit. Lett., vol. 35, no. 1, pp. 23–33,
Jan. 2014.
[3] M. R. Yousefi, M. R. Soheili, T. M. Breuel, E. Kabir, and D. Stricker,
“Binarization-free OCR for historical documents using LSTM networks,”
in Proc. 13th Int. Conf. Document Anal. Recognit. (ICDAR), Aug. 2015,
pp. 1121–1125.
[4] E. Chammas, C. Mokbel, and L. Likforman-Sulem, “Handwriting
recognition of historical documents with few labeled data,” in Proc.
13th IAPR Int. Workshop Document Anal. Syst. (DAS), Apr. 2018,
pp. 43–48.
[5] T. Strauß, G. Leifert, R. Labahn, T. Hodel, and G. Mühlberger,
“ICFHR2018 competition on automated text recognition on a READ
dataset,” in Proc. 16th Int. Conf. Frontiers Handwriting Recognit.
(ICFHR), Aug. 2018, pp. 477–482.
[6] J. C. A. Jaramillo, J. J. Murillo-Fuentes, and P. M. Olmos, “Boosting
handwriting text recognition in small databases with transfer learning,”
in Proc. 16th Int. Conf. Frontiers Handwriting Recognit. (ICFHR),
Aug. 2018, pp. 429–434.
[7] Y. Soullard, W. Swaileh, P. Tranouez, T. Paquet, and C. Chatelain,
“Improving text recognition using optical and language model writer
adaptation,” in Proc. Int. Conf. Document Anal. Recognit. (ICDAR),
Sep. 2019, pp. 1175–1180.
[8] J. A. Sánchez, V. Romero, A. H. Toselli, M. Villegas, and E. Vidal,
“A set of benchmarks for handwritten text recognition on historical
documents,” Pattern Recognit., vol. 94, pp. 122–134, Oct. 2019.
[9] M. Yousef, K. F. Hussain, and U. S. Mohammed, “Accurate, data-efficient,
unconstrained text recognition with convolutional neural networks,”
Pattern Recognit., vol. 108, Dec. 2020, Art. no. 107482.
[10] J. C. Aradillas, J. J. Murillo-Fuentes, and P. M. Olmos, “Improving
offline HTR in small datasets by purging unreliable labels,” in Proc.
17th Int. Conf. Frontiers Handwriting Recognit. (ICFHR), Dortmund,
Germany, Sep. 2020, pp. 25–30.
[11] G. M. Binmakhashen and S. A. Mahmoud, “Document layout analysis:
A comprehensive survey,” ACM Comput. Surveys, vol. 52, no. 6, pp. 1–36,
Jan. 2020.
[12] S. He and L. Schomaker, “DeepOtsu: Document enhancement and
binarization using iterative deep learning,” Pattern Recognit., vol. 91,
pp. 379–390, Jul. 2019.
[13] S. Ares Oliveira, B. Seguin, and F. Kaplan, “DhSegment: A generic
deep-learning approach for document segmentation,” in Proc. 16th Int.
Conf. Frontiers Handwriting Recognit. (ICFHR), Aug. 2018, pp. 7–12.
[14] J. Oncina and M. Sebban, “Learning stochastic edit distance:
Application in handwritten character recognition,” Pattern Recognit.,
vol. 39, no. 9, pp. 1575–1587, Sep. 2006.
[15] Y.-C. Wu, F. Yin, and C.-L. Liu, “Improving handwritten chinese text
recognition using neural network language models and convolutional
neural network shape models,” Pattern Recognit., vol. 65, pp. 251–264,
May 2017.
[16] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural
Comput., vol. 9, no. 8, pp. 1735–1780, 1997.
[17] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist
temporal classification: Labelling unsegmented sequence data with
recurrent neural networks,” in Proc. 23rd Int. Conf. Mach. Learn.
(ICML), Jun. 2006, pp. 369–376.
[18] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour, “Dropout improves
recurrent neural networks for handwriting recognition,” in Proc. 14th
Int. Conf. Frontiers Handwriting Recognit., Sep. 2014, pp. 285–290.
[19] P. Voigtlaender, P. Doetsch, and H. Ney, “Handwriting recognition with
large multidimensional long short-term memory recurrent neural
networks,” in Proc. 15th Int. Conf. Frontiers Handwriting Recognit.
(ICFHR), Oct. 2016, pp. 228–233.
[20] D. Castro, B. Bezerra, and M. Valenca, “Boosting the deep
multidimensional long-short-term memory network for handwritten
recognition systems,” in Proc. 16th Int. Conf. Frontiers Handwriting
Recognit. (ICFHR), Aug. 2018, pp. 127–132.
[21] B. Moysset and R. Messina, “Are 2D-LSTM really dead for offline
text recognition?” Int. J. Document Anal. Recognit., vol. 22, no. 3,
pp. 193–208, Sep. 2019.
[22] J. Puigcerver, “Are multidimensional recurrent layers really necessary
for handwritten text recognition?” in Proc. 14th IAPR Int. Conf.
Document Anal. Recognit. (ICDAR), Nov. 2017, pp. 67–72.
[23] D. Coquenet, C. Chatelain, and T. Paquet, “Recurrence-free
unconstrained handwritten text recognition using gated fully
convolutional network,” in Proc. 17th Int. Conf. Frontiers Handwriting
Recognit. (ICFHR), Sep. 2020, pp. 19–24.
[24] A. F. de Sousa Neto, B. L. D. Bezerra, A. H. Toselli, and E. B. Lima,
“HTR-Flor++: A handwritten text recognition system based on a pipeline
of optical and language models,” in Proc. ACM Symp. Document Eng.,
New York, NY, USA, Sep. 2020, pp. 1–4.
[25] F. Naiemi, V. Ghods, and H. Khalesi, “A novel pipeline framework for
multi oriented scene text image detection and recognition,” Expert Syst.
Appl., vol. 170, May 2021, Art. no. 114549.
[26] N. Daniyar, B. Kairat, K. Maksat, and A. Anel, “Classification of
handwritten names of cities using various deep learning models,” in
Proc. 15th Int. Conf. Electron., Comput. Comput. (ICECCO), Dec. 2019,
pp. 1–4.
[27] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are
features in deep neural networks?” in Proc. Adv. Neural Inf. Process.
Syst., vol. 27, 2014, pp. 3320–3328.
[28] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and
T. Darrell, “DeCAF: A deep convolutional activation feature for generic
visual recognition,” in Proc. Int. Conf. Mach. Learn. (ICML), Jun. 2014,
vol. 32, no. 1, pp. 647–655.
[29] L. Zhu, Z. Huang, Z. Li, L. Xie, and H. T. Shen, “Exploring auxiliary
context: Discrete semantic transfer hashing for scalable image
retrieval,” IEEE Trans. Neural Netw. Learn. Syst., vol. 29, no. 11,
pp. 5264–5276, Nov. 2018.
[30] M. Long, Y. Cao, Z. Cao, J. Wang, and M. I. Jordan, “Transferable
representation learning with deep adaptation networks,” IEEE Trans.
Pattern Anal. Mach. Intell., vol. 41, no. 12, pp. 3071–3085, Dec. 2019.
[31] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman, “Return
of the devil in the details: Delving deep into convolutional
nets,” in Proc. Brit. Mach. Vis. Conf., 2014. [Online]. Available:
http://dx.doi.org/10.5244/C.28.6
[32] C. Wigington, S. Stewart, B. Davis, B. Barrett, B. Price, and S. Cohen,
“Data augmentation for recognition of handwritten words and lines using
a CNN-LSTM network,” in Proc. 14th IAPR Int. Conf. Document Anal.
Recognit. (ICDAR), vol. 1, Nov. 2017, pp. 639–645.
[33] A. Poznanski and L. Wolf, “CNN-N-Gram for Handwriting Word
recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR),
Jun. 2016, pp. 2305–2314.
[34] P. Krishnan and C. V. Jawahar, “Matching handwritten document images,”
in Computer Vision—ECCV, B. Leibe, J. Matas, N. Sebe, and M. Welling,
Eds. Cham, Switzerland: Springer, 2016, pp. 766–782.
[35] X. Shen and R. Messina, “A method of synthesizing handwritten chinese
images for data augmentation,” in Proc. 15th Int. Conf. Frontiers
Handwriting Recognit. (ICFHR), Oct. 2016, pp. 114–119.
[36] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for
convolutional neural networks applied to visual document analysis,” in
Proc. 7th Int. Conf. Document Anal. Recognit., Aug. 2003, pp. 958–963.
[37] H. Nisa, J. A. Thom, V. Ciesielski, and R. Tennakoon, “A deep learning
approach to handwritten text recognition in the presence of struck-out
text,” in Proc. Int. Conf. Image Vis. Comput. New Zealand (IVCNZ),
Dec. 2019, pp. 1–6.
[38] A. B. Salah, J. P. Moreux, N. Ragot, and T. Paquet, “OCR performance
prediction using cross-OCR alignment,” in Proc. 13th Int. Conf. Document
Anal. Recognit. (ICDAR), Aug. 2015, pp. 556–560.
[39] B. Shi, X. Bai, and C. Yao, “An end-to-end trainable neural network
for image-based sequence recognition and its application to scene text
recognition,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 11,
pp. 2298–2304, Nov. 2017.
[40] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and
R. Salakhutdinov, “Dropout: A simple way to prevent neural networks
from overfitting,” J. Mach. Learn. Res., vol. 15, no. 1, pp. 1929–1958,
2014.
[41] U.-V. Marti and H. Bunke, “The IAM-database: An English sentence
database for offline handwriting recognition,” Int. J. Document Anal.
Recognit., vol. 5, no. 1, pp. 39–46, Nov. 2002.
[42] A. Fischer, A. Keller, V. Frinken, and H. Bunke, “Lexicon-free
handwritten word spotting using character HMMs,” Pattern Recognit.
Lett., vol. 33, no. 7, pp. 934–942, May 2012.
[43] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge,
MA, USA: MIT Press, 2016.
[44] S. Jialin Pan and Q. Yang, “A survey on transfer learning,” IEEE Trans.
Knowl. Data Eng., vol. 22, no. 10, pp. 1345–1359, Oct. 2010.
[45] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, “CNN features
Off-the-shelf: An astounding baseline for recognition,” in Proc. IEEE
Conf. Comput. Vis. Pattern Recognit. Workshops, Jun. 2014, pp. 512–519.
[46] B. Efron, “Better bootstrap confidence intervals,” J. Amer. Stat. Assoc.,
vol. 82, no. 397, pp. 171–185, Mar. 1987.
[47] W. Swaileh, T. Paquet, Y. Soullard, and P. Tranouez, “Handwriting
recognition with multigrams,” in Proc. 14th IAPR Int. Conf. Document
Anal. Recognit. (ICDAR), vol. 1, Nov. 2017, pp. 137–142.

JOSÉ CARLOS ARADILLAS received the B.E. and M.Sc. degrees in
telecommunication engineering from the Universidad de Sevilla, Spain,
in 2015 and 2017, respectively, where he is currently pursuing the
Ph.D. degree with the Department of Signal Theory and Communications.
He is also an Assistant Professor with the Universidad de Sevilla. He
has been a Visiting Researcher with the Max Planck Institute, Tübingen,
Germany, and the Universidad Carlos III de Madrid, Spain. His current
research interests include machine learning for communications and
deep learning applied to cultural heritage.

JUAN JOSÉ MURILLO-FUENTES (Senior Member, IEEE) received the degree in
telecommunication engineering from the Universidad de Sevilla, in 1996,
and the Ph.D. degree in telecommunication engineering from the
Universidad Carlos III de Madrid (UC3M), Spain, in 2001. He has been a
Visiting Researcher with the University of Cambridge and UCL. Since
2016, he has been a Full Professor with the Universidad de Sevilla,
teaching several bachelor’s and master’s courses (M.Sc. and Ph.D.)
related to digital communications and machine learning. He is also a
member of the Signal Processing and Learning Group, UC3M. He has
published extensively, particularly in interdisciplinary fields, with
more than 80 journal articles and conference papers in his active
record. His research interests include algorithm development for signal
processing and machine learning, and their applications to digital
communication systems and cultural heritage.

PABLO M. OLMOS (Member, IEEE) received the M.Sc. degree and the Ph.D.
degree in telecommunication engineering from the Universidad de
Sevilla, in 2008 and 2011, respectively. He is currently an Assistant
Professor with the Universidad Carlos III de Madrid. He has held
appointments as a Visiting Researcher with Princeton University, EPFL,
Notre Dame University, ENSEA, and Bell Labs. His research interests
include approximate inference methods for Bayesian machine learning
to information theory and digital communications.