HANDWRITING
RECOGNITION, DEVELOPMENT
AND ANALYSIS
No part of this digital document may be reproduced, stored in a retrieval system or transmitted in any form or
by any means. The publisher has taken reasonable care in the preparation of this digital document, but makes no
expressed or implied warranty of any kind and assumes no responsibility for any errors or omissions. No
liability is assumed for incidental or consequential damages in connection with or arising out of information
contained herein. This digital document is sold with the clear understanding that the publisher is not engaged in
rendering legal, medical or any other professional services.
COMPUTER SCIENCE, TECHNOLOGY
AND APPLICATIONS
HANDWRITING
RECOGNITION, DEVELOPMENT
AND ANALYSIS
All rights reserved. No part of this book may be reproduced, stored in a retrieval system or transmitted in any form or by
any means: electronic, electrostatic, magnetic, tape, mechanical photocopying, recording or otherwise without the written
permission of the Publisher.
We have partnered with Copyright Clearance Center to make it easy for you to obtain permissions to reuse content from
this publication. Simply navigate to this publication’s page on Nova’s website and locate the “Get Permission” button
below the title description. This button is linked directly to the title’s permission page on copyright.com. Alternatively, you
can visit copyright.com and search by title, ISBN, or ISSN.
For further questions about using the service on copyright.com, please contact:
Copyright Clearance Center
Phone: +1-(978) 750-8400 Fax: +1-(978) 750-4470 E-mail: [email protected].
The Publisher has taken reasonable care in the preparation of this book, but makes no expressed or implied warranty of any
kind and assumes no responsibility for any errors or omissions. No liability is assumed for incidental or consequential
damages in connection with or arising out of information contained in this book. The Publisher shall not be liable for any
special, consequential, or exemplary damages resulting, in whole or in part, from the readers’ use of, or reliance upon, this
material. Any parts of this book based on government reports are so indicated and copyright is claimed for those parts to
the extent applicable to compilations of such works.
Independent verification should be sought for any data, advice or recommendations contained in this book. In addition, no
responsibility is assumed by the publisher for any injury and/or damage to persons or property arising from any methods,
products, instructions, ideas or otherwise contained in this publication.
This publication is designed to provide accurate and authoritative information with regard to the subject matter covered
herein. It is sold with the clear understanding that the Publisher is not engaged in rendering legal or any other professional
services. If legal or any other expert assistance is required, the services of a competent person should be sought. FROM A
DECLARATION OF PARTICIPANTS JOINTLY ADOPTED BY A COMMITTEE OF THE AMERICAN BAR
ASSOCIATION AND A COMMITTEE OF PUBLISHERS.
Additional color graphics may be available in the e-book version of this book.
Contents

Preface

Part I. Recognition and Development

Chapter 1. Handwriting Recognition: Overview, Challenges and Future Trends
Everton Barbosa Lacerda, Thiago Vinicius M. de Souza, Cleber Zanchettin, Juliano Cícero Bitu Rabelo and Lara Dantas Coutinho

Chapter 2. Thresholding
Edward Roe and Carlos Alexandre Barros de Mello

Chapter 6. Handwritten and Printed Image Datasets: A Review and Proposals for Automatic Building
Gearlles V. Ferreira, Felipe M. Gouveia, Byron L. D. Bezerra, Eduardo Muller, Cleber Zanchettin and Alejandro Toselli

Part II. Analysis and Applications

Chapter 7. Mathematical Expression Recognition
Francisco Álvaro, Joan Andreu Sánchez and José Miguel Benedí

Chapter 11. Handwritten Keyword Spotting: The Query by Example (QbE) Case
Georgios Barlas, Konstantinos Zagoris and Ioannis Pratikakis
Preface

The primary goal of this book is to present and discuss recent advances and ongoing developments in the Handwritten Text Recognition (HTR) field, resulting from work done on different HTR-related topics toward more accurate and efficient recognition systems. Nowadays, there is enormous worldwide interest in HTR systems, mostly driven by the emergence of new portable devices incorporating handwriting recognition functions. Other sources of interest are biometric identification systems employing handwritten signatures, as well as the requirements of cultural heritage institutions, such as historical archives and libraries, that need to preserve their large collections of historical (handwritten) documents. The book is organized into two sections: the first is mainly devoted to describing the current state of the art in HTR and the latest advances in some of the steps involved in the HTR workflow (that is, preprocessing, feature extraction, recognition engines, etc.), whereas the second focuses on some relevant HTR-related applications.
In more depth, the first part offers an overview of the current state of the art of HTR technology and introduces the new challenges and research opportunities in the field. It also provides a general discussion of ongoing approaches to the underlying search problems, building on existing HTR methods, in terms of both accuracy and efficiency. In particular, there are chapters focused on image thresholding and enhancement, text image preprocessing techniques for historical handwritten documents, and feature extraction methods for HTR. Likewise, in line with the breakout success of Deep Neural Networks (DNNs) in the field, a whole chapter is devoted to the design of HTR systems based on DNNs. Finally, a chapter listing the most used benchmarking datasets for HTR is also included, providing detailed information about which types of HTR systems (on/off-line) and features are commonly considered for each of them.
In the second part, several systems – also developed on the basis of the fundamental
concepts and general approaches outlined in the first part – are described for several HTR-
related applications. Presented in the corresponding chapters, these applications cover a wide
spectrum of scenarios: mathematical formulae recognition, scripting language recognition,
multimodal handwriting-speech recognition, hardware design for on-line HTR, student
performance evaluation through handwriting analysis, performance evaluation methods,
keyword spotting, and handwritten signature verification systems.
Last but not least, it is important to remark that, to a large extent, this book is the result of work carried out by several researchers in the Handwritten Text Recognition field. Therefore, it owes credit to these researchers, who have directly contributed with their ideas, discussions and technical collaborations and who, in one manner or another, have made it possible.
Chapter 1

HANDWRITING RECOGNITION: OVERVIEW, CHALLENGES AND FUTURE TRENDS

Everton Barbosa Lacerda¹, Thiago Vinicius M. de Souza¹, Cleber Zanchettin¹, Juliano Cícero Bitu Rabelo² and Lara Dantas Coutinho²

¹ Centro de Informática, Universidade Federal de Pernambuco, Recife, Brazil
² Document Solutions, Recife, Brazil
1. Introduction
Handwriting recognition has been an important research field since the early days of computer science and engineering. The appealing motivation and convenience of automatically reading paper documents and converting them to digital format have always pushed the area forward, and both academia and industry have been developing studies and products which aim to read such documents. Besides, in spite of major efforts devoted to bringing about a paper-free society, a huge number of paper documents are generated and processed by computers every day, all over the world, in order to handle, retrieve and store information (Bortolozzi et al., 2005).
At the beginning, due to several factors, the recognition of machine-printed documents evolved more quickly. This is a result of the constrained set of symbols (the fonts available in computer systems) and their uniform layout, size, and position. In addition, structured layouts are commonly seen in machine-printed documents, which also facilitates the recognition process, since it makes finding and isolating words or characters easier. Hence, the widespread use and dissemination of OCR software has been built on structured printed documents.
E-mail addresses: [email protected] (E. B. Lacerda), [email protected] (T. V. M. de Souza), [email protected] (C. Zanchettin), [email protected] (J. C. B. Rabelo), [email protected] (L. D. Coutinho).
Research on handwritten character recognition began in the 1950s with the creation of the first commercial OCRs. Even with the technological advances in image processing and acquisition devices, the problem remains challenging and relevant for new researchers. The task itself consists of detecting and recognizing the characters present in an input image and converting them to the encoded pattern corresponding to each character found.
The character recognition process is handled according to how the writing information is obtained. There are two common ways of obtaining information about the writing of characters: (1) when there are pre-existing handwritten documents and the input images are acquired via scanners or cameras, the process is called "offline recognition"; in this scenario, we only have information about the image intensity, that is, the values of each pixel in the coded image; (2) when the writing is made directly on devices capable of capturing the Cartesian coordinates and information inherent to the writing process itself, such as stroke velocity, pen pressure or the order of the traces, the process is called "online recognition". Generally, the effective use of that temporal information by online recognition techniques yields better results in comparison to offline methods.
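To make the distinction concrete, here is a minimal sketch (illustrative only; the names and fields are our own and not taken from any specific toolkit) of how the two input modalities are commonly represented:

```python
import numpy as np

# Offline recognition: the input is a scanned image, i.e., a 2D array of
# pixel intensities (grayscale here); only image intensity is available.
offline_sample = np.zeros((28, 28), dtype=np.uint8)  # e.g., a 28x28 digit image

# Online recognition: the input is a temporal sequence of pen samples.
# Each sample carries the pen position plus process information such as a
# timestamp and pressure; stroke order is preserved by the sequence order.
online_sample = [
    # (x, y, t_milliseconds, pressure)
    (10.0, 12.5, 0, 0.41),
    (10.8, 13.1, 16, 0.55),
    (11.9, 14.0, 32, 0.60),
]
```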
The recognition of manuscripts is much harder than that of printed texts. Several factors contribute to this: (i) the great variability of writing styles, leading to a virtually infinite set of possible shapes for the same symbol or letter; this is easily seen in the writing of different people, but also happens when a person's calligraphy changes over time; (ii) the similarity between some characters is high; (iii) touching and overlapping characters (Mello et al., 2012). Poorly written and degraded characters may make the recognition of this kind of document even more difficult.
The aforementioned issues refer only to the characters themselves. Besides those, there are difficulties that affect both printed and handwritten documents, such as background noise, poor image quality, degradation over time, etc. Normally, however, those problems are more harmful to handwriting recognition. Perhaps this results from the intrinsic difficulty of coping with this kind of document as opposed to printed text: it is not possible to make general assumptions about the document content or layout that may facilitate the recognition process.
Regarding text recognition, there are strategies that refer to text granularity, i.e., whether we are concerned with sentences, words, or characters. The classical approach is to segment the document into regions, lines, words, and finally, characters, and execute the classification of symbols corresponding to some alphabet, e.g. Latin, Arabic, Chinese, etc. In that case, the classification phase aims to label each isolated character with its correct class.

By far, most of the applications and research in recognition are based on that framework. Nevertheless, because of the difficulties of isolating characters, there are also methods working on words or sentences. The advantages are the possibility of using context, in other words, utilizing the results of previous words to assist the recognition of the next one, or the application of dictionaries, which can help correct words when some characters are wrong.
Thus, in this book we address different fields and challenges of handwriting recognition. In doing this, we chose to divide it into two main parts: the first part of the book, called Recognition and Development, comprising Chapters 1-6, covers core concepts and challenges.
In this first chapter, we begin by presenting the most recent methods in each of the mentioned approaches. In order to make reading and understanding easier, and to enable specific searches, i.e., to make it possible to consult only the desired domain, we discuss the different application areas (digits, characters and words) separately in the following sections. Later, we comment on the tendencies and possible future breakthroughs in this evolving and fascinating research field.
Chapter 2 explores some recent algorithms for thresholding document images. Although this is a theme with works dating from decades ago, it is still unsolved. When documents have particular features such as texture, patterns, smears, smudges, folding marks, adhesive tape marks, ink bleed or bleed-through effects, the process of finding the correct separation between background and foreground is not so simple.
In Chapter 3, the recent advances and ongoing developments for historical handwritten
document processing are investigated. It outlines the main challenges involved, the different
tasks that have to be implemented, as well as practices and technologies that currently exist
in the literature.
Chapter 4 investigates different approaches for feature extraction, revising the literature
and proposing an approach based on the application of the CDF 9/7 Wavelet Transform
(WT) in order to represent the content of each slice.
Chapter 5 revises important aspects to take into account when building neural networks
for handwriting recognition in the hybrid NN/HMM framework, providing a better under-
standing and evaluation of their relative importance. The authors show that deep neural
networks produce consistent and significant improvements over networks with one or two
hidden layers, independently of the kind of neural network (MLP or RNN) and of input
(handcrafted features or pixels).
Motivated by (i) the absence of datasets for every language used in the world; (ii) the fact that none of the existing datasets for a specific language is large and diverse enough to produce recognition systems as reliable as human readers; and (iii) the impracticality of manually building large text image datasets given the diversity of real-world applications, Chapter 6 presents two techniques to generate large and diverse datasets, one for handwritten text images and the other for machine-printed ones.
In the second part of this book, named Analysis and Applications (Chapters 7-15), different authors propose techniques to address handwriting recognition in diverse contexts.
In Chapter 7, the authors present the main challenges in the recognition of mathematical ex-
pressions, and propose an integrated approach to address them. A formal statistical frame-
work of a model based on two-dimensional grammars and its associated parsing algorithm
are presented.
Chapter 8 presents the state of the art of online handwriting recognition of main In-
dian scripts and then proposes a general scheme to recognize Indian scripts. The authors
combine online and offline information to classify segmented primitives.
Chapter 9 describes in detail the historical handwritten document analysis for Southeast
Asian palm leaf manuscripts by reporting the latest studies and experimental results of
document analysis tasks which range from corpus collection, ground truth data generation,
binarization process to the isolated character recognition and the word spotting tasks.
A multimodal interactive transcription system where user feedback is provided by
means of touchscreen pen strokes, traditional keyboard, and mouse operations is presented
in Chapter 10. The combination of the main and the feedback data streams is based on the
use of Confusion Networks derived from the output of three recognition systems: two hand-
written text recognition systems (off-line and on-line), and an automatic speech recognition
system.
Chapter 11 explores the evolution of keyword spotting in handwritten documents, focusing on the Query by Example case, where the query is a word image. It aims to present in a concise manner the distinct algorithms which have been presented over more than two decades, so that useful conclusions can be drawn for the future steps in this exciting research area.
The details of the development, including the background, structure, fabrication method, performance and applications of a handwriting-enabled twisting-ball display, are discussed in Chapter 12. This technology will be applicable to next-generation electronic whiteboards.
Proficiency in writing is still a goal that students should achieve. In this context, Chapter 13 aims to investigate the performance of Brazilian students during a thematic writing task regarding the speed and legibility criteria, according to the Brazilian adaptation of the Detailed Assessment of Speed of Handwriting. Although this study is not directly related to automatic reading methods or practices, it concerns the very object of recognition methods, handwriting, and therefore it is interesting to consider the writing process when thinking about automatic reading.
Another interesting application of handwriting recognition is the automatic processing of signatures. In this scenario, the purpose of Chapter 14 is to analyze and discuss the most used datasets in the literature in order to identify the challenges pursued by the community in the past few years. In addition, the authors propose a new dataset and appropriate metrics to analyze and evaluate signature verification techniques.
The possibility of acquiring handwritten on-line signatures is rising rapidly due to the availability of low-cost acquisition systems integrated into mobile devices, such as
smartphones, tablets, PDAs, etc. In Chapter 15, the most interesting current challenges
of the handwritten on-line signature processing are identified and promising directions for
further research are highlighted.
2. Models
This section presents a brief overview and explanation of the models that lay the foundation for state-of-the-art methods. It is not meant to explore all details of the algorithms' development and training; however, it presents their principles and ideas, which in this context is sufficient to ease the understanding of the literature techniques and may help in selecting one algorithm or another when developing new methods.
2.1. k-Nearest Neighbors

The k-nearest neighbors (k-NN) algorithm builds on the intuition that the feature values of instances belonging to the same class do not deviate much from one another. Thus, the distance between those instances should be small.
Therefore, k-NN is a nonparametric method which works as follows: we store all available samples, which correspond to the training data, composed of the example features and their labels. Then, we compare the input example to all training data and assign its class to be the same as the majority of the k nearest samples of the training set, where k is a user-defined constant. Figure 1 shows the functioning principle of k-NN. In this scenario, the input is marked as a star, while we have two classes (empty and full circles). In both situations illustrated by Figure 1, k = 3 and k = 7, the method predicts the input as belonging to the full circle class.
Thus, it is possible to observe that parameter k plays a pivotal role in k-NN results.
It is not hard to notice that depending on this value, the algorithm may change its predic-
tion. So, the best value for this parameter depends on the problem and, specifically, on the data. A common practice for estimating k is to vary its value from one to the square root of the number of training samples (Duda et al., 2000) and choose the value that achieves the best results. Alternatively, it is possible to weight the neighbors such that nearer
neighbors contribute more to the result. In that case, the weights are normally proportional
to the inverse of the distance from the query point.
Proximity definition depends on the distance measure. The most employed distance metric in k-NN is the Euclidean distance, although various others can be found in the literature, such as the city-block or Manhattan, Mahalanobis, and Minkowski distances, to cite only a few. Like k, the distance measure may also change the output of the algorithm. An exploratory analysis can indicate the most suitable measure for a specific data set.
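As a concrete illustration of the procedure above, here is a minimal k-NN sketch (a toy example of our own, using Euclidean distance and an unweighted majority vote):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Classify x by majority vote among its k nearest training samples."""
    # Euclidean distance from x to every stored training sample; another
    # metric (Manhattan, Mahalanobis, Minkowski, ...) could be swapped in.
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest samples.
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two classes in 2D; the query point is closest to class 1.
X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])
print(knn_predict(X, y, np.array([0.8, 0.9]), k=3))  # -> 1
```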
The shortcomings of k-NN are mainly its dependency on the data structure and the large processing cost when the training set grows. Since the algorithm is based on comparing the input to the training data, the larger the training set, the greater the number of operations and, consequently, the processing time. To overcome this, it is possible to implement pruning policies, which aim to decrease the number of training examples, normally based on some similarity measure. In this context, it is known that several similar examples probably do not help to discriminate data. Therefore, some of those examples can be excluded without loss of generalization, thus helping to improve the performance of the method.
2.2. Neural Networks

Initially, neural networks were formed by a single layer, and consequently their training was straightforward, since the output is directly observable and can thus be used to guide the weight adjustment. The drastic limitation is that single-layer networks are only able to solve linearly separable problems. The solution to that limitation appeared with the description of the backpropagation algorithm (Rumelhart et al., 1986a)(Rumelhart et al., 1986b).
The fundamental idea of the algorithm is to use gradient descent to calculate the hidden layers' errors from an estimate of the effect they cause on the output layer error. Thus, the output error is calculated and backpropagated to the hidden layers, making it possible to update the weights proportionally to the values of the connections between the layers. Due to the use of the gradient, activation functions need to be differentiable. That justifies the use of sigmoid functions, since they are a differentiable approximation of the step function (used early on in Rosenblatt's Perceptron (Haykin, 2009)).
The training proceeds in two phases: (i) the forward phase, when the input signal is propagated through the network until it reaches the output; and (ii) the backward phase, when an error is obtained by comparing the output of the network with the desired response. That resulting error is propagated from output to input (backward direction), layer by layer, allowing the weights of each layer to be updated accordingly.
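To make the two phases concrete, here is a minimal sketch of one training step for a one-hidden-layer MLP with sigmoid activations (our own illustrative code; the layer sizes and learning rate are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(0, 0.1, (4, 8)), np.zeros(8)   # input (4) -> hidden (8)
W2, b2 = rng.normal(0, 0.1, (8, 3)), np.zeros(3)   # hidden (8) -> output (3)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_step(x, target, lr=0.1):
    global W1, b1, W2, b2
    # Forward phase: propagate the input signal to the output.
    h = sigmoid(x @ W1 + b1)
    y = sigmoid(h @ W2 + b2)
    # Backward phase: the output error is backpropagated, layer by layer,
    # using the sigmoid derivative y * (1 - y).
    delta_out = (y - target) * y * (1 - y)
    delta_hid = (delta_out @ W2.T) * h * (1 - h)
    # Gradient-descent updates, proportional to the connection values.
    W2 -= lr * np.outer(h, delta_out); b2 -= lr * delta_out
    W1 -= lr * np.outer(x, delta_hid); b1 -= lr * delta_hid

train_step(np.array([0.5, -0.2, 0.1, 0.9]), np.array([1.0, 0.0, 0.0]))
```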
2.3. Support Vector Machines

Figure 3. Correct decision surfaces: (a) smaller margin, and (b) maximal margin.
The basic procedure to determine the optimal hyperplane only permits the classification of linearly separable data. In order to treat non-separable data, the concept of soft margins is introduced. In that case, classification errors are allowed during training to provide wider margins, which tends to increase the generalization power of the classifier. Figure 4 shows this situation, exhibiting both linearly separable and non-separable data in items (a) and (b) respectively, where the points marked as ξ are on the wrong side of the decision surface.
Although margin softening is quite useful, it is not sufficient by itself to give the required classification skills to SVM.

Figure 4. Support vector classification: (a) linearly separable data, hard margins, and (b) non-linearly separable data, soft margins (adapted from (Hastie et al., 2009)).

This accrues from the fact that hyperplanes are not always
adequate to separate the input data; there are situations in which a non-linear surface would be more suitable. The required conversion or mapping is obtained by the "kernel trick" (Haykin, 2009). The idea is that non-linearly separable input data may become linearly separable in another, usually higher-dimensional, space, in which it is possible to define a hyperplane that discriminates the given data. Figure 5 presents such a transformation.
More details about SVM training, including the optimization problem used to determine the support vectors, can be found in (Haykin, 2009)(Hastie et al., 2009). As with other classifiers, SVM has parameters which define its performance on a given problem. The main ones are the kernel function and its internal parameters (examples of kernel functions are the polynomial and the Gaussian or RBF, radial basis function, kernels) and the regularization constant used in margin softening (Hastie et al., 2009). Notably, SVMs are more sensitive to the variation of their parameters, which is considered a shortcoming of the method (Braga et al., 2007).
As discussed, SVM was originally designed for binary problems. Naturally, there are several multiclass problems, and two principal strategies have been used to deal with this situation. One of them is based on modifying SVM training; interesting results were obtained, but the computational cost is very high. The second approach consists of decomposing the multiclass scenario into various binary problems, to which SVMs are applied as usual. This strategy is used more often (Braga et al., 2007).
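As a brief illustration of the decomposition strategy, the sketch below uses scikit-learn (assumed to be available; the data is synthetic and stands in for, e.g., digit features) to train one binary RBF-kernel SVM per class:

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic 3-class problem.
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=6, n_classes=3, random_state=0)

# One binary SVM per class (one-vs-rest decomposition); the kernel choice
# and the regularization constant C are the main parameters to tune.
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.0, gamma="scale"))
clf.fit(X, y)
print(clf.predict(X[:5]))
```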
2.4. Committees
Generally, the most common practice in the use of learning machines is to perform several trainings with a set of examples, test the performance of the model on a validation set, and proceed by modifying the model parameters until a better performance is obtained, finally using the model on the test set to get the best hit rate. This approach makes us think that we are choosing the best possible classifier. However, it is worth mentioning that there is a large stochastic factor in the selection of validation sets: even with a careful distribution of this set, it is possible that the network does very well on that chosen part but does not achieve the same performance on the test set.
Machine learning committees are mechanisms that seek to combine the performance of
several specialist machines to achieve the same common goal. The idea is that the weakness
of a machine trained in a particular situation is compensated by the support of the others
(Sammut and Webb, 2011). Those mechanisms are generally constructed in two ways,
being defined as static or dynamic structures.
Static structure committees use a mechanism that combines the results of several classifiers without the interference of the input signals. The ensemble is a static structure committee that performs a linear combination of the results of different machines, which can be done by averaging the outputs of the machines or by selecting the most voted result. Another example of a static committee is the boosting mechanism, which combines several weak classifiers into a strong classifier. The AdaBoost algorithm (Freund et al., 1996) is a remarkable example of this type of mechanism. In this technique, the idea is to train a series of new classifiers to correct the mistakes made by the previous machines and to combine the output results.
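A minimal sketch of the simplest static combination, averaging the class-probability outputs of independently trained members (a toy example of our own):

```python
import numpy as np

def ensemble_average(prob_outputs):
    """Static-structure committee: average the class-probability outputs
    of several independently trained classifiers.

    prob_outputs: list of arrays, each of shape (n_classes,).
    """
    return np.mean(prob_outputs, axis=0)

# Toy usage: three members disagree individually, but the committee
# output favors class 1 overall.
member_outputs = [
    np.array([0.2, 0.7, 0.1]),
    np.array([0.5, 0.4, 0.1]),
    np.array([0.1, 0.8, 0.1]),
]
print(ensemble_average(member_outputs).argmax())  # -> 1
```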
In a dynamic structure committee, the input signal acts directly on the mechanism that
combines the output results. One of the most common models of dynamic structure com-
mittees is called a mixture of experts (Jacobs et al., 1991). In that model, several classifiers
are trained in different regions of the input data. The switching between the regions of the
input data and the models to be used in those regions is carried out through the interaction
of the input signal.
2.5. Deep Learning

More recent approaches overcome the limitations of shallow models by employing networks with many layers. For this reason, these approaches are called Deep Learning (Goodfellow et al., 2016).
Today the field of research in Deep Learning is extremely active, mainly because techniques from this area are achieving the best results in tasks such as image classification, detection and localization, natural language processing, and speech and audio processing. Among the models known in the Deep Learning scenario, convolutional and Long Short-Term Memory (LSTM) networks have been obtaining the best results in most fields of research in machine learning. Today, big companies like Google, Baidu, Facebook and others use these types of models in their main systems.
The concept of Deep Learning involves a cascade of non-linear transformations, using end-to-end learning with supervised, unsupervised or probabilistic approaches, normally in a hierarchical fashion. The following sections briefly describe the operation of the most common models in handwriting recognition. A more detailed revision can be found in Goodfellow et al. (2016) and is also presented in Chapter 5.
Like MLP networks, convolutional networks are formed by several layers. Those layers may be arranged to sequentially perform similar or different functions. The first layer of a CNN is usually a convolution layer. That layer is responsible for receiving the input image with dimensions N1xM1xD. In it, the convolution operations are carried out with the aid of a filter of dimensions N2xM2xD, where the weights of the network reside. This operation results in a map representing the extracted features of the image.
In CNNs, considering the spatial domain, the convolution operation consists of displacing a mask of weights over the image, following a certain orientation and direction, where, for each position of the displacement, the inner product between the elements of the mask and the elements of the image region under the mask is calculated. At each offset, an activation value is generated and assigned to the feature map at the position relative to the center of the mask over the image. Each depth level of the filter used corresponds to a depth level in the resulting feature map. In this first step, the resulting feature map tends to specialize in low-level features found in image objects, such as small edges and curves. In the next stage of execution of the network, the feature map is passed as input to other layers, where the representations gain new levels of abstraction.
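The sliding inner product just described can be written in a few lines; below is a minimal sketch for a grayscale image (depth D = 1) in "valid" mode, with an illustrative edge-detecting mask of our own choosing:

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a mask of weights over the image and compute, at each offset,
    the inner product between the mask and the region under it."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy usage: a horizontal-gradient mask responds to the vertical edge.
img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 0, 1, 1]], dtype=float)
edge_mask = np.array([[-1.0, 1.0]])
print(conv2d_valid(img, edge_mask))  # strong response at the edge column
```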
In addition to the convolution layer, CNNs have other types of layers as the network becomes deeper. Generally, a convolution layer is accompanied by an activation layer that limits the values passed as input to the other layers. The most commonly used activation function in convolutional networks is the Rectified Linear Unit (ReLU) (Glorot et al., 2011). This function has brought interesting results in the training phase and in the prevention of overfitting, compared to previously used functions such as the hyperbolic tangent and the sigmoid function.
Another common and important layer within the CNN universe is the Max-Pooling
(Huang et al., 2007) layer that promotes a downsampling of the input data by selecting only
the largest value within a neighborhood of the image. Pooling helps render representations
of images more invariant to operations such as translation in input data.
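A minimal sketch of non-overlapping max-pooling (our own illustration):

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Downsample by keeping only the largest value within each
    size x size neighborhood of the feature map."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size          # crop to a multiple of size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

fm = np.arange(16, dtype=float).reshape(4, 4)
print(max_pool2d(fm))  # 2x2 output holding the max of each 2x2 block
```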
The last layer of a network is usually a fully-connected layer, where the feature maps are concatenated into a single vector and passed to the layer that returns the probabilities of the instance belonging to each previously trained class. Training of a convolutional network is performed using the backpropagation algorithm (Rumelhart et al., 1988). In that algorithm, the classification result of the network is compared with the label of the example's class, and the classification error is then backpropagated toward the previous layers so that their weights are updated.
The LSTM architecture is very similar to the other models of the recurrent family, differentiating itself precisely by the use of the memory cell structure.

A memory cell is composed of elements that allow access to, and regulate the sharing of, information between neurons in the classification of a sequence. Its components are an input gate, a neuron with a self-recurrent connection, a forget gate and an output gate. The self-recurrent connection ensures that interference from the outside is ignored, allowing the state of a memory cell to be kept constant from one time step to another. The input gate controls the interactions between the memory cells and the environment; it may block or allow input signals to change the state of the memory cell. The output gate can allow or prevent the state of a memory cell from having an effect on other neurons. Finally, the forget gate controls the recurrent connection of the memory cell, which allows the cell to remember or forget its previous state when needed.
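The gate mechanics can be summarized by one cell update step; the sketch below is the common textbook formulation, not code from the chapter, and the sizes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, params):
    """One step of a standard LSTM memory cell: input, forget and output
    gates plus a candidate state. `params` holds one (W, U, b) triple per
    gate, in the order input, forget, output, candidate."""
    (Wi, Ui, bi), (Wf, Uf, bf), (Wo, Uo, bo), (Wc, Uc, bc) = params
    i = sigmoid(Wi @ x + Ui @ h_prev + bi)        # input gate
    f = sigmoid(Wf @ x + Uf @ h_prev + bf)        # forget gate
    o = sigmoid(Wo @ x + Uo @ h_prev + bo)        # output gate
    c_tilde = np.tanh(Wc @ x + Uc @ h_prev + bc)  # candidate state
    c = f * c_prev + i * c_tilde                  # forget or remember, then write
    h = o * np.tanh(c)                            # expose (or not) the state
    return h, c

# Toy usage with input size 3 and hidden size 2.
rng = np.random.default_rng(0)
params = [(rng.normal(size=(2, 3)), rng.normal(size=(2, 2)), np.zeros(2))
          for _ in range(4)]
h, c = lstm_step(rng.normal(size=3), np.zeros(2), np.zeros(2), params)
```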
In essence, the fundamentals remain the same, the difference being the organization of the machines, either inside the model itself (in the case of deep learning) or externally (if we consider the ensembles). That fact turns out to be logical, although a bit controversial, when we remember that the neurons or basic processing units have not changed; the representation and distribution of knowledge over the network have been altered in deep learning, but not its intrinsic concept: values and weights run forward and backward to adjust the model. This preamble does not obscure the virtue of recent methods, nor is it intended to do so. It is only meant to make this bridge and give credit to their predecessors.
Indeed, there are several improvements and interesting ideas in these methods. Deep learning has the appealing motivation of removing feature engineering, which is one of the greatest difficulties when working with neural networks. It is known that if we have good features, almost any classifier could discriminate the data. However, finding the best set of features, or at least a good one, may be a hard task. After all, although humans are very proficient in reading, we do not necessarily know what characteristics our brain uses to recognize, for example, numbers. Thus, we can only imagine what features are relevant and discriminant, and the performance of learning algorithms, in this case, also depends on the quality of the features. Therefore, in deep learning, the features themselves are learned and coded inside the model and, consequently, the designer of the network does not need to worry about that. Of course, the network architecture remains the designer's work, although it is claimed that the impact of architecture is somewhat reduced in deep models; in other words, drastic variations are not expected due to small modifications of the network structure.
On the other hand, in ensembles of classifiers, the idea is that several experts may propose better solutions to a given problem than any single one of them. It is not hard to see that, if we have more specialized agents, each one covering some area, region or subject, mainly when the problem is too complicated, they will tend to cover the whole spectrum of that matter more easily. In addition, if we also have an efficient mechanism to merge or select the best answer(s), more precise outcomes tend to be achieved. Thus, since each expert knows its sub-area and therefore provides meaningful or suboptimal answers within its localized knowledge, a "conference" of an expert group brings out the response for the given input, which in the best scenario is the globally optimal answer.
3. Applications
3.1. Digits
There are various proposals in the literature for the specific task of recognizing handwritten Hindu-Arabic digits only. That specialization generally simplifies the problem, since, in a broader sense, digits may be regarded as characters. First, the number of classes is reduced to ten (digits vary from zero to nine), and consequently there are fewer confusion possibilities, because only some pairs of digits are intrinsically similar. Moreover, the natural intra-class variability of digits is smaller than that of characters. In addition to those implementation issues, there are real-world problems in which only digits need to be handled, such as automatic postal addressing, the processing of courtesy amounts in bank checks, or the processing of dates or page numbers in documents to provide automatic search and indexing. Thus, when one focuses on applications such as those, it makes sense to adopt recognition strategies specially designed to work on digits.
At this point, we need to remember that these modifications are not always related to the core of the method. For example, neural networks are used to classify digits, characters, or both at the same time, and the algorithm per se is not modified; however, different architectures may be used to deal with each situation. That also holds true if a principle such as Occam's razor is considered (i.e., if a simpler method suffices, it is generally the best solution). So, if a simpler architecture can handle digits, it is not necessary to use a more complex solution designed to handle characters in general.
Another question regarding digits as characters for recognition is the string length. Most of the papers in the literature focus on isolated digits; in other words, the digits need to be segmented. Segmentation is one of the most complicated tasks in document processing and is another prolific research area; due to the scope of this chapter, we do not address that issue. However, there are methods that are applied to a numeric string as a whole, without performing a segmentation step, analogous to the case of word recognition. In Chapter 3, some issues of document processing and image segmentation are investigated and the principal literature works are addressed.
However, as we illustrate in Sections "Deep Learning" and "Committees", these are relatively complex models, which require more samples, longer training, higher computational cost, and more effort. As we show in Section "Digits", other simpler methods can achieve interesting results, not as accurate, but with much less effort in training and in the classification itself.
3.2. Characters
There are several factors that make character recognition a more complex task than digit
recognition. In this scenario, one takes into account the existence of a greater number
of classes to be recognized (this set can be composed of digits, uppercase, lowercase or
accented characters, punctuation and symbols). The variation of the calligraphy of the
individuals during the act of writing, the high similarity between distinct characters, as well
as the change of the style of writing over time, are characteristics that also make the process
of recognition of handwritten characters a complex activity. It is, therefore, noticeable that
there is a high level of variability of the instances that can be correctly assigned to the same
class. On the other hand, the high similarity between some distinct characters also increases
the occurrence of false positives.
The task becomes even more challenging because it is usually specific to each domain and application. Thus, the techniques established for the recognition of a certain family of characters, such as Latin/Roman, cannot be applied in the same way to characters of different origins, such as Indic, Chinese or Arabic scripts.
That variability in writing makes the field of research related to the recognition of handwritten characters extensive, so the number of problems that must be addressed in order to obtain satisfactory results in those various scenarios is enormous. That fact contributes to the existence of several ramifications of research in the area. In today's academic and industrial environment, research can be found on each of the various stages of a recognition system.
Among those stages, the final phase of character classification received a significant
gain with the introduction of Deep Learning techniques. Today, in the offline character
recognition task, the main approaches use deep multidimensional networks and deep neural
network committees.
4. Selected Works
In this section, we introduce some of the most relevant papers in handwriting recognition.
In this scenario, we consider works focusing on digits, characters, or both. Some papers also cover other applications, although those are out of the scope of this chapter. Because of this, we decided not to split this section into separate subsections for each subarea (as done in Sections "Applications" and "Results"). The paper selection was guided by the reported accuracy on benchmark databases, which is an interesting criterion since this measure tends to be a good indicator of the merit of the techniques.
The inclusion of virtual examples makes the training time larger. Nevertheless, the authors try to diminish that influence by employing heuristics which make training more efficient, even with the inclusion of invariances.
At this point, to demonstrate the potential of using virtual examples (see Figure 8), suppose we have prior knowledge indicating that the decision function should be invariant to horizontal translations. The true decision boundary is given by the dotted line in the first frame (top left). However, due to different training examples, different separating hyperplanes are entirely possible (top right). SVM would calculate the optimal hyperplane, as shown in Section 2.3 (bottom left), which is very different from the true boundary. In that case, the ambiguous point, denoted by the question mark, would be wrongly classified. The use of prior knowledge and the consequent generation of virtual support vectors (VSV) yields a much more accurate decision boundary (bottom right), and leads to the correct classification of the ambiguous point.
Figure 8. More accurate decision boundary by virtual support vectors (from (Decoste and
Schölkopf, 2002)).
Specifically, they developed two main related heuristics over the SMO (Sequential Minimal Optimization) algorithm presented by Platt (1999) (although implementation and tests were conducted over an enhanced version of SMO, described by Keerthi et al. (2001)): (i) maximization of cache reuse, and (ii) digestion, the reduction of the intermediate SV bulge. Cache reuse is important because most of the time spent in training is due to kernel matrix calculations; besides, it is common that the same value is required many times. Therefore, if those values are stored in a cache, redundant calculation time is saved. Digestion takes place when additional support vectors are no longer able to cache their kernel computations, in other words, when the support vector set exceeds the cache size. That issue is more severe in the case of virtual examples, since much more data is generated (the intermediate SV bulge). The basic idea is to jump out of full SMO iterations early, once the working candidate support vector set grows by a large amount. Digestion allows better control of the intermediate SV bulge size, besides enabling a trade-off between the cost of overflowing the kernel cache and the cost of doing as many inbound iterations as standard SMO would.
The bottom rows show the respective displacement grids generated to obtain the transformed images. The first two rows present results for digits belonging to different classes, and the latter two for digits of the same class. The examples on the left consider only image gray values; those on the right show the results using local context for matching (via the first derivative, obtained by Sobel filtering).
Figure 9. Nonlinear matching applied to digits images (after (Keysers et al., 2007)).
It is possible to observe that, for the same class, the models with more restrictions produce inferior matches compared to the models with less restrictive matching constraints (see the left side). In the case of local context (right side), notice that the matchings for the same class remain very accurate, while the matchings for different classes are visually not as good as before (especially for models with fewer constraints, such as the IDM). Note also that the displacement grid is more homogeneous for matchings of the same class.

Thus, using this kind of artificially generated example led to better results in handwritten digit recognition. Specifically, in conjunction with local context, the deformation model which obtained the best result was the P2DHMDM (Keysers et al., 2004a)(Keysers et al., 2004b) (standing for pseudo-two-dimensional hidden Markov distortion model). Another deformation model using pixel local context which achieved competitive results was the IDM, a simpler model, allowing a trade-off between complexity and accuracy.
This raises the question: are all these complications really necessary? Can't one simply train a really big plain MLP on MNIST?
Initial thinking may indicate that deep MLPs do not seem to work better than shallow networks (Bengio et al., 2007). Training them is hard because backpropagated gradients vanish exponentially in the number of layers (Hochreiter et al., 2001). A serious problem affecting the training of big MLPs was processing power: training that kind of structure is unfeasible on conventional CPUs. Because of that, Cireşan, Meier and Schmidhuber (Cireşan et al., 2010) also make use of graphics processing units (GPUs), which permit fine-grained parallelism. In addition, the network is trained on slightly deformed images, continually generated online, i.e., created in each iteration; hence, the whole undeformed training set is available for validation, without wasting training images.
Detailing the strategies used, training is performed using standard online backpropagation (Russel and Norvig, 2010), without momentum, but with a variable learning rate. Weights are initialized with a uniform random distribution, and the activation function of each neuron is a scaled hyperbolic tangent (after (Lecun et al., 1998)). The image deformations were: elastic distortions (Simard et al., 2003); an angle for either rotation or horizontal shearing; and horizontal and vertical scaling.
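A rough sketch of the elastic-distortion idea, following Simard et al. (2003) in spirit (scipy is assumed to be available, and the smoothing parameters here are arbitrary):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def elastic_distort(image, alpha=8.0, sigma=3.0, seed=0):
    """Displace each pixel by a random field smoothed with a Gaussian
    filter, then resample the image at the displaced coordinates."""
    rng = np.random.default_rng(seed)
    h, w = image.shape
    # Random displacement fields; smoothing makes nearby pixels move together.
    dx = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    dy = gaussian_filter(rng.uniform(-1, 1, (h, w)), sigma) * alpha
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    return map_coordinates(image, [ys + dy, xs + dx], order=1, mode="reflect")

# Usage: distorting each image anew every epoch yields a virtually
# infinite supply of training examples.
digit = np.zeros((28, 28)); digit[8:20, 13:15] = 1.0  # crude stand-in digit
augmented = elastic_distort(digit)
```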
Some MLP architectures were investigated (Cireşan et al., 2010). The one which yielded the best results was 784, 2500, 2000, 1500, 1000, 500, 10, each number meaning the number of neurons of a layer: the first is the input layer (784 neurons, because the input images are 28x28) and the last is the output layer (10 neurons, since the problem at hand is digit recognition), totaling 12.11 million weights. Another interesting point about the training procedure is the speedup obtained by the use of GPUs: the deformation routine was accelerated by a factor of 10, and forward and backward propagation were sped up by a factor of 40.
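For illustration only, the best-performing architecture could be expressed in a few lines with a modern library; this sketch uses Keras, which is not what the original authors used (they wrote custom GPU code), and the optimizer settings are our own placeholders:

```python
from tensorflow import keras
from tensorflow.keras import layers

# The 784-2500-2000-1500-1000-500-10 plain MLP described above.
model = keras.Sequential([
    layers.Input(shape=(784,)),
    layers.Dense(2500, activation="tanh"),
    layers.Dense(2000, activation="tanh"),
    layers.Dense(1500, activation="tanh"),
    layers.Dense(1000, activation="tanh"),
    layers.Dense(500, activation="tanh"),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=keras.optimizers.SGD(learning_rate=1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()  # roughly 12 million weights, matching the count above
```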
The experiments performed proved that simple plain deep MLPs can be trained: even the huge number of weights could be optimized with gradient descent, achieving test errors below 1% after 20 to 30 epochs in less than two hours of training. In part, the explanation comes from the continual deformation of the training set, which generates a virtually infinite supply of training examples, so the network rarely sees any training image twice; in normal backpropagation training on a fixed set, by contrast, the same images are presented indefinitely, which can saturate the network.
connected with another convolution layer containing 40 maps of 9x9 neurons each. The last max-pooling layer reduces the map size to 3x3, using 3x3 filters. A fully connected layer of 150 neurons is connected to the max-pooling layer. The output layer has one unit per class: 62 neurons for characters and 10 for digits. All CNNs are trained in fully online mode with an annealed learning rate and continually deformed data (elastic deformation, rotation, and horizontal and vertical scaling, as in (Cireşan et al., 2010)). GPUs were also used to accelerate the whole training procedure.
Experiments were performed on the original and six preprocessed data sets. Preprocessing was motivated by the different aspect ratios of characters caused by writing style variations. The width of all characters was normalized to 10, 12, 14, 16, 18 and 20 pixels, except for the characters "1", "i", "I" and "l", and the original data was kept as well (Cireşan et al., 2011b). Figure 10 illustrates the training and testing strategy. Training is shown in item (a): each network is trained separately and normalization is done prior to training. During each training epoch, every character is distorted in a different way, and the data is fed to the network. The committees are formed by averaging the corresponding outputs (item (b)).
For each of the datasets (original or normalized), five CNNs with different initializations are trained for the same number of epochs, resulting in 35 trained CNNs. Consequently, it is possible to analyze the output errors for the 5^7 = 78125 possible committees formed by selecting, for each of the seven data sets, one of its five nets.
Figure 10. Classification strategy, (a) training a committee member, (b) testing with a
committee (from (Cireşan et al., 2011a)).
Therefore, simple preprocessing of the training data led to experts with less correlated errors than those of different nets trained on the same bootstrapped data. Thus, simply averaging the experts' outputs considerably improved recognition rates. This was credited as the first time automatic recognition came really close to human performance (Lecun et al., 1995)(Kimura et al., 1997).
Cireşan et al. (2012) proposed a multi-column approach for Deep Neural Networks (DNNs), in which small receptive fields of convolutional winner-take-all neurons yield large network depth, resulting in many sparsely connected neural layers. Only winner neurons are trained. The authors suggest that the several deep neural columns become experts on inputs preprocessed in different ways, and the final result is the average of their predictions.
The proposed architecture and its training and testing procedures are illustrated in Fig-
ure 11.
Figure 11. (a) DNN architecture. (b) MCDNN architecture. The input image can be pre-
processed by P0 − Pn−1 blocks. An arbitrary number of columns can be trained on inputs
preprocessed in different ways. The final predictions are obtained by averaging individual
predictions of each DNN. (c) Training a DNN (from Cireşan et al. (2012)).
The authors combine several techniques to iteratively train the DNNs in a supervised way. They use hundreds of maps per layer, with many (6-10) layers of non-linear neurons stacked on top of each other. The overlapping receptive fields of the 2-dimensional layers share weights, and a winner-take-all scheme is applied: given some input pattern, a max-pooling technique determines the winning neurons, selecting the most active neuron of each region. The winners of a layer form a smaller, down-sampled layer with lower resolution, feeding the next layer in the hierarchy (Cireşan et al., 2012). The receptive fields and winner-take-all regions use 2x2 or 3x3 neurons. The DNN columns are combined to form a Multi-column DNN (MCDNN). Given some input pattern, the predictions of all columns are averaged:
$$y_i^{\mathrm{MCDNN}} = \frac{1}{\#\text{columns}} \sum_{j=1}^{\#\text{columns}} y_i^{\mathrm{DNN}_j} \qquad (1)$$
The experiments are performed on the MNIST, NIST SD 19, Chinese characters, NORB, traffic signs and CIFAR-10 datasets. The authors claim this was the first time human-competitive results were achieved on widely used computer vision benchmarks. As can be seen in Table 1, the obtained results are impressive: on many image classification datasets, the MCDNN improves the state of the art by 26-80%.
* Letters, with 52 classes. For global results, see Cireşan et al. (2012), Table 4.
Experiments are evaluated on the UNIPEN lowercase and uppercase datasets, with recognition rates of 93.7% for uppercase and 90.2% for lowercase. All uppercase or lowercase samples are randomly divided into 3 subsets; training is done on the first 2 subsets, while 33% and 67% of the 3rd subset serve as validation set and test set, respectively. Training is repeated 3 times.
5. Literature Results
5.1. Digits
Here, we present some of the most recent and relevant results on single digit recognition (summarized in Table 2). For each method listed in this table, we indicate the underlying model, according to what is explained in Section 2, the features used or their absence (denoted by N/A, standing for not applicable), and the error rate (in percentage) on the MNIST database.
Table 2 shows that there are several interesting results for digit recognition. In this context, the merit is twofold: (i) error rates are very low and have reached human performance (Cireşan et al., 2012); and (ii) there is not only one kind of model that reaches good accuracy. Accuracy is the fundamental objective of a recognizer, and therefore those rates reaffirm that there are methods able to handle handwritten digits to a certain extent (of course, this is a reference database, and different problems and tricky examples appear all the time). Different applications, practical issues of implementation or model training, and even the technical team's skills may favor one technique over another. For instance, if one does not possess suitable hardware, the use of deep learning is unfeasible (graphics cards are still expensive today). Moreover, even when the hardware is available, people still need specific programming skills, such as parallel programming, CUDA, and other software libraries. In such a situation, the use of kNN or SVM-based methods may be an appropriate choice if a possible accuracy loss can be tolerated. Another factor may be the time to market, since a simpler model may be implemented quickly; depending on deadline requirements, such a technique may turn out to be more adequate.
In addition to this, one may think the results shown above indicate that digit recognition is solved. However, we need to remember that this scenario considers databases with "clean" digits; in other words, those images have no noise or other kinds of artifacts that could impair the classification process itself. Another aspect relates to the data distribution in the datasets. In MNIST, for instance, while the training set has 60,000 samples, the common practice is to test on a set of 10,000 samples. At first sight, this number appears to be significant. But, if we remember the huge quantity of images and data processed daily, and consequently the uncountable number of digits to be treated, this amount may be insufficient from a practical or business point of view (of course, this fact does not invalidate the findings of the area over this database). The use of other datasets such as NIST SD19 is welcome but lacks standardization, since it does not provide a default or suggested division of data between training and test. That may cause misinterpretation or confusion of results, because the algorithms are not necessarily evaluated using the same data partition.
Therefore, we think new databases are needed, mainly considering the quantity of data required for training deep learning models. In this context, the number of examples should be increased, as well as the variation across the data sets. There is also a latent need for scrutinized evaluation of the methods, since many papers do not run statistical tests and, in various cases, the difference between them is minimal. Of course, this does not disregard the proposition of new techniques and, obviously, we are not stating that new methods are only valuable when they overcome the performance of others in terms of error rate, since there are several other factors in the merit of published works, such as computational cost, speed of convergence, simplicity or complexity of the models, etc.
5.2. Characters
In this section, we present some of the most recent and relevant results on character recognition (summarized in Table 3). For each method listed in this table, we indicate the underlying model (according to what is explained in Section 2), the dataset used in the experiments, and the error rate (in percent) for the classification of uppercase and lowercase letters.
Table 3 shows that the best methods for the classification of isolated handwritten characters are those based on convolutional networks. As in digit classification, committees of convolutional networks obtained by far the best results, despite the greater difficulty of the task. The biggest cause of classification error is the ambiguity between characters, such as lowercase ‘l’ and uppercase ‘I’, which are impossible to distinguish if context information is not taken into account.
However, it is worth noting that the problem of character classification has not yet been completely solved, since the good results were obtained on very specific subgroups of data. In handwritten character classification, results are better when the networks are trained and tested on datasets with specific characteristics, where the images share the same size, width and variation pattern. When a test set with greater variation and more classes is used to validate a method, the results are somewhat lower: the greater the number of classes in the problem at hand, the higher the chance of ambiguity and classification error. We can see this in (Cireşan et al., 2012) and (Cireşan et al., 2011a), where the results obtained when validating on a set containing both upper and lowercase characters are worse than those obtained on each set individually, as shown above. The results could also be quite different on different databases.
Observing those problems, we note that there are new challenges to be addressed, since the reported results include no accented characters, symbols or punctuation. In other alphabets, the number of characters is much greater than in the Latin alphabet, so new strategies must be elaborated for those cases. The number of papers on the classification of Latin manuscripts has been found to be much larger than the number of papers involving other families of handwritten characters. Thus, more research is needed in this area, including the tackling of the two problems mentioned above (which, although similar, have some differences). Ultimately, we are interested in recognizing all characters, since they represent the document content.
Thus, we think those questions need to be considered when proposing new models or designing new methods with existing models.
Acknowledgments
The authors acknowledge Document Solutions for sponsoring this research. The authors also thank CNPq for supporting the project under the grant “Bolsa de Produtividade DT” (Process 311338/2015–1).
References
Bengio, Y., Pascal, L., Popovici, D., and Larochelle, H. (2007). Greedy layer-wise training
of deep networks. In Schölkopf, P. B., Platt, J. C., and Hoffman, T., editors, Advances
in Neural Information Processing Systems 19, pages 153–160. MIT Press, Cambridge,
MA, USA.
Bengio, Y., Simard, P., and Frasconi, P. (1994). Learning long-term dependencies with
gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166.
Bortolozzi, F., Britto Jr, A. S., de Oliveira, L. E. S., and Morita, M. (2005). Recent ad-
vances in handwriting recognition. In Pal, U., Parui, S. K., and Chaudhuri, B. B., editors,
Document Analysis, pages 1–30.
Braga, A. P., Carvalho, A. P. L. F., and Ludermir, T. B. (2007). Redes Neurais Artificiais.
LTC, Rio de Janeiro, second edition.
Cireşan, D., Meier, U., and Schmidhuber, J. (2012). Multi-column deep neural networks for
image classification. In IEEE Conference on Computer Vision and Pattern Recognition,
pages 3642–3649, Providence.
Cireşan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2010). Deep, big,
simple neural nets for handwritten digit recognition. Neural Computation, 22(12):3207–
3220.
Cireşan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2011a). Convolu-
tional neural network committees for handwritten character classification. In Interna-
tional Conference on Document Analysis and Recognition, pages 1135–1139, Beijing.
Cireşan, D. C., Meier, U., Gambardella, L. M., and Schmidhuber, J. (2011b). Handwritten
digit recognition with a committee of deep neural nets on gpus. Technical Report IDSIA-
03-11, Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, Manno, Switzerland.
Decoste, D. and Schölkopf, B. (2002). Training invariant support vector machines. Machine
Learning, 46(1-3):161–190.
Duda, R. O., Hart, P. E., and Stork, D. G. (2000). Pattern Classification. Wiley-Interscience,
New York, second edition.
Freund, Y. and Schapire, R. E. (1996). Experiments with a new boosting algorithm. In International Conference on Machine Learning (ICML), volume 96, pages 148–156.
Glorot, X., Bordes, A., and Bengio, Y. (2011). Deep sparse rectifier neural networks. In Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 15 of JMLR W&CP, pages 315–323.
Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep Learning. MIT Press, Cam-
bridge, MA, USA.
Hagan, M. T. and Menhaj, M. B. (1994). Training feedforward networks with the marquardt
algorithm. IEEE Transactions on Neural Networks, 5(6):989–993.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning.
Data Mining, Inference and Prediction. Springer Series in Statistics. Springer, New York,
second edition.
Haykin, S. (2009). Neural Networks and Learning Machines. Prentice-Hall, Upper Saddle
River, third edition.
Hochreiter, S., Bengio, Y., Frasconi, P., and Schmidhuber, J. (2001). Gradient flow in
recurrent nets: The difficulty of learning long-term dependencies. In Kramer, S. C. and
Kolen, J. F., editors, A field guide to dynamical recurrent neural networks. IEEE Press,
Piscataway, NJ, USA.
Huang, F. J., Boureau, Y.-L., LeCun, Y., et al. (2007). Unsupervised learning of invariant
feature hierarchies with applications to object recognition. In 2007 IEEE conference on
computer vision and pattern recognition, pages 1–8. IEEE.
Hubel, D. H. and Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional
architecture in the cat’s visual cortex. The Journal of physiology, 160(1):106–154.
Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of
local experts. Neural computation, 3(1):79–87.
Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., and Murthy, K. R. K. (2001). Improve-
ments to platt’s smo algorithm for svm classifier design. Neural Computation, 13(3):637–
649.
Keysers, D., Deselaers, T., Gollan, C., and Ney, H. (2007). Deformation models for
image recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence,
29(8):3207–3220.
Keysers, D., Gollan, C., and Ney, H. (2004a). Classification of medical images using
non-linear distortion models. In Tolxdorff, T., Braun, J., Handels, H., Horsch, A., and
Meinzer, H., editors, Bildverarbeitung für die Medizin 2004: Algorithmen — Systeme —
Anwendungen, pages 366–370. Springer Berlin Heidelberg, Berlin.
Keysers, D., Gollan, C., and Ney, H. (2004b). Local context in non-linear deformation mod-
els for handwritten character recognition. In 17th International Conference on Pattern
Recognition, pages 511–514, Cambridge, UK.
Kimura, F., Kayahara, N., Miyake, Y., and Shridhar, M. (1997). Machine and human recog-
nition of segmented characters from handwritten words. In International Conference on
Document Analysis and Recognition, pages 866–869, Ulm, Germany.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep
convolutional neural networks. In Advances in neural information processing systems,
pages 1097–1105.
LeCun, Y. and Bengio, Y. (1995). Convolutional networks for images, speech, and time series. In Arbib, M. A., editor, The Handbook of Brain Theory and Neural Networks. MIT Press, Cambridge, MA, USA.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
LeCun, Y., Jackel, L. D., Bottou, L., Cortes, C., Denker, J. S., Drucker, H., Guyon, I., Müller, U. A., Säckinger, E., Simard, P., and Vapnik, V. (1995). Learning algorithms for classification: A comparison on handwritten digit recognition. In Oh, J. H., Kwon, C., and Cho, S., editors, Neural Networks: The Statistical Mechanics Perspective, pages 261–276. World Scientific.
Mello, C. A. B., Olivera, A. L. I., and Santos, W. P. (2012). Digital Document Analysis and
Processing. Nova Science Publishers, Inc., Commack, NY, USA.
Platt, J. C. (1999). Fast training of support vector machines using sequential minimal op-
timization. In Schölkopf, B., Burges, C. J. C., and Smola, A. J., editors, Advances in
Kernel Methods, pages 185–208. MIT Press, Cambridge, MA, USA.
Riedmiller, M. and Braun, H. (1993). A direct adaptive method for faster backpropagation
learning: the rprop algorithm. In IEEE International Conference on Neural Networks,
pages 586–591, San Francisco.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986a). Learning internal rep-
resentations by error propagation. In Rumelhart, D. E., McClelland, J. L., and PDP
Research Group, C., editors, Parallel Distributed Processing: Explorations in the Mi-
crostructure of Cognition, Vol. 1, pages 318–362. MIT Press, Cambridge, MA, USA.
Russell, S. and Norvig, P. (2010). Artificial Intelligence: A Modern Approach. Prentice-Hall, Upper Saddle River, third edition.
Simard, P. Y., Steinkraus, D., and Platt, J. C. (2003). Best practices for convolutional
neural networks applied to visual document analysis. In Proceedings of the Seventh
International Conference on Document Analysis and Recognition-Volume 2, pages 958–,
San Mateo, CA. IEEE Computer Society.
Vapnik, V. (2000). The Nature of Statistical Learning. Springer, New York, second edition.
Yuan, A., Bai, G., Jiao, L., and Liu, Y. (2012). Offline handwritten english character recog-
nition based on convolutional neural network. In Document Analysis Systems (DAS),
2012 10th IAPR International Workshop on, pages 125–129. IEEE.
In: Handwriting: Recognition, Development and Analysis ISBN: 978-1-53611-937-4
Editors: Byron L. D. Bezerra et al.
© 2017 Nova Science Publishers, Inc.
Chapter 2
THRESHOLDING

Edward Roe1,∗ and Carlos Alexandre Barros de Mello2,†
1 CESAR - Centro de Estudos Avançados do Recife, Recife, Brazil
2 Centro de Informática, Universidade Federal de Pernambuco, Recife, Brazil
1. Introduction
Thresholding can be seen as a classification problem where, usually, there are two classes
(Mello, 2012). In particular, for document images, it is expected that a thresholding algo-
rithm correctly classifies the ink (the foreground) and the paper (the background). If to the
ink is attributed the black color and to the paper the white color, the result will be a bi-
level image or a binarized image. This is why thresholding is also known as binarization.
Considering, for example, a digital grayscale image, the process is quite simple: given a
threshold value, says th, the colors above this value are converted into white, while the col-
ors below are converted into black, separating both classes. The problem is to correctly find
the threshold value that makes a perfect match for foreground and background elements.
This is the major concern of thresholding algorithms. The problem is much more complex
when we deal with natural scenes images (as the concept of foreground and background is
not so clear). For document images, it is easier to understand what is the expected result
although there are several issues that make this domain so difficult as aging degradations
like foxing (the brownish spots that form on the paper surface), back-to-front ink interfer-
ence, illumination artifacts (like shadows, due to acquisition process), crumpled paper and
adhesive tape marks. Some of these problems can be seen in Figure 1
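To make the mechanics concrete, here is a minimal sketch of the global thresholding just described, using NumPy (the function name and the sample array are illustrative):

```python
import numpy as np

def global_threshold(gray, th):
    """Binarize a grayscale image: tones above th become white (255),
    tones at or below th become black (0)."""
    return np.where(np.asarray(gray) > th, 255, 0).astype(np.uint8)

# Example: a tiny synthetic "image" binarized with th = 128
img = np.array([[10, 200, 120], [130, 90, 250]], dtype=np.uint8)
print(global_threshold(img, 128))
# [[  0 255   0]
#  [255   0 255]]
```

The entire difficulty, as discussed above, lies not in applying th but in choosing it.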
For document images, thresholding is useful as the first step for several processes, such as skew estimation (Brodić et al., 2014) and correction (Ávila and Lins, 2005), line segmentation and text extraction (Sánchez et al., 2011), word spotting (Almazán et al., 2014), etc.
∗ E-mail address: [email protected].
† E-mail address: [email protected].
Figure 1. Examples of problems caused by the ageing process, such as (top right and bottom right) foxing and (bottom left and middle right) back-to-front interference, and by human manipulation, such as (top left) adhesive tape and (bottom right) crumpled paper.
Moreover, several algorithms for character recognition work with bi-level images (Mello, 2012). An incorrect thresholding can harm all of these further processes. This can be seen in Figure 2, with samples of correct and incorrect thresholding.
Figure 2. (left) Original document image in grayscale; (center) result after a correct separation of background and foreground; and (right) a threshold value too high converted many gray tones into black, misclassifying some of them and making it hard to read some words without knowledge of the original document.
There are several different features that can be used to try to provide the correct separation between tones. Moreover, a thresholding algorithm can be applied globally (a single threshold value is used for the complete image) or locally (the image is divided into regions, each with its own threshold value and possibly even a different algorithm). As examples of global algorithms we can cite Otsu, Pun, Kapur, Renyi, two peaks, and percentage of black; local algorithms can be exemplified by the Sauvola, Niblack, White and Bernsen algorithms. For a general view of all these methods and many more, we suggest the survey by Sezgin and Sankur (Sezgin et al., 2004).
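As Otsu's method is referenced repeatedly in this chapter, a minimal sketch of it may be helpful: the threshold is chosen to maximize the between-class variance of the gray-level histogram (the implementation below is a straightforward, unoptimized rendering of that idea):

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold maximizing the between-class variance
    of the 256-bin gray-level histogram (Otsu, 1979)."""
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    levels = np.arange(256)
    best_th, best_var = 0, 0.0
    for th in range(1, 256):
        w0, w1 = prob[:th].sum(), prob[th:].sum()   # class probabilities
        if w0 == 0 or w1 == 0:
            continue
        mu0 = (levels[:th] * prob[:th]).sum() / w0  # class means
        mu1 = (levels[th:] * prob[th:]).sum() / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_th = var_between, th
    return best_th  # pixels below best_th -> black, others -> white
```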
Entropy, statistical properties, stroke width estimation and histogram shape are just a
few examples of common features that can be used to define the threshold value. In the
rest of this chapter, we present more recent algorithms for the problem with more unusual
approaches.
It is also important to observe that by document we mean any kind of paper with stored information; we generalize the concept to make comprehension easier. Thus, by document we mean letters (whether handwritten or typewritten), book pages, forms, topographic maps, floor plans, blueprints, sheet music, postcards, and bank checks. Of course, in most of the chapter we present examples applied to letters or usual documents. Section “Graphical Documents” deals with other types of documents, highlighting the major features that make them unique.
The next three sections present algorithms with different kinds of approaches. As said before, Section “Graphical Documents” introduces the problem for unusual types of documents. In Section “Thresholding Evaluation”, we discuss the problem of automatic evaluation of binarization, followed by the conclusions of the chapter.
The candidate text stroke edge pixels are detected as the ones having either the maximum Vh(x, y) or the maximum Vv(x, y). The real text stroke edge pixels are then detected by applying Otsu's global thresholding method (Otsu, 1979) to the local image variation of the detected candidate stroke edge pixels, whose histogram usually has a bimodal pattern. Once the text stroke edges are detected, the document text can be extracted based on the observation that the document text is surrounded by text stroke edges and also has a lower intensity level compared with the detected stroke edge pixels.
Finally, three post-processing operations, based on the estimated document background surface and some document domain knowledge, are applied to correct document thresholding errors:
• Remove text components of a very small size that often result from image noise such
as salt and pepper noise;
• Remove the falsely detected text components that have a relatively large size;
• Remove single-pixel holes, concavities, and convexities along the text stroke bound-
ary.
The method proposed by Roe and Mello (2013) makes use of local image equalization and an extension of the standard difference of Gaussians edge detection operator, XDoG. The binarization is achieved after three main steps:
1. First binarization:
A. Local image equalization
B. Binarization using the Otsu algorithm
2. Second binarization:
C. Global image equalization
D. Edge detection using XDoG
3. Cleanup and restoration:
The main goal of local equalization is to prepare the image for the final binarization. The idea is to change the intensity differences between pixels, emphasizing them at opposite sides of a sharp edge and minimizing them for pixels on soft edges.
The local equalization presented in Roe and Mello (2013) is performed through the following steps: first, the image is converted into values in the (0, 1] interval (the value 0 is converted into 0.01 to avoid division by zero); it is then scanned using an neq × neq window, and the highest pixel intensity is found in each window. The pixel intensity at the center of the window is divided by this maximum, and the result is placed in a new image at the same position. As the degraded documents are yellowish/brownish, better results are obtained considering only the red channel of the image. To increase the contrast after the equalization, the gamma function c = r^γ is applied to the entire image with γ = 1.1. Figure 3 shows the result of applying local equalization directly to the sample images shown in Figure 1. The window size neq has an impact on the resulting edge thickness; larger windows result in thicker edges.
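A minimal sketch of this local equalization, assuming scipy's maximum_filter for the sliding-window maximum (the window size used here is illustrative; only γ = 1.1 comes from the text):

```python
import numpy as np
from scipy.ndimage import maximum_filter

def local_equalization(red_channel, neq=21, gamma=1.1):
    """Divide each pixel of the red channel by the maximum intensity in its
    neq x neq neighborhood, then apply the gamma correction c = r**gamma."""
    r = red_channel.astype(np.float64) / 255.0
    r[r == 0] = 0.01                    # avoid division by zero, as in the text
    local_max = maximum_filter(r, size=neq)
    eq = r / local_max                  # values in (0, 1]
    return eq ** gamma                  # contrast boost with gamma = 1.1
```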
The Otsu binarization algorithm (Otsu, 1979) is used in this step to separate degenerated background regions from the text. In this process, some text may also be removed, but this is not a great problem, as the idea here is just to use the result of the Otsu method as a guide in a further step. A cleanup is performed on Otsu's result to remove some remaining noise; it simply removes contiguous areas of fewer than 10 pixels.
The global image equalization step, applied to enhance image contrast before the XDoG filter, is similar to the first step, but with two important differences: the equalization is done directly over the entire image, and the three channels, red (r), green (g) and blue (b), are used independently. The results from each channel are then combined.
The Difference of Gaussians (DoG) is an edge detector based on the subtraction of two versions of the original grayscale image, each blurred with a Gaussian filter, creating a band-pass filter which attenuates all frequencies that are not between the cutoff frequencies of the two Gaussians (Marr and Hildreth, 1980; Gonzalez and Woods, 2007). Better results were achieved using an extension of DoG, called XDoG (Winnemöller, 2011), given by Equation (2.4):

$$D_{\sigma, k, \tau}(x, y) = G_{\sigma}(x, y) - \tau \cdot G_{k\sigma}(x, y) \qquad (2.4)$$
where σ is the standard deviation, k is a factor relating the radii of the standard deviations of the two Gaussian functions G, and τ changes the relative weighting between the larger and smaller Gaussians. As the resulting image has too much noise, before noise removal the XDoG result (Bxdog) is combined with the result of the Otsu binarization (binOtsu) using Equation (2.5).
This combination is used to enhance the XDoG result without increasing the amount of noise. The noise from XDoG is removed using the Otsu binarization result as a reference mask. The idea is to keep, in the cleaned image, only the regions of the XDoG image that satisfy at least one of two conditions: they have more than 20 black (ink) pixels, or they match at least one black pixel in the Otsu binarized image.
Figure 4. (left) Original image and (right) the result obtained by Roe and Mello algorithm.
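For illustration, a sketch of one common form of the XDoG operator follows; the parameter values are illustrative, and the exact combination rule of Equation (2.5) is not reproduced here:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def xdog(gray, sigma=1.0, k=1.6, tau=0.98):
    """Difference of Gaussians with the XDoG reweighting: the smaller-sigma
    Gaussian minus tau times the larger-sigma one."""
    g = gray.astype(np.float64) / 255.0
    small = gaussian_filter(g, sigma)       # G_sigma
    large = gaussian_filter(g, k * sigma)   # G_{k*sigma}
    return small - tau * large              # band-pass edge response
```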
where

$$M_{SW}(P_k) = \frac{\sum_{i=-SW}^{SW} \sum_{j=-SW}^{SW} I(P_{kx} - i, P_{ky} - j)}{(2 \times SW + 1)^2} \qquad (2.11)$$
with I(x, y) being the intensity of pixel (x, y), Pkx and Pky the coordinates of Pk, and SW the stroke width as defined in Valizadeh et al. (2009). Thus, the pixel (x, y) is mapped into a 2D feature space, A = [A1, A2], where A1 = SC(x, y) and A2 = I(x, y). The level of separability reached by the structural contrast can be seen in Figure 6-left.
In the space partitioning phase, a 2D histogram is evaluated from A. The mode asso-
ciation clustering algorithm (Li et al., 2007) is applied to the histogram. This technique
partitions the feature space (Figure 6-left) into N small regions (as in Figure 6-right).
Niblack's local thresholding algorithm (Niblack, 1986) is used to label the N regions. Suppose that IMNiblack is a bi-level image generated from the original image using Niblack's thresholding algorithm. To classify a region Ri, the number of its pixels classified as text (Nt) or background (Nb) in the bi-level image is counted, and the classification runs as follows:
$$R_i = \begin{cases} \text{text}, & \text{if } N_t(R_i) > N_b(R_i) \\ \text{background}, & \text{otherwise} \end{cases}$$
After this process, the feature space has just two regions, as in Figure 6-right, and the final bi-level image is defined in a new thresholding operation based on this feature space. Figure 7 presents a sample image and the result of applying Valizadeh and Kabir's algorithm.
Figure 6. (left) Feature space partitioned into small regions and (right) these small regions
are grouped as text or background edges.
Figure 7. (left) Original old document and (right) resultant image by Valizadeh and Kabir.
Two situations were noticed in which the Valizadeh-Kabir algorithm does not work properly:
• Sometimes, the structural contrast may not improve the separability between ink and paper. This usually happens when the ink has faded and its color has become very close to the colors of the background or, with the same consequence, when smudges darken the paper, turning its colors close to the ink.
• There is a high dependency on the results of Niblack's algorithm. This algorithm evaluates the average and standard deviation of the colors in a window and relates them through a variable k. The authors suggested k = −0.2 for all images; using the same value for every image is not reasonable, and it is easy to find counterexamples (a sketch of Niblack's rule follows this list).
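Niblack's rule computes, for each pixel, a local threshold T(x, y) = m(x, y) + k · s(x, y) from the mean m and standard deviation s in a window around the pixel. A minimal sketch follows (the 15 × 15 window is an illustrative choice; k = −0.2 is the value suggested by the authors):

```python
import numpy as np
from scipy.ndimage import uniform_filter

def niblack_binarize(gray, window=15, k=-0.2):
    """Niblack's local thresholding: T(x, y) = m(x, y) + k * s(x, y)."""
    g = gray.astype(np.float64)
    mean = uniform_filter(g, size=window)
    sq_mean = uniform_filter(g * g, size=window)
    std = np.sqrt(np.maximum(sq_mean - mean ** 2, 0.0))
    threshold = mean + k * std
    return np.where(g > threshold, 255, 0).astype(np.uint8)  # paper=255, ink=0
```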
These issues led to the development of a new algorithm (Arruda-Mello), based on the work of Valizadeh and Kabir and published in Arruda and Mello (2014). The original method of Valizadeh and Kabir is used jointly with a new algorithm which creates a so-called “weak image” (an image whose major goal is background removal, which can lead to the removal of some ink elements). This is done through a normalized structural contrast image (SCNorm):
and

$$M(x, y)_{min} = \min\left[M_{SW}(P_k),\, M_{SW}(P_{k+1}),\, M_{SW}(P'_k)\right] \qquad (2.14)$$
I(x, y) is the gray value of the pixel p(x, y), and ε is an infinitesimally small positive number used to avoid division by zero. The neighborhood used for the evaluation of the structural contrast is the same presented in Figure 7. MSW(Pk) is the average of the pixel intensities inside a window centered at (x, y), evaluated as previously defined in Equation (2.11). This normalization enhances the text regions and softens the effects of contrast/brightness variations between text and background.
The normalized structural contrast, however, does not give good results for regions with very low contrast. To compensate for this problem, the SCNorm image is combined with the SC image, with α = (σ/128)^γ, where σ is the standard deviation of the document image intensity and γ is a pre-defined parameter, as proposed in Su et al. (2013). For a 256 gray-level image, α ∈ [0, 1] for any value of γ > 0. Both SCNorm and SCcomb are then combined:
Next come two binarization processes, both based on the Valizadeh and Kabir method. As said before, one creates a “weak image”, i.e., an image with possible loss of ink elements, while the other creates a “strong” image. The difference between them is that the first is created using the SC image and the second using SCMult; they also have different settings for the Niblack phase. A final post-processing step is applied to the weak image, restoring lost strokes based on the strong image. Figure 8 presents a sample image and the results generated by Valizadeh and Kabir and by Arruda-Mello.
Figure 8. (left) Original image and its bi-level version created by (center) Valizadeh and
Kabir and (right) Arruda-Mello.
perception (Goldstein and Brockmole, 2013). So, if one moves far from a document, its details will no longer be perceived; in this case, the text or ink part. However, the major colors belonging to the background will still be perceived. Figure 9 shows a simulation of what is expected to happen in this situation. As the observer moves away from the document, the text ceases to be seen, but the smudges on the paper are still visible.
Figure 9. (left) Original image with smears; (right) a simulation of what is perceived at
distance: although the text is not seen anymore, the marks of the smears are still perceived.
With this idea in mind, the method simulates increasing the distance between observer and document image through resizing and morphological operations. Other operations, such as histogram equalization, are also applied. As different stroke widths require different distances, the method starts by evaluating the thickness of the characters so that the correct distance can be simulated. The stroke width is estimated as the median of all the nearest-edge-pixel distances found by applying Sobel's operator (in the vertical direction) to the original image. Snellen's acuity test is the inspiration for defining the distance required so that the estimated ink is no longer perceived. The Snellen visual acuity test evaluates an individual's ability to detect a letter by measuring the minimum angle of resolution (the smallest target size, in angular subtense, that a person is capable of resolving).
In detail, the method (called POD - Perception of Objects by Distance) works as follows:
1. The stroke width is estimated and the corresponding observation distance is computed (this step is detailed below);
2. Two morphological closing operations are applied to the original image with disks as structuring elements (to achieve the rounded corners of objects);
3. The image is downsized to the size associated with the estimated distance (the size of the image that is formed on the observer's retina);
5. The absolute difference between the resized image and the original one is evaluated;
6. Dark pixels of the difference image are converted into white (as they represent a perfect match of tones from the background);
These steps create a grayscale image that still needs to be binarized. Although grayscale, it is composed mostly of background pixels, so a fixed cut-off value in general already provides a good result. However, to guarantee a better quality image, a specific approach for binarization is also proposed. Otsu's thresholding algorithm (Otsu, 1979) and the K-means clustering algorithm (MacQueen et al., 1967) are applied separately to the image generated after the 8th step presented above. A transition map is applied to the image produced by K-means in order to identify the text lines. These text lines are then used as a reference to clean the image produced by Otsu. A composition of this Otsu image and the K-means image creates the final bi-level image. Figure 10 illustrates the final result of applying the algorithm to the image of Figure 9-left.
Most of the algorithm is based on the application of image processing operations. However, one major step is focused on an aspect of human vision, so it is presented here in more detail: the first step, the distance estimation based on stroke width. As explained before, this estimation is the core of the algorithm, as the original idea comes exactly from what the human visual system perceives as the distance between observer and object increases.
The objective of this step is to lose the information about the ink so that just the pattern of the paper is perceived. It is natural to consider the stroke width as the feature that defines the required distance. The stroke thickness in the image is estimated through the application of Sobel's edge detector in the vertical direction: for each edge pixel, the distance to the nearest edge pixel in the horizontal direction is measured. Most of the points detected by an edge detector in a document image should belong to the edge of a character; on the other hand, the edge detector usually also detects some points that do not, such as edge points belonging to a smudge region. The thickness of the characters is defined as the median of all the nearest-edge-pixel distances calculated. One weakness of the method is that just one stroke width is considered per image; no variation is considered.
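A rough sketch of this estimate is given below; the edge-strength threshold is an assumption, and horizontal distances are approximated by the gaps between consecutive edge pixels in each row:

```python
import numpy as np
from scipy.ndimage import sobel

def estimate_stroke_width(gray, edge_thresh=50):
    """Median horizontal distance between neighboring edge pixels found
    by a Sobel filter responding to vertical edges."""
    edges = np.abs(sobel(gray.astype(np.float64), axis=1)) > edge_thresh
    gaps = []
    for row in edges:
        cols = np.flatnonzero(row)        # edge positions in this row
        if cols.size >= 2:
            gaps.extend(np.diff(cols))    # distance to the next edge pixel
    return float(np.median(gaps)) if gaps else 0.0
```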
Figure 10. Final result of the application of perception based binarization algorithm on the
image of Figure 9-left.
With the estimated stroke width, the distance can be evaluated. This is done with the inspiration of Snellen's acuity test. In this test, an observer is placed in front of a chart at a certain distance. The chart has letters of different complexities, and the test estimates the individual's ability to recognize a letter by measuring the minimum angle of resolution (MAR): the smallest target size, in angular subtense, that the observer can perceive. The Snellen acuity test is based on the standard that, to produce a minimal image on the retina, an object must subtend a visual angle of one minute of arc, 1′. As the characters used in the test are 5 rows high and each row subtends an angle of 1′, the angle subtended by the characters is equal to 5′. Due to contrast variations between the real test and what is displayed on a computer, a 3′ angle was adopted. Thus, to define how far the image must be from the observer so that the ink is no longer perceived, we evaluate at which distance an object the size of the estimated stroke thickness subtends an angle of 3′.
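The underlying geometry can be sketched in a few lines: an object of height h subtends an angle θ at distance d when tan(θ/2) = h/(2d). The DPI value below is an assumption needed to convert the stroke width from pixels to physical units:

```python
import math

def distance_for_angle(stroke_px, dpi=300, angle_arcmin=3.0):
    """Distance (in inches) at which a stroke of stroke_px pixels, scanned
    at dpi, subtends angle_arcmin minutes of arc."""
    h = stroke_px / dpi                        # stroke height in inches
    theta = math.radians(angle_arcmin / 60.0)  # minutes of arc -> radians
    return h / (2.0 * math.tan(theta / 2.0))

# Example: a 3-pixel stroke at 300 dpi subtends 3' at roughly 11.5 inches
print(round(distance_for_angle(3), 1))
```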
The need for a careful setting of the algorithm's parameters motivated another study, presented in Mesquita et al. (2015). The algorithm presented in Mesquita et al. (2014) depends on three parameters: the minimum angle of resolution and the radii of the two disks used in the closing morphological operations. Instead of considering the radii as two parameters, their difference was used, so just two variables need to be optimized. The I/F-Race algorithm (Birattari et al., 2010) is used to find the best solution to the problem. The algorithm with the best settings was submitted to the H-DIBCO 2014 contest (Ntirogiannis et al., 2014) and ranked first.
5. Graphical Documents
It is usual to think of documents as standard “white” paper (in the sense that the paper holds no element besides, maybe, guide lines) plus the ink of the text (handwritten or typewritten). However, there are several different types of documents, if we consider a document to be information stored on paper. For example, maps, floor plans, postcards and blueprints are documents, but with very different features compared to a usual letter. Even a letter can be written on letterhead, which adds a graphical element to its contents. These types of documents are called graphical documents: documents where the graphical information they present is also of some importance.
There are several applications for this kind of document. Some are common to all the different types of graphical documents (such as segmentation); others are more specific (such as raster-to-vector conversion for topographic maps).
A topographic map can be understood as a representation of a landscape, which can contain contour lines, relief, constructions and natural features (such as forests and lakes). Usually, maps contain their description in text regions. Other elements, such as illustrations or frames, can also be found; these are very common in old maps. These features can also be found in floor plans, making these two types of documents similar in some sense. Old maps, blueprints and plans are also usually drawn on textured or very soft paper (such as butter paper), which makes them more susceptible to degradation. One of the differences is that, in general, topographic maps are drawn with dark pens and sometimes painted with different colors to represent different kinds of regions, whereas floor plans can be drawn in pencil. Over time, the paper deteriorates (by the action of insects, fungi or humidity, or just because of its natural fragility) and the ink can also fade away. Figure 11 presents (left) part of an old map and (right) part of an old floor plan. They are shown zoomed in so that details can be better perceived.
From the images of Figure 11 it is possible to observe several of these aspects.
As an application of image processing of maps and plans, we can cite their automatic indexing, which can be achieved with a cleaner map. For this, the first step is binarization. This is not a simple task because of the degradation of the paper, folding marks (as the original maps and floor plans generally have large dimensions), damage, and so on. Moreover, some floor plans are drawn in pencil, which leaves the strokes very light. In Daniil et al. (2003), a study on scanning settings for the digitization of old maps is presented.
In Shaw and Bajcsy (2011), a segmentation algorithm is introduced for the automatic identification of regions in a map using reference regions in another map. The method matches regions even when there is some level of difference between them. The authors also presented a map scale estimation method to evaluate the real area of a region according to the scale.
Figure 11. Zooming into (left) an old map (uncertain date) and (right) an old floor plan
(dated from 1871).
The Valizadeh-Kabir algorithm is sensitive to the presence of the textured paper, so some noise (remains of the texture) is present. The images produced by Arruda-Mello and POD are cleaner, but POD's image better preserves the stroke width, as observed in the comparison of the handwritten text “Corpo posterior” (in Portuguese).
Figure 12. Sample map of Figure 11-left binarized by: (left) Valizadeh-Kabir, (center)
Arruda-Mello and (right) POD.
Figure 13. Sample floor plan of Figure 11-right binarized by: (top-left) Valizadeh-Kabir,
(top-right) Arruda-Mello and (bottom) POD.
6. Thresholding Evaluation
One of the major problems of any new approach is how to show that it is better than what was already done. In domains where the challenge is to develop faster algorithms, it is simple to measure a result; for thresholding, however, this is still a problem. Even for a typewritten document, where the result of an optical character recognition tool could be used to measure the quality of a thresholding algorithm, there are issues to be observed. Any known evaluation strategy for thresholding requires a gold standard (or ground truth): the expected best solution for the image. For document images, this could be the expected text file or the expected bi-level image. The major problem is how to create this gold standard and how to use it for comparison.
For a typewritten document, the ground truth can be a text file. The final bi-level image can be submitted to an optical character recognition tool and the resulting text file compared to the ground truth text file. In this case, someone has to have made the transcription of the original document image into text, which is quite a problem when there are thousands of documents to transcribe, as in an archive of old documents.
In the case of text analysis, text similarity algorithms are used for comparison. One of the most common metrics is the Levenshtein distance, which measures the total number of changes (insertions, deletions or substitutions) required to transform one word into another. More robust approaches are presented in Gomaa and Fahmy (2013), a survey on text similarity methods. As our focus is image processing, we do not go deeper into this line.
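For completeness, a minimal dynamic-programming sketch of the Levenshtein distance just described:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    needed to transform string a into string b."""
    prev = list(range(len(b) + 1))  # distances from '' to prefixes of b
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3
```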
For binary document images, the problem is also complex, and it also begins with the ground truth generation. Figure 14 makes the problem clear. In a grayscale image, there are two well-defined regions: the inside of the character (the right white square, in the ink) and the outside of the character (the left white square, in the paper). However, there is a third area for which this classification is not so clear: the frontier between ink and paper, the region where the digitization process introduces an aliasing between ink and paper areas in order to make the final image more pleasant to human perception. This is the area that can create difficulties in ground truth production, possibly generating different responses.
One possible solution is to use an edge detection algorithm (such as Canny's (Canny, 1986)) and let it detect the borders of the characters. A sample result can be seen in Figure 15; it is possible to see that the result of the algorithm (the black edge) is not what we could call the best solution. However, due to the multiplicity of solutions, even a supervised approach would not reach a unique solution. More about the construction of ground truth images can be found in Ntirogiannis et al. (2008).
Figure 14. There is a fuzzy area between the certain paper and certain ink (left and right
white squares respectively). It is not clear in this area which pixels belong to paper or ink.
Figure 15. (left) Zooming into an old document and (right) the borders of the characters as detected by Canny's algorithm (in black).
With the ground truth images in hand, the next step is to determine the quality of a binarization algorithm, and for this a quantitative assessment is needed. The following measures, described in Ntirogiannis et al. (2014, 2008), can be used to obtain such quantitative estimates:
• Precision
• Recall
• Accuracy
• Specificity
• F-Measure
Before describing the measures, some definitions necessary in the context of document imaging are presented:
• True positive (TP): number of pixels that are part of the ink and are correctly classified as ink;
• True negative (TN): number of pixels that are part of the paper and are correctly classified as paper;
• False positive (FP): number of pixels that are part of the paper but are wrongly classified as ink;
• False negative (FN): number of pixels that are part of the ink but are wrongly classified as paper.
For the use of these measures, it is necessary to have a ground truth reference image.
Precision. Is the fraction of retrieved instances that are relevant, defined by Equation (2.17):

$$Precision = \frac{TP}{TP + FP} \qquad (2.17)$$

A good algorithm must have $Precision \approx 1$; for this, FP must tend to zero, meaning few errors.
Recall. Also known as sensitivity, is the fraction of true positives that are retrieved. Recall is defined by Equation (2.18):

$$Recall = \frac{TP}{TP + FN} \qquad (2.18)$$

A good algorithm must have $Recall \approx 1$; then FN must tend toward zero.
Accuracy. Is the fraction of pixels that are correctly classified, defined by Equation (2.19):

$$Accuracy = \frac{TP + TN}{P + N} \qquad (2.19)$$

where

$$P = TP + FN \quad \text{and} \quad N = FP + TN \qquad (2.20)$$
Specificity. Is also called the true negative rate and measures the proportion of negatives that are correctly identified as such. Specificity is defined by Equation (2.21):

$$Specificity = \frac{TN}{FP + TN} \qquad (2.21)$$

A good algorithm must have $Specificity \approx 1$; then FP must tend toward zero.
F-Measure. Is the weighted harmonic mean of Precision and Recall, as defined by Equation (2.22):

$$FM = \frac{2 \times Recall \times Precision}{Recall + Precision} \qquad (2.22)$$
Misclassification penalty metric (MPM). Evaluates the prediction against the ground truth; misclassified pixels are penalized by their distance from the border of the ground truth object. The MPM is given by Equation (2.23):

$$MPM = \frac{MP_{FN} + MP_{FP}}{2} \qquad (2.23)$$
where

$$MP_{FN} = \frac{\sum_{i=1}^{FN} d^{i}_{FN}}{D} \quad \text{and} \quad MP_{FP} = \frac{\sum_{j=1}^{FP} d^{j}_{FP}}{D} \qquad (2.24)$$

and $d^{i}_{FN}$ and $d^{j}_{FP}$ are the distances of the i-th false negative and the j-th false positive pixel from the ground truth contour. The normalization factor D is the sum of all pixel-to-contour distances of the ground truth. An algorithm with a low MPM score is good at identifying an object's boundary.
Peak signal-to-noise ratio (PSNR). Measures how close the resulting image is to the ground truth image, and is defined by Equation (2.25):

$$PSNR = 10 \times \log\left(\frac{C^2}{MSE}\right) \qquad (2.25)$$

where MSE (Mean Square Error) is given by Equation (2.26):

$$MSE = \frac{\sum_{x=1}^{M} \sum_{y=1}^{N} \left(I(x, y) - I'(x, y)\right)^2}{M \times N} \qquad (2.26)$$

and C is the maximum color intensity (255 for an 8-bit grayscale image).
Negative Rate Metric (NRM). Is based on the discrepancy between the pixels of the resulting image and those of the ground truth. The NRM combines the false negative rate ($NR_{FN}$) and the false positive rate ($NR_{FP}$) and is defined by Equation (2.27):

$$NRM = \frac{NR_{FN} + NR_{FP}}{2} \qquad (2.27)$$

where

$$NR_{FN} = \frac{N_{FN}}{N_{FN} + N_{TP}} \quad \text{and} \quad NR_{FP} = \frac{N_{FP}}{N_{FP} + N_{TN}} \qquad (2.28)$$

and $N_{TP}$ represents the number of true positives, $N_{FP}$ the number of false positives, $N_{TN}$ the number of true negatives and $N_{FN}$ the number of false negatives. Unlike F-Measure and PSNR, binarization quality is best for low NRM values. The ideal algorithm should have both FN and FP tending to 0, and Precision, Recall, Accuracy and Specificity tending to 1. This is a way to compare the results of thresholding algorithms, as stated in Mello et al. (2008).
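A short sketch of how several of these measures can be computed from a binarization result and its ground truth (boolean arrays with True = ink; for simplicity, the sketch assumes no class is empty, so no division by zero occurs):

```python
import numpy as np

def binarization_metrics(result, ground_truth):
    """Precision, Recall, F-Measure and NRM from two boolean images
    (True = ink). Higher is better, except for NRM."""
    tp = np.sum(result & ground_truth)    # ink classified as ink
    fp = np.sum(result & ~ground_truth)   # paper classified as ink
    fn = np.sum(~result & ground_truth)   # ink classified as paper
    tn = np.sum(~result & ~ground_truth)  # paper classified as paper
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    fmeasure = 2 * recall * precision / (recall + precision)
    nrm = (fn / (fn + tp) + fp / (fp + tn)) / 2
    return precision, recall, fmeasure, nrm
```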
Conclusion
Document image thresholding, or binarization, is the initial step of most document image analysis systems and refers to the conversion of a color or grayscale document image into a bi-level image. The goal is to distinguish the text (ink) from the background (generally paper).
Although document image thresholding has been studied for many years, with several approaches proposed, it is still an unsolved problem. Different types of document degradation, such as foxing, uneven illumination, image contrast variation, back-to-front ink interference, etc. (as shown in Figure 1), make this problem non-trivial.
In this chapter, some state-of-the-art algorithms, including the winner of the H-DIBCO 2014 contest, were presented. The algorithms cover different approaches, such as edge-based methods, structural contrast, algorithms for graphical documents, and a new approach based on the human visual perception system.
In addition, measures for the quantitative evaluation of binarization quality were presented and discussed. For more information on recent advances in thresholding and other document image processing techniques, we recommend the proceedings of the following conferences: the International Conference on Document Analysis and Recognition (ICDAR), the ACM Symposium on Document Engineering (DocEng), the International Conference on Frontiers in Handwriting Recognition (ICFHR) and the Workshop on Document Analysis Systems (DAS). We also recommend following the International Journal on Document Analysis and Recognition (IJDAR), and looking for the DIBCO and H-DIBCO contests held annually in conjunction with some of these conferences.
References
Ahmed, S., Weber, M., Liwicki, M., and Dengel, A. (2011). Text/graphics segmentation in
architectural floor plans. In Document Analysis and Recognition (ICDAR), 2011 Inter-
national Conference on, pages 734–738. IEEE.
Almazán, J., Gordo, A., Fornés, A., and Valveny, E. (2014). Segmentation-free word spot-
ting with exemplar svms. Pattern Recognition, 47(12):3967–3978.
Arruda, A. and Mello, C. A. B. (2014). Binarization of degraded document images based
on combination of contrast images. In Frontiers in Handwriting Recognition (ICFHR),
2014 14th International Conference on, pages 615–620. IEEE.
Ávila, B. T. and Lins, R. D. (2005). A fast orientation and skew detection algorithm for
monochromatic document images. In Proceedings of the 2005 ACM symposium on Doc-
ument engineering, pages 118–126. ACM.
Bernsen, J. (1986). Dynamic thresholding of grey-level images. In International conference
on pattern recognition, volume 2, pages 1251–1255.
Birattari, M., Yuan, Z., Balaprakash, P., and Stützle, T. (2010). F-race and iterated f-race:
An overview. In Experimental methods for the analysis of optimization algorithms, pages
311–336. Springer.
Brodić, D., Mello, C. A. B., Maluckov, Č. A., and Milivojevic, Z. N. (2014). An ap-
proach to skew detection of printed documents. Journal of Universal Computer Science,
20(4):488–506.
Daniil, M., Tsioukas, V., Papadopoulos, K., and Livieratos, E. (2003). Scanning options
and choices in digitizing historic maps. In Proc. of CIPA 2003 International Symposium,
Antalya, Turkey, September.
Dezso, B., Elek, I., and Máriás, Z. (2009). Image processing methods in raster-vector
conversion of topographic maps. In Proceedings of the 2009 International Conference
on Artificial Intelligence and Pattern Recognition, pages 83–86.
El-Hussainy, M. S., Baraka, M. A., and El-Hallaq, M. A. (2011). A methodology for image
matching of historical maps. e-Perimetron, 6(2):77–95.
Ghircoias, T. and Brad, R. (2011). Contour lines extraction and reconstruction from topo-
graphic maps. Ubiquitous Computing and Communication Journal, 6(2):681–691.
Joseph, A., Babu, J. S., Jayaraj, P., and KB, B. (2012). Objective quality measures in
binarization. International Journal of Computer Science and Information Technologies,
3(4):4784–4788.
Leyk, S. and Boesch, R. (2010). Colors of the past: color image segmentation in historical
topographic maps based on homogeneity. GeoInformatica, 14(1):1–21.
Li, J., Ray, S., and Lindsay, B. G. (2007). A nonparametric statistical approach to clustering
via mode identification. Journal of Machine Learning Research, 8(Aug):1687–1723.
Lu, S., Su, B., and Tan, C. L. (2010). Document image binarization using background esti-
mation and stroke edges. International Journal on Document Analysis and Recognition
(IJDAR), 13(4):303–314.
MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate
observations. In Proceedings of the fifth Berkeley symposium on mathematical statistics
and probability, volume 1, pages 281–297. Oakland, CA, USA.
Marr, D. and Hildreth, E. (1980). Theory of edge detection. Proceedings of the Royal
Society of London B: Biological Sciences, 207(1167):187–217.
Mello, C. A. B. (2012). Digital document analysis and processing. Nova Science Publish-
ers, New York.
Mello, C. A. B., Sanchez, A., Oliveira, A., and Lopes, A. (2008). An efficient gray-level
thresholding algorithm for historic document images. Journal of Cultural Heritage,
9(2):109–116.
Mello, C. A. B. and Machado, S. (2014). Text segmentation in vintage floor plans and
maps using visual perception. In Systems, Man and Cybernetics (SMC), 2014 IEEE
International Conference on, pages 3476–3480. IEEE.
Mesquita, R. G., Mello, C. A. B., and Almeida, L. (2014). A new thresholding algorithm for
document images based on the perception of objects by distance. Integrated Computer-
Aided Engineering, 21(2):133–146.
Mesquita, R. G., Silva, R. M., Mello, C. A. B., and Miranda, P. B. (2015). Parameter
tuning for document image binarization using a racing algorithm. Expert Systems with
Applications, 42(5):2593–2603.
Ntirogiannis, K., Gatos, B., and Pratikakis, I. (2008). An objective evaluation methodol-
ogy for document image binarization techniques. In Document Analysis Systems, 2008.
DAS’08. The Eighth IAPR International Workshop on, pages 217–224. IEEE.
Ntirogiannis, K., Gatos, B., and Pratikakis, I. (2014). Icfhr2014 competition on handwritten
document image binarization (h-dibco 2014). In Frontiers in Handwriting Recognition
(ICFHR), 2014 14th International Conference on, pages 809–813. IEEE.
Otsu, N. (1979). A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics, 9(1):62–66.
Roe, E. and Mello, C. A. B. (2013). Binarization of color historical document images using
local image equalization and xdog. In Document Analysis and Recognition (ICDAR),
2013 12th International Conference on, pages 205–209. IEEE.
Sánchez, A., Mello, C. A. B., Suárez, P. D., and Lopes, A. (2011). Automatic line and word
segmentation applied to densely line-skewed historical handwritten document images.
Integrated Computer-Aided Engineering, 18(2):125–142.
Sezgin, M. and Sankur, B. (2004). Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging, 13(1):146–168.
Shaw, T. and Bajcsy, P. (2011). Automated image processing of historical maps. SPIE
Newsroom.
Su, B., Lu, S., and Tan, C. L. (2010). Binarization of historical document images using the
local maximum and minimum. In Proceedings of the 9th IAPR International Workshop
on Document Analysis Systems, pages 159–166. ACM.
Su, B., Lu, S., and Tan, C. L. (2013). Robust document image binarization technique for
degraded document images. IEEE transactions on image processing, 22(4):1408–1417.
Valizadeh, M., Komeili, M., Armanfard, N., and Kabir, E. (2009). Degraded document
image binarization based on combination of two complementary algorithms. In Advances
in Computational Tools for Engineering Applications, 2009. ACTEA’09. International
Conference on, pages 595–599. IEEE.
Chapter 3

HISTORICAL DOCUMENT PROCESSING
1. Introduction
Historical manuscript collections are an important source of original information, providing access to historical data and supporting cultural documentation over the years. This chapter reports on recent advances and ongoing developments in historical handwritten document processing. It outlines the main challenges involved and the different tasks that have to be implemented, as well as practices and technologies that currently exist in the literature. The focus is on the most promising techniques, as well as on existing datasets and competitions that can prove useful to historical handwritten document processing research.
The main tasks that have to be implemented in the historical document image recognition pipeline include preprocessing for image enhancement and binarization, segmentation for the detection of the main page elements, text lines and words, and, finally, recognition. In cases where optical recognition is expected to give poor results, keyword spotting has been proposed as a substitute for full-text recognition.
The organization of this chapter is as follows. Section “Preprocessing” gives an
overview of document image enhancement and binarization methods while section “Seg-
mentation” presents layout analysis, text line and word segmentation state-of-the-art tech-
niques for historical handwritten documents. In section “Handwritten Text Recognition
(HTR)” the focus is on the pure recognition task which can be accomplished on text line,
∗ E-mail address: [email protected].
† E-mail address: [email protected].
‡ E-mail address: [email protected].
§ E-mail address: [email protected].
word or character level. Finally, in section “Keyword spotting” recent advances on search-
ing for a keyword directly on the historical document images are presented.
2. Preprocessing
The conservation and readability of historical manuscripts are often compromised by sev-
eral types of degradations which not only reduce the legibility of the historical documents
but also affect the performance of subsequent processing such as document layout analysis
(DLA) and handwritten text recognition (HTR); therefore a preprocessing procedure be-
comes essential. Once an efficient preprocessing stage has been applied, the performance
of processing systems is improved while at the same time preprocessed and enhanced doc-
uments become more attractive to users such as humanities scholars.
The term degradation has been defined by Baird (2000) as follows: “By degradation (or
defects), we mean every sort of less-than-ideal properties of real document images”. On
the basis of origin, degradations can be classified into three different categories. Historical
document images may contain degradations due to (i) the image acquisition process, as well as (ii) the environmental conditions and ageing (e.g., humidity, manipulation, unsuitable storage). Specifically for handwritten documents, (iii) the use of quill pens is also responsible for several degradations (e.g., seeping of ink from the reverse side, varying amounts of ink and pressure applied by the writer). According to this categorization, degradations
of historical manuscripts can be:
(i) Speckles, salt and pepper noise, blurring, shadows, low-resolution artifacts, curva-
ture of the document
(ii) Non-uniform background and illumination changes due to paper deterioration and
discoloration, spots and poor contrast due to humidity, smudges, holes, folding marks
(iii) Faint characters, bleed-through, presence of large stains and smears
Taking into account the type of the enhancement methodology which should be applied,
historical document image degradations are also categorized into background degradations,
foreground degradations as well as global degradations (Drira, 2006). Concerning the first
category, degradations consist of artifacts in the background (e.g., bleed-through), for which classification methods should be applied in order to separate these artifacts from the useful
textual information. Foreground degradations affect textual information (e.g. faint charac-
ters) and should be restored by the enhancement procedure. Finally, the last category refers
to degradations which affect the entire document, such as geometrical distortions, in which
the enhancement stage is oriented towards modelling the image degradations. Examples of
degraded historical manuscripts are depicted in Figure 1.
Several historical handwritten document image preprocessing techniques have been re-
ported in the literature. Each of these techniques depends on a certain context of use and is
intended to process a precise type of degradations or a combination of them. These tech-
niques fall broadly into two main categories according to the type of the produced document
image: (i) document image enhancement methods and (ii) document image binarization
methods.
Document image enhancement methods aim to improve the quality of the original color
or grayscale image. The produced document image after the enhancement procedure is also
a color or grayscale image. On the other hand, document image binarization refers to the
conversion of a color/grayscale image into a binary image. The main goal is not only to
enhance the readability of the image but also to separate the useful textual content from
the background by categorizing all the pixels as text or non-text without missing any useful
information. Techniques of the former category are also used as a preparation stage for the binarization methods.
In the remainder of this section, the major enhancement and binarization techniques for
historical handwritten documents will be presented along with the corresponding evaluation
protocols.
fixed number of iterations. Another blind approach has been proposed by Wolf (2010). It
is based on separate Markov Random Field (MRF) regularization for the recto and verso
side, where separate priors are derived from the full graph. The segmentation algorithm is
based on Bayesian Maximum a Posteriori (MAP) estimation. Finally, Villegas and Toselli
(2014) presented an enhancement method based on learning a discriminative color channel
by considering a set of labeled local image patches. The user should explicitly point out, for some sample pages, which parts are bleed-through and which parts are clean text, so that the method can adapt to the characteristics of each document. The technique is intended to be part of an interactive transcription system whose objective is to obtain high-quality transcriptions with the least human effort.
All the above mentioned techniques focus on the correction of the bleed-through effect.
Several other degradations have been addressed by enhancement methods. For example,
Shi and Govindaraju (2004a) proposed a background light intensity normalization algo-
rithm suitable for historical manuscripts with uneven background. A linear model is used
adaptively to approximate the paper background. Then the document image is transformed
according to the approximation to a normalized image that shows the foreground text on a
relatively even background. The method works for grayscale as well as color images. An
example of an enhanced manuscript produced by this method is depicted in Figure 3. In
Gangamma et al. (2012), a restoration method was proposed in order to eliminate noise and uneven background and to enhance the contrast of the manuscripts. The proposed method com-
bines two image processing techniques, a spatial filtering technique and grayscale mathe-
matical morphology operations. Furthermore, Saleem et al. (2014) proposed a restoration
method in order to reduce the background noise and enhance the text information. A sliding
window is applied in order to calculate the local minimum and maximum pixel intensities
which are used for image normalization.
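By way of illustration, the following Python sketch stretches each pixel against its local minimum and maximum computed over a sliding window, in the spirit of the above normalization step; the window size and the epsilon guard are our assumptions rather than the parameters of Saleem et al. (2014).

import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter

def normalize_local_minmax(gray, window=31):
    """Stretch each pixel against the local min/max inside a sliding window."""
    g = gray.astype(np.float32)
    lo = minimum_filter(g, size=window)          # local minimum intensity
    hi = maximum_filter(g, size=window)          # local maximum intensity
    return (g - lo) / np.maximum(hi - lo, 1e-6)  # normalized image in [0, 1]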
Finally, enhancement techniques which are based on the hyperspectral imaging system
(HSI) using special equipment have been reported in the literature. HSI is useful for many
tasks related to document conservation and management as it provides detailed quantitative
measurements of the spectral reflectance of the document that is not limited to the visible
spectrum. Joo Kim et al. (2011) proposed an enhancement strategy for historical documents
captured by a hyperspectral imaging system. This method tries to combine an original RGB
image with images taken in the near-IR range in order to preserve the texture of the image. To this end, an enhancement step dedicated to the removal of artifacts is performed in the gradient domain. In a similar way, Hollaus et al. (2014) presented an enhancement
method for multispectral images of historical manuscripts. The proposed method is based
on the Linear Discriminant Analysis (LDA) technique. LDA is a supervised technique and
hence a labeling of training data is required. For this purpose, two different labeling strate-
gies are proposed, which are both based on spatial information. One method is concerned
with the enhancement of non-degraded image regions and the other technique is applied
on degraded image portions. The resulting images are afterwards merged into the final
enhancement result.
Although various enhancement techniques have been proposed, no standard perfor-
mance evaluation methodology exists. Most of the evaluations concentrate on visual inspection of the resulting document image. The performance of these techniques is based on
subjective human evaluation; hence objective evaluations among the different techniques
cannot be obtained. For example, in Tan et al. (2002) the enhanced manuscripts were vi-
sually inspected in order to count the number of words that are fully restored. The per-
formance of the system is measured in terms of precision and recall according to the total
number of words in the original image.
Another strategy to evaluate enhancement techniques is the use of OCR as a means for
indirect evaluation by comparing the OCR performance on original and enhanced images.
However, in many cases, such as in historical handwritten documents, a meaningful OCR
is not always feasible. In Tonazzini et al. (2007) and Wolf (2010), the authors presented
restoration examples of historical manuscripts but they carried out the OCR evaluation on
historical printed documents. On the other hand, in Villegas and Toselli (2014), Saleem
et al. (2014) and Hollaus et al. (2014) the restoration performance is evaluated by means of
HTR using historical handwritten datasets.
Figure 4. Background surface estimation using the method of Gatos et al. (2006): (a) original image and (b) background surface.
Lu et al. (2010) presented a binarization method which makes use of the estimated document background surface. The original image was normalized
and the text stroke edges were detected. Finally, the local threshold was based on the
local number of the detected text stroke edges and their mean intensity. This method is
based on the local contrast for the final thresholding and hence some bleed-through or noisy
background components of high contrast remain. Another binarization method which is
based on background subtraction is presented by Ntirogiannis et al. (2014a). The proposed
binarization method was developed specifically for historical handwritten document images
and it comprises several steps. In the first step, background estimation is performed using
an inpainting procedure initialized by a local binarization method. In the sequel, image
normalization is applied to correct large background variations. Then, a global and a local
binarization method are applied to the normalized image and their results are combined
at connected component level. Intermediate processing to remove small noisy connected
components is also applied. This method could miss textual information in an attempt to
clear the background from noisy components or bleed-through.
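A minimal sketch of the global/local combination idea is given below, assuming a grayscale image with dark text; the actual method of Ntirogiannis et al. (2014a) additionally performs background inpainting, normalization and connected-component-level merging, which are omitted here, and the pixel-wise agreement rule is a simplifying assumption.

import numpy as np
from skimage.filters import threshold_otsu, threshold_sauvola

def combined_binarization(gray, window=25):
    global_bin = gray < threshold_otsu(gray)                        # global decision
    local_bin = gray < threshold_sauvola(gray, window_size=window)  # local decision
    # keep only the pixels on which both decisions agree (a simplification)
    return np.logical_and(global_bin, local_bin)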
Edge-based techniques are another category of binarization methods which usually use
a measure of the intensity changes across an edge (local contrast computation). For exam-
ple, in Su et al. (2010) the image contrast is calculated (based on the local maximum and
minimum intensity) and the edges are detected using a global binarization method. Com-
pared with the image gradient, the image contrast evaluated by the local maximum and
minimum has a nice property which makes it more tolerant to the uneven illumination and
other types of document degradation such as smear. The document text is then segmented
by using local thresholds that are estimated from the detected high contrast pixels within a
local neighborhood window. This method is capable of removing the majority of the back-
ground noise and bleed-through, but it is less effective in detecting faint characters. An extension of this work is presented by the same authors in Su et al. (2013), which addresses the over-normalization problem of the previous work. The proposed method is simple, robust
and capable of handling different types of historical manuscript degradations with mini-
mum parameter tuning. It makes use of the adaptive image contrast that combines the local
image contrast and the local image gradient adaptively and therefore is tolerant to the text
and background variation.
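The local contrast computation can be sketched as follows; the 3x3 neighborhood and the use of Otsu thresholding to pick out the high-contrast pixels follow the description above, but the exact parameters are our illustrative choices.

import numpy as np
from scipy.ndimage import maximum_filter, minimum_filter
from skimage.filters import threshold_otsu

def high_contrast_pixels(gray, window=3):
    g = gray.astype(np.float64)
    hi = maximum_filter(g, size=window)
    lo = minimum_filter(g, size=window)
    contrast = (hi - lo) / (hi + lo + 1e-6)     # local image contrast
    return contrast > threshold_otsu(contrast)  # True at likely stroke edges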
A decomposition method has been presented by Chen and Leedham (2005) for thresh-
olding degraded historical documents. Chen and Leedham proposed an algorithm which
recursively decomposes a document image into subregions until appropriately weighted
values can be used to select a suitable single-stage thresholding method for each region.
The decomposition algorithm uses local feature vectors to analyze and find the best ap-
proach to threshold a local area. A new mean-gradient-based method to select the threshold
for each subregion is also proposed. Moreover, multi-scale approaches have been used in
some works in order to separate the text from the background. A grid-based modeling has
been introduced by Farrahi Moghaddam and Cheriet (2010). This method is able to improve
the binarization results and restore weak connections and strokes, especially in the case of
degraded historical documents. Using the fast, grid-based versions of adaptive methods,
multi-scale methods are created which are able to recover the text on several scales and
restore document images with complex backgrounds that suffer from intensity degradation.
The authors also presented an adaptive modification of the global Otsu binarization method,
called AdOtsu. Finally, in the recent work (Afzal et al., 2015) the binarization problem is
treated as a sequence learning problem. The document image is considered as a 2D se-
quence of pixels and in accordance to this, a 2D Long Short-Term Memory (LSTM) is
employed for the classification of each pixel as text or background. The proposed method
processes the information using local context and then propagates the information globally
in order to achieve better visual coherence. It requires no parameter tuning and works well
without any feature extraction. While learning methods typically require large amounts of training data from similar types of images, this method works efficiently with a limited amount of data.
Performance evaluation strategies of document image binarization techniques can be
classified in three main categories: (i) visual inspection of the final result (Gatos et al.,
2006), (ii) indirect evaluation by taking into account the OCR performance of the binary
image with respect to character and word accuracy (Farrahi Moghaddam and Cheriet, 2010)
and (iii) direct evaluation by taking into account the pixel-to-pixel correspondence between
the ground truth and the binary image. Direct evaluation is based either on synthetic or real
images. A performance evaluation methodology which focuses on historical documents
containing complex degradations has been proposed by Ntirogiannis et al. (2013). It is a
pixel-based evaluation methodology which introduces two new measures, namely pseudo-
Recall and pseudo-Precision. It makes use of the distance from the contour of the ground
truth to minimize the penalization around the character borders, as well as the local stroke
width of the ground truth components to provide improved document-oriented evaluation
results. In addition, useful error measures, such as broken and missed text, character en-
largement and merging, background noise and false alarms, were defined that make more
evident the weakness of each binarization technique being evaluated.
Furthermore, a series of document image binarization contests (DIBCO and H-DIBCO)
have been organized in the context of the ICDAR and ICFHR conferences in order to iden-
tify current advances in document image binarization using established evaluation perfor-
mance measures. DIBCO contests (Gatos et al. (2009), Pratikakis et al. (2011), Pratikakis
et al. (2013)) consist of handwritten and machine-printed document images while, on the other hand, H-DIBCO contests (Pratikakis et al. (2010), Pratikakis et al. (2012), Ntirogiannis
et al. (2014b)) contain only handwritten document images. The ground-truth binary images
were created following a semi-automatic procedure. Tables 1-3 illustrate performance eval-
uation results of several binarization methods which were mentioned above, using DIBCO
2009 (Gatos et al., 2009) and H-DIBCO 2010 (Pratikakis et al., 2010) datasets in terms
of F-Measure, PSNR, Negative Rate Metric (NRM) and Misclassification Penalty Metric
(MPM). The final ranking was calculated after sorting the accumulated ranking value for all
measures. Concerning DIBCO 2009 dataset, which consists of handwritten and machine-
printed document images, evaluation results using only the handwritten images are also
presented. As the evaluation results indicate, the method developed by Ntirogiannis et al.
(2014a) outperforms all the other techniques concerning the handwritten document images.
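For reference, two of these measures can be computed as in the sketch below, assuming binary images with text pixels equal to 1 and background pixels equal to 0.

import numpy as np

def fmeasure_psnr(result, gt):
    """F-Measure and PSNR between a binary result and its ground truth."""
    tp = np.logical_and(result == 1, gt == 1).sum()
    fp = np.logical_and(result == 1, gt == 0).sum()
    fn = np.logical_and(result == 0, gt == 1).sum()
    recall = tp / max(tp + fn, 1)
    precision = tp / max(tp + fp, 1)
    f = 2 * recall * precision / max(recall + precision, 1e-12)
    mse = np.mean((result.astype(np.float64) - gt) ** 2)
    psnr = 10 * np.log10(1.0 / mse) if mse > 0 else float("inf")
    return 100 * f, psnr  # F-Measure in percent, PSNR in dB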
3. Segmentation
Document segmentation is introduced in the first steps of the document processing pipeline
and corresponds to the correct localization of the main page elements of a document. This
step is further divided into the layout analysis, text line segmentation and word segmentation stages. All the abovementioned stages are very important since their success plays a significant role in the accuracy of the final recognition result. This section is dedicated to
the analytical presentation of the three stages, with respect to the challenges appearing on
historical handwritten documents, the latest achievements found in the literature as well as
evaluation results which reflect the level of maturity of each stage.
Figure 5. (a) Latin document of two columns with ornamental characters for each paragraph
(Baechler and Ingold, 2011), (b) Arabic document with complex layout due to the existence
of side-note text (Bukhari et al., 2012), (c) Latin document image with complex layout from
the Bentham dataset (Gatos et al., 2014). Notice the existence of ruler lines, the stamp and
page number on the top right as well as the deleted text on the first text line.
Layout analysis methods reported in the literature can be classified into two distinct
categories, namely bottom-up and top-down approaches. Bottom-up methods start from
small entities of the document image (e.g. pixels, connected components). These entities
are grouped into larger homogeneous areas leading to the creation of the final regions of
interest. On the contrary, top-down methods start from the document image and repeatedly
split it to smaller areas according to specific rules which, finally, correspond to distinct
regions of interest. An alternative taxonomy can be defined in the case that training data
exist. According to this taxonomy, there exists the category of supervised methods which
assume the existence of an already annotated dataset serving as the training part used to
train an algorithm for distinguishing the regions of interest. Methods that do not make use
of any prior knowledge and thus no training is involved, are said to belong to the category
of unsupervised methods.
Several layout methods for historical handwritten documents have been reported in the
literature. Nicolas et al. (2006) proposed to use Markov Random Fields for the task of
complex handwritten document segmentation and presented an application of the method on
Flaubert’s manuscripts. The authors report 90.3% in terms of global labeling rate (GLR) and 88.2% in terms of normalized labeling rate (NLR) using the Highest Confidence First (HCF)
image labeling method on a set of 23 document images of the Flaubert’s manuscripts. The
task considered consists of labeling the main regions of a manuscript, i.e. text body, margins,
header, footer, page number and marginal annotations. Bulacu et al. (2007) presented a
layout analysis method applied on the archive of the cabinet of the Dutch Queen which
consists of the generation of a coarse layout of the document by finding the page borders, the
rule lines of the index table and the handwritten text lines grouped into decision paragraphs.
Due to the lack of ground truth information, visual evaluation was performed on a dataset of 1040 document images, showing encouraging results. Baechler and Ingold (2011) described a
generic layout analysis system for historical documents. Their implementation used a so-called Dynamic Multi-Layer Perceptron (DMLP), which is a natural extension of MLP
classifiers. The system was evaluated on medieval documents for which a multi-layer model
was used to discriminate among 10 classes organized hierarchically.
Bukhari et al. (2012) introduced an approach which segments text appearing in page
margins (see Figure 5b). An MLP classifier was used to classify connected components into the relevant text class, together with a voting scheme in order to refine the resulting seg-
mentation and produce the final classification. The authors report a segmentation accuracy
of 95% on a dataset of 38 Arabic historical document images. Asi et al. (2014) worked on the same Arabic dataset, proposing a learning-free approach to detect the main text area
in ancient manuscripts. They refine an initial segmentation using a texture-based filter by
formulating the problem as an energy minimization task and achieving the minimum us-
ing graph cuts. This method is shown to outperform that of Bukhari et al. (2012), achieving an
accuracy of 98.5%.
Cohen et al. (2013) presented a method to segment historical document images into
regions of different content. A first segmentation is achieved using a binarized version of
the document, leading to a separation of text elements from non-text elements. A refine-
ment of the segmentation of the non-text regions into drawings, background and noise is
achieved by exploiting spatial and color features to guarantee coherent regions. The authors
report approximately 92% and 90% segmentation accuracy of drawings and text elements,
respectively, on a historical dataset of 252 pages. Gatos et al. (2014) proposed a text zone
detection aiming to handle several challenging cases such as horizontal and vertical rule
lines overlapping with the text as well as two column documents. The authors reported an
accuracy of 84.7% for main zone detection on a dataset consisting of 300 pages.
A general remark concerning the abovementioned methods is that a direct compari-
son cannot be made in order to clearly understand which method is superior with respect
to the others. The main reason is that each work uses different data for evaluation, dif-
ferent evaluation metrics and, most importantly, different page elements are detected per
method. Table 4 presents the categorization of the abovementioned methods into the different taxonomies described, as well as the number of different page elements detected by each
method.
Table 4. Categorization of state-of-the-art layout analysis methods
Figure 6. Challenges encountered on historical handwritten document images for text line
segmentation. (a) Difference in the skew angle between lines on the page or even along the
same text line, (b) overlapping text lines, (c) touching text lines, (d) additions above a text
line, (e) deleted text.
Smearing methods include the fuzzy run length smoothing algorithm (RLSA) (Shi and
Govindaraju, 2004b), the adaptive local connectivity map method (Shi et al., 2005) and the
proposal of Kennard and Barrett (2006). The fuzzy RLSA measure is calculated for every
pixel on the initial image and describes “how far one can see when standing at a pixel along
horizontal direction”. By applying this measure, a new grayscale image is created which
is binarized and the lines of text are extracted from the new image. The input to the adap-
tive local connectivity map method is a grayscale image (Shi et al., 2005). A new image
is calculated by summing the intensities of each pixel’s neighbors in the horizontal direc-
tion. Since the new image is also a grayscale image, a thresholding technique is applied
and the connected components are grouped into location maps by using a grouping method.
Kennard and Barrett (2006) presented a novel method for locating lines within free-form
handwritten historical documents. Their method used an approach to find initial text line
candidates which resembles the adaptive local connectivity map. The fuzzy RLSA as well
as the adaptive local connectivity map method were evaluated using manuscripts written by
Galileo, Newton and Washington showing a correct location rate of 93% and 95%, respec-
tively. The method proposed by Kennard and Barrett was tested on 20 document images which were part of the Washington collection, as well as 6 document images downloaded from the “Trails of Hope” collection, showing encouraging performance.
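The core of the adaptive local connectivity map can be sketched in a few lines; the window length below is an assumption, not the value used by Shi et al. (2005).

import numpy as np
from scipy.ndimage import uniform_filter1d

def connectivity_map(gray, window=61):
    """Each pixel becomes the sum of intensities over a horizontal window."""
    g = gray.astype(np.float32)
    # a horizontal moving average, scaled back to a neighborhood sum
    return uniform_filter1d(g, size=window, axis=1) * window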
Garz et al. (2012) proposed a text line segmentation method belonging to the grouping
category which is binarization-free (input is a grayscale image), robust to noise and can
cope with overlapping and touching text lines. First, interest points representing parts of
characters are extracted from gray-scale images. At a next step, word clusters are identified
in high density regions and touching components such as ascenders and descenders are sep-
arated using seam carving. Finally, text lines are generated by concatenating neighboring
word clusters, where neighborhood is defined by the prevailing orientation of the words
in the document. Experiments conducted on the Latin manuscript images of the Saint
Gall database (historical dataset) showed promising results for real-world applications in
terms of both accuracy and efficiency. The work of Kleber et al. (2008) also belongs to the
grouping category of methods. In this work, the authors presented an algorithm for ruling
estimation of Glagolitic texts based on text line extraction, which is suitable for degraded manuscripts since it extrapolates the baselines using a priori knowledge of the ruling. The
algorithm was tested on 30 pages of the Missale Sinaiticum and the evaluation was based
on visual criteria.
Hough based methods include the work of Louloudis et al. (2009). In this work, text line
segmentation was achieved by applying the Hough transform on a subset of the document
image connected components. A post-processing step included the correction of possible
false alarms, the detection of text lines that the Hough transform failed to create and finally
the efficient separation of vertically connected characters using a novel method based on
skeletonization. The authors evaluated the method on a historical dataset of 40 images
coming from the historical archive of the University of Athens as well as from the collection
of George Washington using an established evaluation protocol which was first described
in ICDAR 2007 Handwriting Segmentation Contest (Gatos et al., 2007). They reported
an F-Measure of 99%. A hybrid method belonging to the Hough transform and grouping
categories was proposed by Malleron et al. (2009). In this work, text line detection was
modelled as an image segmentation problem by enhancing text line structure using Hough
transform and a clustering of connected components in order to detect text line boundaries.
Experiments showed that the proposed method can achieve high accuracy for detecting
text lines in regular and semi-regular handwritten pages in the corpus of digitized Flaubert
manuscripts.
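For illustration, a Hough accumulator over connected-component centroids can be built as follows; restricting the angle to near-horizontal directions and the accumulator resolutions are our assumptions, and peak detection and post-processing are omitted.

import numpy as np

def hough_accumulator(points, n_theta=180, n_rho=200):
    """points: (N, 2) array of (x, y) component centroids."""
    pts = np.asarray(points, dtype=np.float64)
    thetas = np.deg2rad(np.linspace(80.0, 100.0, n_theta))  # near-horizontal lines
    rho_max = np.hypot(pts[:, 0].max(), pts[:, 1].max())
    rho_bins = np.linspace(-rho_max, rho_max, n_rho)
    acc = np.zeros((n_rho, n_theta), dtype=np.int64)
    for x, y in pts:
        rho = x * np.cos(thetas) + y * np.sin(thetas)       # one rho per theta
        idx = np.clip(np.digitize(rho, rho_bins) - 1, 0, n_rho - 1)
        acc[idx, np.arange(n_theta)] += 1                   # vote along the sinusoid
    return acc, rho_bins, thetas                            # peaks in acc ~ text lines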
Text line segmentation methods based on the seam carving principle were recently pre-
sented (Saabni et al. (2014), Arvanitopoulos and Süsstrunk (2014)). They try to segment
text lines by finding an optimal path on the background of the document image travelling
from the left to the right edge. Saabni et al. (2014) proposed a method which computes
an energy map of a text image and determines the seams that pass across and between text
lines. Two different algorithms were described (one for binary and one for grayscale im-
ages). Concerning the first algorithm (binary case), each seam passed through the middle of a text line and marked the components that make up its letters and words. At a final step, the unmarked components were assigned to the closest text line. For the second algo-
rithm (grayscale case) the seams were calculated on the distance transform of the grayscale
image. Arvanitopoulos and Süsstrunk (2014) proposed an algorithm based on seam carving
to compute separating seams between text lines. Seam carving is likely to produce seams
that move through gaps between neighboring lines, if no information about the text geom-
etry is incorporated into the problem. By constraining the optimization procedure inside
the region between two consecutive text lines, robust separating seams can be produced
that do not pass through word and line components. Extensive experimental evaluation on
diverse manuscript pages showed improvement compared with the state-of-the-art for text
line extraction in grayscale images.
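The dynamic-programming core of such a left-to-right separating seam can be sketched as below; using the raw grayscale intensity as the energy map (dark ink = high cost) is a simplifying assumption, and the constrained optimization of Arvanitopoulos and Süsstrunk (2014) is not reproduced here.

import numpy as np

def horizontal_seam(energy):
    """Return, per column, the row of a minimal-energy left-to-right path."""
    h, w = energy.shape
    cost = energy.astype(np.float64).copy()
    back = np.zeros((h, w), dtype=np.int64)
    for x in range(1, w):
        for y in range(h):
            lo, hi = max(y - 1, 0), min(y + 2, h)
            k = int(np.argmin(cost[lo:hi, x - 1]))   # best predecessor row
            back[y, x] = lo + k
            cost[y, x] += cost[lo + k, x - 1]
    seam = np.zeros(w, dtype=np.int64)
    seam[-1] = int(np.argmin(cost[:, -1]))
    for x in range(w - 1, 0, -1):                    # backtrack the optimal path
        seam[x - 1] = back[seam[x], x]
    return seam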
Other methodologies which cannot be grouped to a specific category include the works
of Baechler et al. (2013), Chen et al. (2014) and Pastor-Pellicer et al. (2015). In more detail,
Baechler et al. proposed a text line extraction method for historical documents which works in
two steps. In the first step, layout analysis is performed to recognize the physical structure
of a given document using a classification technique. In the second step, the algorithm ex-
tracts the text lines starting from the layout recognition result. The system was evaluated
on three historical datasets with a test set of 49 pages. The best obtained hit rate for text
lines was 96.3%. Chen et al. (2014) used a pyramidal approach where at the first level,
pixels are classified into: text, background, decoration, and out of page; at the second level,
text regions are split into text line and non-text areas. Finally, the text line segmentation re-
sults were refined by a smoothing post-processing procedure. The proposed algorithm was
evaluated on three historical manuscript image datasets of diverse nature and achieved an
average precision of 91% and recall of 84%. Finally, Pastor-Pellicer et al. (2015) proposed
a text line extraction method with two contributions: first, supervised machine learning was
used for the extraction of text-specific interest points; second, the problem of bottom-up text line aggregation was reformulated as noise-robust combinatorial optimization. In a final step,
unsupervised clustering eliminates invalid text lines. Experimental evaluation on the IAM
Saint Gall historical dataset showed promising results.
Although a direct comparison of the abovementioned techniques cannot be made due
to the fact that most methods use their own datasets and evaluation measures for measuring
their method’s performance, Table 5 briefly summarizes the size of the datasets as well as the accuracy achieved by each method, to give an idea of the performance of state-of-the-art methods.
Table 5. Comparison of performance for state-of-the-art text line segmentation
methods.
Word segmentation is an important stage for segmentation-based handwriting recognition and word spotting methods. There are several challenges that need to be addressed by a word
segmentation method (see Figure 7). These include the skew along a text line, the existence of slant among characters, punctuation marks which tend to reduce the inter-word distance, and the non-uniform spacing of words.
Algorithms dealing with word segmentation in the literature are based primarily on the
analysis of the geometric relationship between adjacent components. Related work for the
problem of word segmentation differs in two aspects. The first aspect is the way the distance
of adjacent components is calculated, while the second aspect concerns the approach used
to classify the previously calculated distances as either between-word gaps or within-word
gaps. Most of the methodologies described in the literature have a preprocessing stage
which includes noise removal, skew and slant correction.
Many distance metrics are defined in the literature. Seni and Cohen (1994) presented
eight different distance metrics. These include the bounding box distance, the minimum
and average run-length distance, the Euclidean distance and different combinations of them
which depend on several heuristics. Louloudis et al. (2009) proposed to use a combination
of the Euclidean and the convex hull distance for the distance calculation stage, while using
a novel gap classification method based on Gaussian mixture modeling. The authors report
an F-Measure accuracy of 85.5% on a collection of 40 historical document images. It
is assumed that the input of the word segmentation algorithm is the automatic text line
segmentation result produced by their method.
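The gap classification step can be illustrated with scikit-learn as follows; fitting a two-component mixture to the gap distances and labelling the component with the larger mean as the between-word class is a simplified reading of the approach, not the authors' exact formulation.

import numpy as np
from sklearn.mixture import GaussianMixture

def classify_gaps(distances):
    """True for gaps classified as between-word, False for within-word."""
    d = np.asarray(distances, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(d)
    between_word = int(np.argmax(gmm.means_.ravel()))  # larger-mean component
    return gmm.predict(d) == between_word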
A different approach was proposed by Manmatha and Rothfeder (2005). In this work,
a novel scale space algorithm for automatically segmenting handwritten (historical) doc-
uments into words was described. The first step concerns image cleaning, followed by
a gray-level projection profile algorithm for finding lines in images. Each line image is
then filtered with an anisotropic Laplacian at several scales. This procedure produces blobs
which correspond to portions of characters at small scales and to words at larger scales.
Crucial to the algorithm is scale selection, i.e. finding the optimum scale at which blobs correspond to words. This is done by finding the maximum over scale of the extent or area of
the blobs. This scale maximum is estimated using three different approaches. The blobs re-
covered at the optimum scale are then bounded with a rectangular box to recover the words.
A postprocessing filtering step is performed to eliminate boxes of unusual size which are
unlikely to correspond to words. The approach was tested on a number of different data
sets and it was shown that, on 100 sampled documents from the George Washington histor-
ical corpus of handwritten document images, a total error rate of 17% was observed. The
technique outperformed a state-of-the-art word segmentation algorithm on this collection.
As can be observed from the abovementioned descriptions, there is a lack of works dealing with the problem of word segmentation on historical documents. The main reason
is related to the fact that recent methods for handwritten text recognition avoid the error-
prone stage of word segmentation and thus start from the text lines in order to produce the
final transcription. In addition, the challenges met in the word segmentation step for historical document collections do not exhibit large differences from the challenges encountered on modern collections. To this end, word segmentation methods developed for
modern data may be used for the cases of historical data.
The work of Frinken et al. (2013) focuses on the language modelling aspect and demonstrates a recognition system that can cope with very large vocabularies of several hundred thousand words. It uses limited but accurate n-grams obtained from the training set and augments the language model with a very large vocabulary obtained from different sources. A sliding window is moved over
the binary text line image to extract a sequence of 9 geometric features (Marti and Bunke,
2001): 3 global features which include the fraction of black pixels, the center of gravity
and the second order moment as well as 6 local features which consist of the position of
the upper and lower contour, the gradient of the upper and lower contour, the number of
black-white transitions and the fraction of black pixels between the contours. The database
used in this work is the RODRIGO database which corresponds to a single-writer Spanish
text written in 1545. Most of the pages consist of a single block of well-separated lines of
calligraphic text (853 pages, 20356 lines) (see Figure 8b). The set of lines was divided
into three different sets: training (10000 lines), validation (5010 lines), and test (5346 lines).
The out-of-vocabulary rate of the test set is 6.15% given the vocabulary of the training and
validation set. With the inclusion of external language sources, the out-of-vocabulary rate
was significantly reduced from 6.15% to 2.80% (-3.35%) and by doing so, the recognition
rate increased from 82.73% to 85.22% (+2.49%).
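Part of this feature set can be computed directly; the sketch below extracts, per image column, the three global features and the upper/lower contour positions for a binary text line with ink pixels equal to 1, while the remaining local features of Marti and Bunke (2001) are omitted.

import numpy as np

def column_features(line):
    """line: (h, w) binary text line image. Returns a (w, 5) feature matrix."""
    h, w = line.shape
    rows = np.arange(h, dtype=np.float64)
    feats = np.zeros((w, 5))
    for x in range(w):
        col = line[:, x].astype(np.float64)
        n = col.sum()
        feats[x, 0] = n / h                                      # fraction of black pixels
        feats[x, 1] = (rows * col).sum() / n if n else h / 2.0   # center of gravity
        feats[x, 2] = (rows ** 2 * col).sum() / (n * h * h) if n else 0.0  # second moment
        ink = np.flatnonzero(col)
        feats[x, 3] = ink[0] if n else 0                         # upper contour position
        feats[x, 4] = ink[-1] if n else h - 1                    # lower contour position
    return feats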
Traditional modelling approaches based on Hidden Markov optical character models
(HMM) and an N-gram language model (LM) have been used in Toselli and Vidal (2015)
for the recognition of the historical Bentham dataset used in the ICFHR-2014 HTRtS com-
petition (Sanchez et al., 2014) (see Figure 8a). A set of 433 page images is used in this
competition while 9198 text lines are used for training, 1415 for validation and 860 for
testing. Starting from the very basic N-gram-HMM baseline system provided in HTRtS,
several improvements are made in text image feature extraction, LM and HMM modelling,
including more accurate HMM training by means of discriminative training. A narrow
sliding window is horizontally applied to the line image for feature extraction. Geometric
moments are used to perform some geometric normalizations to the images within each
analysis window. The word error rate (WER) reported for the proposed system is 18.5%
while the character error rate is 7.5%. These results are close to those achieved by deep
and/or recurrent neural networks, including networks using BLSTM units.
Figure 8. Representative pages from (a) the Bentham, (b) the RODRIGO, (c) the George
Washington and (d) the Parzival datasets.
A handwriting recognition approach based on graph similarity features is proposed in Fischer et al. (2010). The proposed graph similarity features rely on the
idea of first transforming the image of a handwritten text into a large graph. Then local
subgraphs are extracted using a sliding window that moves from left to right over the large
graph. Each local subgraph is compared to a set of prototype graphs (each representing a
letter from the alphabet) using a well-known graph distance measure. This process results
in a vector consisting of n distance values for each local subgraph. Finally, the sequence of
vectors obtained for the complete word image is input to a HMM-recognizer. The proposed
method is tested on the medieval Parzival dataset (13th century). The manuscript is written
in the Middle High German language with ink on parchment. Although several writers
have contributed to the manuscript, the different writing styles are very similar (see Figure 8d).

Figure 9. Exemplary image portions of (a) Old Greek Early Christian manuscripts, (b) Glagolitic characters in the Missale Sinaiticum, (c) Nom scripts and (d) historical Chinese scripts.

11,743 word images are considered that contain 3,177 word classes and 87 characters
including special characters that occur only once or twice in the dataset. The word images
are divided into three distinct sets for training, validation, and testing. Half of the words are used for training and a quarter of the words for validation and testing, respectively. For
each of the 74 characters present in the training set, a prototype graph is extracted from a
manually selected template image. For five characters, two prototypes are chosen because
two completely different writing styles were observed, resulting in a set of 79 prototypes.
Consequently, the graph similarity features have a dimension of 79. A word recognition
accuracy of 94% is reported.
Two state-of-the-art recognizers originally developed for modern scripts are applied to
medieval handwritten documents in Fischer et al. (2009). The first is based on HMMs
and the second uses a Neural Network with a BLSTM architecture. Both word recogniz-
ers are based on 9 geometric features after applying a sliding window in the word image
(Marti and Bunke, 2001). A Middle High German vocabulary is used without taking any
language model into account. Each word is modelled by an HMM built from the trained
letter HMMs and the most probable word is chosen using the Viterbi algorithm. For the
NN based approach, the input layer contains one node for each of the nine geometrical fea-
tures and is connected with two distinct recurrent hidden layers. Both hidden layers are in
turn connected to a single output layer. The network is bidirectional, i.e., the feature vector
sequence is fed into the network in both the forward and the backward mode. The output
layer contains one node for each possible letter in the sequence plus a special ε node to
indicate “no letter”. For experimental evaluation, the Parzival database was used (45 pages
of 4478 text lines). The set of all words is divided into a distinct training, validation, and
test set. Half of the words are used for training and a quarter of the words for validation and
testing, respectively. The NN-based recognizer with a BLSTM architecture outperformed
the HMM-based recognizer with statistical significance (recognition rate: 93.32% for the
NN-based recognizer, 88.69% for the HMM-based recognizer).
checked and corrected through a graphical user interface. The class labels of character
patterns can be also fixed in the recognition step with another OCR version that can rec-
ognize an extended set of character categories. Finally, the documentation step completes
the document recognition process by adding the character codes and layout information.
The proposed character recognition system uses a k-d tree for coarse classification and the
modified quadratic discriminant function (MQDF2) for fine classification. Training patterns
were artificially generated from 27 Chinese, Japanese, and Nom character fonts since the
three languages share a considerable number of character categories, and ground truth real
patterns are not available for most Nom categories. Confining the character categories used
for recognition in the first stage to the 7660 most frequently appearing categories increased
the recognition rate to 66.92% from 55.50% for the extended set, which reduced the time
and labour needed to manually tag unrecognized patterns.
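The coarse-to-fine idea can be sketched with a k-d tree over class mean vectors; the fine_score argument below is a hypothetical placeholder for the MQDF2 computation, which in practice involves per-class covariance eigen-decompositions.

import numpy as np
from scipy.spatial import cKDTree

def classify(feature, class_means, fine_score, k=10):
    """Coarse stage: k nearest class means; fine stage: re-rank with fine_score."""
    tree = cKDTree(class_means)             # in practice built once and reused
    _, candidates = tree.query(feature, k=k)
    # fine stage: re-rank the k candidates with the more expensive score
    return min(candidates, key=lambda c: fine_score(feature, c))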
A transfer learning method based on Convolutional Neural Network (CNN) is proposed
in Tang et al. (2016) for historical Chinese character recognition. A CNN model is first
trained by printed Chinese character samples. The network structure and weights of this
model are used to initialize another CNN model, which is regarded as the feature extractor
and classifier in the target domain. This is then fine-tuned by a few labelled historical or
handwritten Chinese character samples, and used for final evaluation. The target domain
includes 57,409 historical Chinese characters collected from Dunhuang historical Chinese
documents (see Figure 9d). The results show that the recognition accuracy of the CNN-based transfer learning method increases significantly as the number of samples for fine-tuning increases (∼70% when using up to 50 labelled samples of each character for fine-tuning).
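A minimal PyTorch sketch of this transfer recipe follows; the architecture, the 3,755-class output size (a common printed character set size) and all hyperparameters are illustrative assumptions, not the setup of Tang et al. (2016).

import torch
import torch.nn as nn

def make_cnn(num_classes):
    # assumes 32x32 single-channel character images; layer sizes are illustrative
    return nn.Sequential(
        nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(64 * 8 * 8, num_classes),
    )

source = make_cnn(num_classes=3755)          # trained on printed samples (not shown)
target = make_cnn(num_classes=3755)
target.load_state_dict(source.state_dict())  # transfer structure and weights
# fine-tune `target` on the few labelled historical samples:
optimizer = torch.optim.SGD(target.parameters(), lr=1e-3, momentum=0.9)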
5. Keyword Spotting
In cases where optical recognition is deemed to be very difficult or expected to give poor
results, word spotting or keyword spotting has been proposed to substitute full text recog-
nition. In word spotting the user queries the document database for a given word, and the
spotting system is expected to return to the user a number of possible locations of the query
in the original document. Keyword spotting was originally proposed as an alternative for Automatic Speech Recognition (Rohlicek et al., 1989), as far back as 1989; in the mid-90s the first keyword spotting systems for document content began to appear (Manmatha and Croft, 1997).
The user may select the query by drawing a bounding box around the desired word in
the scanned document image, or select the word from a collection of presegmented word
images. This scenario is known as Query-by-example (QBE) in the literature. QBE key-
word spotting is akin to content-based image retrieval (CBIR) (Sfikas et al., 2005). Both
approaches follow the same paradigm, in the sense that the user defines an image query and
the underlying system is required to detect matches of the query over a database. Features
are extracted from the query and all database images, which are then used to build image
descriptors. The descriptor of the query is then matched against the database images using
a suitable distance metric. The database images that are found to be closest to the query are labelled as matches and returned to the user.
The alternative to QBE is to expect the user to type in the query as a string, in which
case we have a Query-by-string (QBS) scenario. QBS also presupposes a pool of words,
available as either a set of segmented words or segmented lines, for which the corresponding
transcription is available. As in QBS we do not have image information about the query,
the QBE/CBIR scheme of description building and matching cannot be applied.
The taxonomy of word spotting and recognition systems further includes the distinc-
tion into segmentation-based and segmentation-free systems. Segmentation-based methods
assume that the scanned document image is segmented into layout elements down to the word level. Segmentation-free approaches work with no such prior segmentation of the doc-
ument; such approaches may be advantageous when the layout of the input image is too
complex and the segmentation is expected to be of poor quality.
Machine learning methods have been employed in document understanding with much success, compared with the more standard learning-free approaches in docu-
ment processing. Their basic assumption is that we can see word spotting and recognition
as a (usually) supervised learning problem, where part of the data are expected to be la-
belled. Labelling in the document analysis context typically means that image data is associated beforehand with a known alphanumeric transcription. At training time, the parameters
of a suitable model are optimised using the labelled data. Learning-based methods in gen-
eral are much more accurate than learning-free methods, even though their performance is
dependent on the size and suitability of the training set compared to the test data. State-
of-the-art learning-based models today include models based on Hidden Markov Models
(HMM) and more recently models based on Neural Networks (NN) (España-Boquera et al.,
2011; Frinken et al., 2010a,b).
Zoning features have been used as an inexpensive way to build an efficient, fixed-length
descriptor. In zoning features, the image is split into a fixed number of zones, forming a
canonical grid over the image area. For each zone, a local descriptor is computed. In Sfikas
et al. (2016), features extracted from a pre-trained Convolutional Neural Network (CNN)
are computed per image zone, and then combined into a single, word-level fixed-length
descriptor. Fixed-length descriptors have the advantage that they can be easily compared
using an inexpensive distance such as the Euclidean or the Cosine distance (provided of
course, that the comparison makes sense).
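For instance, ranking a database of fixed-length descriptors against a query by cosine distance takes only a few lines; the sketch below is generic and not tied to any of the specific descriptors cited above.

import numpy as np

def rank_by_cosine(query, database):
    """database: (N, d) matrix of descriptors; returns indices, best match first."""
    q = query / np.linalg.norm(query)
    db = database / np.linalg.norm(database, axis=1, keepdims=True)
    return np.argsort(1.0 - db @ q)   # cosine distance to each database row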
Since the beginning of the past decade, gradient-based features have successfully been
used for various computer vision applications (Dalal and Triggs, 2005; Lowe, 2004; Aho-
nen et al., 2006). Histograms of Gradients (HoG) (Dalal and Triggs, 2005) and Scale In-
variant Feature Transform (SIFT) (Lowe, 2004) features describe an image area based on
local gradient information. In the context of word image description, they can be used to
efficiently encode stroke information locally. Gradient-based features are encoded into a
single, word-level descriptor with an encoding / aggregation technique. In this direction,
the Bag of Visual Words (BoVW) model has been employed (Aldavert et al., 2015). Input
descriptors are used to learn a database-wide model that plays the role of a visual codebook,
used subsequently to encode local descriptors of new images. Fisher Vectors (FV) extend
on the BoVW paradigm, by learning a Gaussian Mixture Model (GMM) over the pool of
gradient-based features, and using a measure of dissimilarity to the learned GMM to en-
code descriptors. FVs, coupled with SIFTs, have been shown to lead to very powerful models
for keyword spotting (Almazan et al., 2014b; Sfikas et al., 2015).
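The basic BoVW encoding can be sketched as follows; the codebook size and the plain k-means clustering are illustrative choices, and the actual pipelines cited above are considerably richer.

import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_descriptors, k=64):
    """Cluster the pooled local descriptors into k visual words."""
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_descriptors)

def bovw_histogram(descriptors, codebook):
    """Describe one word image by its normalized visual-word histogram."""
    words = codebook.predict(descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(np.float64)
    return hist / max(hist.sum(), 1.0)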
When word-level segmentation is not available, the word-to-word matching paradigm is
evidently not applicable directly. Segmentation-free QBE can be useful when the scanned
page is deemed too difficult to be segmented into word images correctly. One family of
segmentation-free QBE approaches follows the approach of computing local keypoints on
the unsegmented image. These keypoints are then matched with corresponding keypoints
on the query image. In Leydier et al. (2007), an elastic matching method is used to match
gradient-based keypoints. Heat kernel signature-based features are used as keypoints in
Zhang and Tan (2013). Another approach to segmentation-free QBE spotting is to use a slid-
ing window over the unsegmented image (Rothacker et al., 2013; Almazan et al., 2014a).
As matching a template against a whole document image can be computationally expensive, assuming that a canonical grid of matching positions is used, heuristics have been proposed to bypass scanning the entire grid (Kovalchuk et al., 2014),
or techniques to speed up matching, like product quantization (Almazan et al., 2014a).
In Almazan et al. (2014b) an attribute-based model has been proposed that uses super-
vised learning to learn its parameters. In this model, word attributes (Farhadi et al., 2009)
are defined as a set of variates, each one of which corresponds to a certain word characteristic. This characteristic may be the presence or absence of a certain character, bigram,
or character diacritic (Sfikas et al., 2015). For each word image a vector of attributes is
learned, encoding image information for each input. Attributes are then used together with
ground-truth transcriptions to learn a projection from either attributes or transcriptions to
a common latent subspace. It is worth noting that this model can be used to perform both
QBE and QBS.
QBS keyword spotting for handwritten documents is performed typically with systems
that are based either on Hidden Markov Models (HMM) or the more recently used Recurrent
Neural Networks (RNN). Both families of methods use supervised learning, so training with
an annotated set is necessary before the models can be used in test mode, i.e. word spotting
per se.
HMM models (Bishop, 2006) consist of two basic components. The first component is a series of hidden states that form a Markov chain. A finite number of possible hidden states is assigned to each possible character beforehand. These states are not directly observed from the data features, hence their characterisation as “hidden”. Given each state, a distribution of
observed features is defined. Emissions are typically modelled with a GMM (Bishop, 2006;
Toselli and Vidal, 2013). Replacing GMM emissions with a standard feed-forward Neural
Network has also shown good results (España-Boquera et al., 2011; Thomas et al., 2015).
HMM training is performed using the Baum-Welch algorithm to learn model parameters
(Bishop, 2006). One HMM for each character is used to model the appearance of possible
characters. Using a lexicon of possible words, the score of each word can be computed
using the Viterbi algorithm for decoding (Bishop, 2006). Character HMMs can also be used to create a single HMM “filler” model, which can be used to decode inputs and detect words without a lexicon (Puigcerver et al., 2014).
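The scoring step can be illustrated with a compact log-domain Viterbi routine; the uniform start distribution and the assumption that per-frame emission log-probabilities are given are simplifications of the sketch.

import numpy as np

def viterbi_score(log_trans, log_emit):
    """log_trans: (S, S) transition log-probs; log_emit: (T, S) per-frame
    emission log-probs. Returns the best-path log-likelihood."""
    T, S = log_emit.shape
    delta = log_emit[0].copy()     # start in any state (uniform prior assumed)
    for t in range(1, T):
        # best predecessor for each state, then add the current emission
        delta = np.max(delta[:, None] + log_trans, axis=0) + log_emit[t]
    return float(np.max(delta))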
HMM-based models were the state-of-the-art models for keyword spotting in handwrit-
ten documents (as well as for handwriting recognition systems), before the recent success
of RNN-based models. Following the success of Fully-Connected and Convolutional Neu-
ral Networks in just about every field of Computer Vision, Recurrent NNs have been shown to be especially well-suited for sequential data. Document data are modelled as sequences of
column-based features. Text lines are typically used as both training and test data for RNNs,
as is the case with the Bidirectional Long-Short term Memory Neural Network models
(BLSTM-NN) (Frinken et al., 2012). BLSTM-NNs owe their name to a special part of their
architecture, called Long Short-term Memory Blocks (LSTMs). LSTMs are used in order to
mitigate the vanishing gradient problem (Frinken et al., 2012). The more recent Sequential
Processing Recurrent Neural Networks (SPRNN) replace BLSTM-NN’s LSTM cells and
bidirectional architecture with a different kind of architectural cell and Multidirectional /
Multidimensional layers.
where rel(k) is an indicator function equal to 1 if the word at rank k is relevant and 0 other-
wise. After a set of queries is defined and the test system is used to retrieve matches for these queries, metrics are evaluated per-query. Mean Average Precision (MAP), defined as the average value of the AP over all evaluation queries, is then computed.
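In symbols, denoting by P(k) the precision at cut-off k, by R the number of words relevant to a query, by n the number of retrieved words and by Q the set of evaluation queries, the standard definitions read:

\[ \mathrm{AP} = \frac{1}{R}\sum_{k=1}^{n} P(k)\,\mathrm{rel}(k), \qquad \mathrm{MAP} = \frac{1}{|Q|}\sum_{q \in Q}\mathrm{AP}(q). \]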
In Tables 7 and 8 we show evaluation results for several recent keyword spotting meth-
ods. These methods are the NN-based Zoning Aggregated Hypercolumns (ZAH) (Sfikas
et al., 2016), Attribute-based model (Almazan et al., 2014b), HOG/LBP-based method (Ko-
valchuk et al., 2014), Inkball model (Howe, 2013), Projections of Oriented Gradients (POG)
(Retsinas et al., 2016), BoVW-based (Aldavert et al., 2015), elastic-matching model (Ley-
dier et al., 2009) and template-matching model (Pantke et al., 2014). Some of these methods
are available both as segmentation-based and segmentation-free methods (Kovalchuk et al.,
2014; Howe, 2013).

Table 8. Comparison of performance of segmentation-free keyword spotting methods
on the Bentham database (values in %).

        Kovalchuk et al. (2014)   Howe (2013)   Pantke et al. (2014)   Leydier et al. (2009)
MAP     41.6                      36.3          33.7                   20.5
P@5     60.9                      55.6          54.3                   33.5

3 http://www.iam.unibe.ch/fki/databases/iam-historical-document-database
4 http://users.iit.demokritos.gr/~nstam/GRPOLY-DB/GRPOLY-DB-Handwritten.rar

Also, we must note that the CNN-based ZAH model (Sfikas et al.,
2016) and the attribute-based model (Almazan et al., 2014b) require a learning step; how-
ever, learning is assumed to be performed on a different set than the one that was used for
testing (“pre-training”). ZAH is pre-trained on a large collection of street-view text, and
the attribute-based model is pre-trained on the George Washington collection. All other
methods do not require a learning phase. The best performance is given by the POG model
(Retsinas et al., 2016) on the segmentation-based track, and the HOG/LBP-based model
(Kovalchuk et al., 2014) on the segmentation-free track. It is worth noting that both winning
methods rely on extracting gradient-based features. This fact validates the effectiveness of
such features as descriptors of handwritten content.
                  QbS                                              QbE
        Strauß et al. (2016)   Puigcerver et al. (2015)   Strauß et al. (2016)   Puigcerver et al. (2015)
MAP     87.1                   38.3                       85.2                   19.5
P@5     87.4                   48.3                       85.5                   23.5
(values in %)
References
Afzal, M. Z., Pastor-Pellicer, J., Shafait, F., Breuel, T. M., Dengel, A., and Liwicki, M.
(2015). Document image binarization using LSTM: A sequence learning approach. In
Ahonen, T., Hadid, A., and Pietikainen, M. (2006). Face description with local binary
patterns: Application to face recognition. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 28(12):2037–2041.
Aldavert, D., Rusiñol, M., Toledo, R., and Llados, J. (2015). A study of bag-of-visual-words
representations for handwritten keyword spotting. International Journal on Document
Analysis and Recognition, 18(3):223–234.
Almazan, J., Gordo, A., Fornes, A., and Valveny, E. (2014a). Segmentation-free word
spotting with exemplar SVMs. Pattern Recognition, 47(12):3967 – 3978.
Almazan, J., Gordo, A., Fornes, A., and Valveny, E. (2014b). Word spotting and recog-
nition with embedded attributes. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 36(12):2552–2566.
Arvanitopoulos, N. and Süsstrunk, S. (2014). Seam carving for text line extraction on color
and grayscale historical manuscripts. In Frontiers in Handwriting Recognition (ICFHR),
2014 14th International Conference on, pages 726–731.
Asi, A., Cohen, R., Kedem, K., El-Sana, J., and Dinstein, I. (2014). A coarse-to-fine
approach for layout analysis of ancient manuscripts. In Frontiers in Handwriting Recog-
nition (ICFHR), 2014 14th International Conference on, pages 140–145.
Avidan, S. and Shamir, A. (2007). Seam carving for content-aware image resizing. ACM
Trans. Graph., 26(3).
Baechler, M., Liwicki, M., and Ingold, R. (2013). Text line extraction using DMLP classifiers
for historical manuscripts. In Proceedings of the 2013 12th International Conference on
Document Analysis and Recognition, ICDAR ’13, pages 1029–1033, Washington, DC,
USA. IEEE Computer Society.
Baird, H. (2000). State of the art of document image degradation modeling. In 4th Interna-
tional Workshop on Document Analysis Systems (DAS) Invited talk, pages 1–16. IAPR.
Bar-Yosef, I., Hagbi, N., Kedem, K., and Dinstein, I. (2009). Line segmentation for de-
graded handwritten historical documents. In Proceedings of the 2009 10th International
Conference on Document Analysis and Recognition, ICDAR ’09, pages 1161–1165,
Washington, DC, USA. IEEE Computer Society.
Bukhari, S. S., Breuel, T. M., Asi, A., and El-Sana, J. (2012). Layout analysis for Arabic
historical document images using machine learning. In Proceedings of the 2012 Interna-
tional Conference on Frontiers in Handwriting Recognition, ICFHR ’12, pages 639–644,
Washington, DC, USA. IEEE Computer Society.
Bulacu, M., van Koert, R., Schomaker, L., and van der Zant, T. (2007). Layout analy-
sis of handwritten historical documents for searching the archive of the cabinet of the
dutch queen. In Ninth International Conference on Document Analysis and Recognition
(ICDAR 2007), volume 1, pages 357–361.
Chen, K., Wei, H., Hennebert, J., Ingold, R., and Liwicki, M. (2014). Page segmentation for
historical handwritten document images using color and texture features. In Frontiers in
Handwriting Recognition (ICFHR), 2014 14th International Conference on, pages 488–
493. IEEE.
Chen, Y. and Leedham, G. (2005). Decompose algorithm for thresholding degraded his-
torical document images. IEE Proceedings - Vision, Image and Signal Processing,
152(6):702–714.
Cohen, R., Asi, A., Kedem, K., El-Sana, J., and Dinstein, I. (2013). Robust text and draw-
ing segmentation algorithm for historical documents. In Proceedings of the 2Nd Inter-
national Workshop on Historical Document Imaging and Processing, HIP ’13, pages
110–117, New York, NY, USA. ACM.
Dalal, N. and Triggs, B. (2005). Histograms of oriented gradients for human detection.
In Schmid, C., Soatto, S., and Tomasi, C., editors, Proceedings of the IEEE Computer
Society Conference on Computer Vision and Pattern Recognition (CVPR), volume 2,
pages 886–893.
Drira, F. (2006). Towards restoring historic documents degraded over time. In Proceed-
ings of the Second International Conference on Document Image Analysis for Libraries,
DIAL ’06, pages 350–357, Washington, DC, USA. IEEE Computer Society.
Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. (2009). Describing objects by their at-
tributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recog-
nition (CVPR), pages 1778–1785.
Fischer, A., Riesen, K., and Bunke, H. (2010). Graph similarity features for HMM-based
handwriting recognition in historical documents. In Proceedings of the 12th International
Conference on Frontiers in Handwriting Recognition (ICFHR), pages 253–258.
Fischer, A., Wuthrich, M., Liwicki, M., Frinken, V., Bunke, H., Viehhauser, G., and Stolz,
M. (2009). Automatic transcription of handwritten medieval documents. In Proceedings
of the 15th International Conference on Virtual Systems and Multimedia (VSMM), pages
137–142.
Frinken, V., Fischer, A., and Bunke, H. (2010a). A novel word spotting algorithm using
bidirectional long short-term memory neural networks. In Proceedings of the 4th Work-
shop on Artificial Neural Networks in Pattern Recognition, volume 5998, pages 185–196.
Frinken, V., Fischer, A., Bunke, H., and Manmatha, R. (2010b). Adapting BLSTM neural
network based keyword spotting trained on modern data to historical documents. In Pro-
ceedings of the 12th International Conference on Frontiers in Handwriting Recognition
(ICFHR), pages 352–357, IEEE Computer Society, Washington, DC, USA.
Frinken, V., Fischer, A., Manmatha, R., and Bunke, H. (2012). A novel word spotting
method based on recurrent neural networks. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 34(2):211–224.
Frinken, V., Fischer, A., and Martínez-Hinarejos, C.-D. (2013). Handwriting recognition
in historical documents using very large vocabularies. In Proceedings of the 2nd Inter-
national Workshop on Historical Document Imaging and Processing (HIP2013), pages
66–72.
Garz, A., Fischer, A., Sablatnig, R., and Bunke, H. (2012). Binarization-free text line
segmentation for historical documents based on interest point clustering. In Document
Analysis Systems (DAS), 2012 10th IAPR International Workshop on, pages 95–99.
Gatos, B., Louloudis, G., and Stamatopoulos, N. (2014). Segmentation of historical hand-
written documents into text zones and text lines. In Frontiers in Handwriting Recognition
(ICFHR), 2014 14th International Conference on, pages 464–469.
Gatos, B., Ntirogiannis, K., and Pratikakis, I. (2009). ICDAR 2009 document image binarization contest (DIBCO 2009). In 2009 10th International Conference on Document Analysis
and Recognition, pages 1375–1382.
Gatos, B., Pratikakis, I., and Perantonis, S. J. (2006). Adaptive degraded document image
binarization. Pattern Recogn., 39(3):317–327.
Gatos, B., Stamatopoulos, N., Louloudis, G., Sfikas, G., Retsinas, G., Papavassiliou, V., Sunistira, F., and Katsouros, V. (2015). GRPOLY-DB: An old Greek polytonic document image database. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 646–650. IEEE.
Hollaus, F., Diem, M., and Sablatnig, R. (2014). Improving OCR accuracy by applying enhancement techniques on multispectral images. In Pattern Recognition (ICPR), 2014 22nd International Conference on, pages 3080–3085.
Howe, N. R. (2013). Part-structured inkball models for one-shot handwritten word spot-
ting. In Proceedings of the 12th International Conference on Document Analysis and
Recognition (ICDAR), pages 582–586.
Joo Kim, S., Deng, F., and Brown, M. S. (2011). Visual enhancement of old documents
with hyperspectral imaging. Pattern Recogn., 44(7):1461–1469.
Kleber, F., Sablatnig, R., Gau, M., and Miklas, H. (2008). Ancient document analysis based
on text line extraction. In Pattern Recognition, 2008. ICPR 2008. 19th International
Conference on, pages 1–4.
Kovalchuk, A., Wolf, L., and Dershowitz, N. (2014). A simple and fast word spotting
method. In Proceedings of the 14th International Conference on Frontiers in Handwriting
Recognition (ICFHR), pages 3–8.
Lavrenko, V., Rath, T., and Manmatha, R. (2004a). Holistic word recognition for handwrit-
ten historical documents. In Proceedings of the Workshop on Document Image Analysis
for Libraries (DIAL), pages 278–287.
Lavrenko, V., Rath, T. M., and Manmatha, R. (2004b). Holistic word recognition for hand-
written historical documents. In Proceedings of the 1st International Workshop on Doc-
ument Image Analysis for Libraries, pages 278–287.
Leydier, Y., Bourgeois, F. L., and Emptoz, H. (2007). Text search for medieval manuscript images. Pattern Recognition, 40(12):3552–3567.
Leydier, Y., Ouji, A., LeBourgeois, F., and Emptoz, H. (2009). Towards an omnilingual
word retrieval system for ancient manuscripts. Pattern Recognition, 42(9):2089–2105.
Likforman-Sulem, L., Zahour, A., and Taconet, B. (2007). Text line segmentation of histor-
ical documents: a survey. International Journal of Document Analysis and Recognition
(IJDAR), 9(2):123–138.
Louloudis, G., Gatos, B., Pratikakis, I., and Halatsis, C. (2009). Text line and word seg-
mentation of handwritten documents. Pattern Recogn., 42(12):3169–3183.
Lu, S., Su, B., and Tan, C. L. (2010). Document image binarization using background esti-
mation and stroke edges. International Journal on Document Analysis and Recognition
(IJDAR), 13(4):303–314.
Malleron, V., Eglin, V., Emptoz, H., Dord-Crouslé, S., and Régnier, P. (2009). Text lines and snippets extraction for 19th century handwriting documents layout analysis. In 2009 10th International Conference on Document Analysis and Recognition, pages 1001–1005.
Manmatha, R. and Croft, W. (1997). Word spotting: indexing handwritten archives, chap-
ter 3, pages 43–64. MIT Press.
Manmatha, R. and Rothfeder, J. L. (2005). A scale space approach for automatically seg-
menting words from historical handwritten documents. IEEE Trans. Pattern Anal. Mach.
Intell., 27(8):1212–1225.
Marti, U. V. and Bunke, H. (2001). Using a statistical language model to improve the
performance of an HMM-based cursive handwriting recognition system. Int. Journal of
Pattern Recognition and Artificial Intelligence, 15:65–90.
Nicolas, S., Paquet, T., and Heutte, L. (2006). Complex handwritten page segmentation us-
ing contextual models. In Second International Conference on Document Image Analysis
for Libraries (DIAL’06), pages 12 pp.–59.
Ntirogiannis, K., Gatos, B., and Pratikakis, I. (2013). Performance evaluation methodology
for historical document image binarization. IEEE Transactions on Image Processing,
22(2):595–609.
Ntirogiannis, K., Gatos, B., and Pratikakis, I. (2014a). A combined approach for the bina-
rization of handwritten document images. Pattern Recogn. Lett., 35:3–15.
Ntirogiannis, K., Gatos, B., and Pratikakis, I. (2014b). ICFHR 2014 competition on handwritten document image binarization (H-DIBCO 2014). In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, pages 809–813.
Ntzios, K., Gatos, B., Pratikakis, I., Konidaris, T., and Perantonis, S. J. (2007). An old Greek handwritten OCR system based on an efficient segmentation-free approach. International Journal on Document Analysis and Recognition, 9:179–192.
Pantke, W., Dennhardt, M., Fecker, D., Märgner, V., and Fingscheidt, T. (2014). A historical handwritten Arabic dataset for segmentation-free word spotting – HADARA80P. In 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 15–20. IEEE.
Pastor-Pellicer, J., Garz, A., Ingold, R., and Castro-Bleda, M.-J. (2015). Combining learned script points and combinatorial optimization for text line extraction. In Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing, HIP ’15, pages 71–78, New York, NY, USA. ACM.
Pratikakis, I., Gatos, B., and Ntirogiannis, K. (2010). H-DIBCO 2010 – handwritten document image binarization competition. In Frontiers in Handwriting Recognition (ICFHR), 2010 International Conference on, pages 727–732.
Pratikakis, I., Gatos, B., and Ntirogiannis, K. (2011). ICDAR 2011 document image binarization contest (DIBCO 2011). In 2011 International Conference on Document Analysis and Recognition, pages 1506–1510.
Pratikakis, I., Gatos, B., and Ntirogiannis, K. (2012). ICFHR 2012 competition on handwritten document image binarization (H-DIBCO 2012). In Frontiers in Handwriting Recognition (ICFHR), 2012 International Conference on, pages 817–822.
Pratikakis, I., Gatos, B., and Ntirogiannis, K. (2013). ICDAR 2013 document image binarization contest (DIBCO 2013). In 2013 12th International Conference on Document Analysis and Recognition, pages 1471–1476.
Pratikakis, I., Zagoris, K., Gatos, B., Louloudis, G., and Stamatopoulos, N. (2014). ICFHR
2014 competition on handwritten keyword spotting (H-KWS 2014). In Proceedings of
the 14th International Conference on Frontiers in Handwriting Recognition (ICFHR),
pages 814–819.
Puigcerver, J., Toselli, A., and Vidal, E. (2014). Word-graph-based handwriting keyword
spotting of out-of-vocabulary queries. In Proceedings of the 22nd International Confer-
ence on Pattern Recognition (ICPR), pages 2035–2040.
Puigcerver, J., Toselli, A., and Vidal, E. (2015). ICDAR2015 competition on keyword
spotting for handwritten documents. In Proceedings of the 13th International Conference
on Document Analysis and Recognition (ICDAR), pages 1176–1180.
Rath, T. M. and Manmatha, R. (2003). Word image matching using dynamic time warp-
ing. In Proceedings of the IEEE Computer Society Conference on Computer Vision and
Pattern Recognition (CVPR), volume 2, pages 521–527.
Reese, J., Murdock, M., S., R., and Hamilton, B. (2014). ICFHR2014 competition on
word recognition from historical documents (ANWRESH). In Proceedings of the 14th
International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 803–
808.
Retsinas, G., Louloudis, G., Stamatopoulos, N., and Gatos, B. (2016). Keyword spotting
in handwritten documents using projections of oriented gradients. In Proceedings of
the IAPR International Workshop on Document Analysis Systems (DAS), pages 411–416.
IAPR.
Rohlicek, J., Russell, W., Roukos, S., and Gish, H. (1989). Continuous hidden Markov modeling for speaker-independent word spotting. In Proceedings of the 14th IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 627–630, vol. 1.
Rothacker, L., Rusiñol, M., and Fink, G. A. (2013). Bag-of-features HMMs for
segmentation-free word spotting in handwritten documents. In Proceedings of the 12th
International Conference on Document Analysis and Recognition (ICDAR), pages 1305–
1309.
Saabni, R., Asi, A., and El-Sana, J. (2014). Text line extraction for historical document
images. Pattern Recogn. Lett., 35:23–33.
Saleem, S., Hollaus, F., Diem, M., and Sablatnig, R. (2014). Recognizing glagolitic charac-
ters in degraded historical documents. In Frontiers in Handwriting Recognition (ICFHR),
2014 14th International Conference on, pages 771–776.
Sanchez, J. A., Romero, V., Toselli, A., and Vidal, E. (2014). ICFHR 2014 competition on handwritten text recognition on tranScriptorium datasets (HTRtS). In Proceedings of the 14th International Conference on Frontiers in Handwriting Recognition (ICFHR), pages 785–790.
Sanchez, J. A., Romero, V., Toselli, A., and Vidal, E. (2015). ICDAR 2015 competition HTRtS: Handwritten text recognition on the tranScriptorium dataset. In Proceedings of the 13th International Conference on Document Analysis and Recognition (ICDAR), pages 1166–1170.
Seni, G. and Cohen, E. (1994). External word segmentation of off-line handwritten text lines. Pattern Recognition, 27(1):41–52.
Sfikas, G., Constantinopoulos, C., Likas, A., and Galatsanos, N. P. (2005). An analytic distance metric for Gaussian mixture models with application in image retrieval. In International Conference on Artificial Neural Networks, pages 835–840. Springer.
Sfikas, G., Giotis, A. P., Louloudis, G., and Gatos, B. (2015). Using attributes for word spotting and recognition in polytonic Greek documents. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 686–690. IEEE.
Sfikas, G., Retsinas, G., and Gatos, B. (2016). Zoning aggregated hypercolumns for keyword spotting. In 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), to appear. IEEE.
Shi, Z. and Govindaraju, V. (2004a). Historical document image enhancement using background light intensity normalization. In Pattern Recognition, 2004. ICPR 2004. Proceedings of the 17th International Conference on, volume 1, pages 473–476.
Shi, Z. and Govindaraju, V. (2004b). Line separation for complex document images using
fuzzy runlength. In Proceedings of the First International Workshop on Document Image
Analysis for Libraries (DIAL’04), DIAL ’04, pages 306–, Washington, DC, USA. IEEE
Computer Society.
Shi, Z., Setlur, S., and Govindaraju, V. (2005). Text extraction from gray scale historical
document images using adaptive local connectivity map. In Proceedings of the Eighth
International Conference on Document Analysis and Recognition, ICDAR ’05, pages
794–798, Washington, DC, USA. IEEE Computer Society.
Strauß, T., Grüning, T., Leifert, G., and Labahn, R. (2016). Citlab ARGUS for keyword
search in historical handwritten documents - description of citlab’s system for the image-
clef 2016 handwritten scanned document retrieval task. In Working Notes of CLEF 2016
- Conference and Labs of the Evaluation forum, Évora, Portugal, 5-8 September, 2016.,
pages 399–412.
Su, B., Lu, S., and Tan, C. L. (2010). Binarization of historical document images using the
local maximum and minimum. In Proceedings of the 9th IAPR International Workshop
on Document Analysis Systems, DAS ’10, pages 159–166, New York, NY, USA. ACM.
Su, B., Lu, S., and Tan, C. L. (2013). Robust document image binarization technique for
degraded document images. IEEE Transactions on Image Processing, 22(4):1408–1417.
Tan, C. L., Cao, R., and Shen, P. (2002). Restoration of archival documents using a
wavelet technique. IEEE Transactions on Pattern Analysis and Machine Intelligence,
24(10):1399–1404.
Tang, Y., Peng, L., Xu, Q., Wang, Y., and Furuhata, A. (2016). CNN based transfer learning for historical Chinese character recognition. In Proceedings of the 12th IAPR Workshop on Document Analysis Systems (DAS), pages 25–29.
Thomas, S., Chatelain, C., Heutte, L., Paquet, T., and Kessentini, Y. (2015). A deep hmm
model for multiple keywords spotting in handwritten documents. Pattern Analysis and
Applications, 18(4):1003–1015.
Tonazzini, A., Bedini, L., and Salerno, E. (2004). Independent component analysis for
document restoration. Document Analysis and Recognition, 7(1):17–27.
Tonazzini, A., Salerno, E., and Bedini, L. (2007). Fast correction of bleed-through dis-
tortion in grayscale documents by a blind source separation technique. International
Journal of Document Analysis and Recognition (IJDAR), 10(1):17–25.
Toselli, A. and Vidal, E. (2013). Fast HMM-Filler approach for key word spotting in hand-
written documents. In Proceedings of the 12th International Conference on Document
Analysis and Recognition (ICDAR), pages 501–505.
Toselli, A. H. and Vidal, E. (2015). Handwritten text recognition results on the Bentham collection with improved classical N-gram-HMM methods. In Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing (HIP2015), pages 15–22.
Van Phan, T., Nguyen, K., and Nakagawa, M. (2016). A Nom historical document recog-
nition system for digital archiving. International Journal on Document Analysis and
Recognition, 19:49–64.
Wolf, C. (2010). Document ink bleed-through removal with two hidden markov random
fields and a single observation field. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 32(3):431–447.
Zhang, X. and Tan, C. (2013). Segmentation-free keyword spotting for handwritten docu-
ments based on heat kernel signature. In Proceedings of the 12th International Confer-
ence on Document Analysis and Recognition (ICDAR), pages 827–831.
In: Handwriting: Recognition, Development and Analysis ISBN: 978-1-53611-937-4
Editors: Byron L. D. Bezerra et al.
© 2017 Nova Science Publishers, Inc.
Chapter 4

Wavelet Descriptors for Handwritten Text Recognition in Historical Documents

Leticia M. Seijas and Byron L. D. Bezerra
1. Introduction
The automatic transcription of text in handwritten documents has many applications, from automatic document processing to indexing and document understanding. The automatic transcription of historical handwritten documents is an emerging research field that has only begun to be explored in recent years. For some time in the past decades, interest in Off-line Handwritten Text Recognition (HTR) was diminishing under the assumption that modern computer technologies would soon make paper-based documents useless. However, the increasing number of on-line digital libraries publishing large quantities of digitized legacy papers, and the fact that their transcription into a textual electronic format would provide historians and other researchers with new ways of indexing and easy retrieval, have turned HTR into a major research topic (Sánchez et al., 2014).
HTR for historical documents is a highly complex task, mainly because of the strong variability of writing styles, the different font types and sizes of characters, and underlined and/or crossed-out words. Moreover, this complexity is increased by the typical degradation problems of ancient documents, such as background variability, spots due to humidity, and marks left by ink that bleeds through the paper. For this reason, different methods and techniques from the document analysis and recognition fields are needed.
The technology most commonly used for HTR nowadays is based on a segmentation-free approach, in which the recognition system recognizes all the text elements (sentences, words,
and characters) as a whole, without any prior segmentation of the image into these elements
(Sánchez et al., 2014; Marti and Bunke, 2002; Toselli et al., 2004a; Espana-Boquera et al.,
2011).
The use of segmentation-free (holistic) techniques that tightly integrate an optical character model and a language model has yielded the best performance on standard benchmarks. Although N-gram language models and Gaussian Mixture Hidden Markov Models (HMM-GMM) have been the most traditional and best-understood approaches in recent years, Artificial Neural Networks (ANN) have lately gained considerable popularity in the HTR research community (Toselli and Vidal, 2015; Gouveia et al., 2014; Bezerra et al., 2012). A large amount of research has been done to improve these recognition models and to develop the corresponding training and decoding algorithms (Bluche, 2015).
On the other hand, feature extraction strategies have not been widely explored. In the segmentation-free method, the preprocessed line image is segmented into frames using a sliding window, and features are extracted from each slice. Some of the feature extraction techniques presented in previous works are based on the computation of pixel densities, raw gray levels and their gradients, geometric moment normalization, and Principal Component Analysis (PCA) to reduce and decorrelate the pixel dimensions.
Menasri et al. (2011) presented an efficient word recognition system resulting from the combination of three handwriting recognizers. The main component of this combined system is an HMM-based recognizer that considers dynamic and contextual information for a better modeling of writing units. Neural networks (NN) are also used. Feature extraction is based on the work of Mohamad et al. (2009) and El-Hajj et al. (2005). Following the segmentation-free approach, the windows are divided vertically into a fixed number of cells. Within each window, a set of geometric features is extracted: w features are related to pixel densities within each window's column (w is the width of the extraction window, in pixels); three density features are extracted from the whole frame and from the regions above and below the lower baseline; two features are related to background/foreground transitions between adjacent cells; three features describe the gravity center position, including a derivative feature (the difference between y positions); and twelve other features are related to local pixel configurations that capture stroke concavities (Menasri et al., 2011). A subset of the features is baseline dependent. The final descriptor has 28 components.
Kozielski et al. (2012, 2013) proposed an HMM-based system for off-line handwriting recognition based on successful techniques from the domains of large-vocabulary speech recognition and image object recognition. This work introduces a moment-based scheme for preprocessing and feature extraction. The preprocessing stage includes normalizing the contrast of the gray-scale image and fixing the slant on the text lines segmented from the text pages. Then, frames are extracted with an overlapping sliding window, and the 1st- and 2nd-order moments are calculated for each frame independently. The 1st-order moments represent the center of gravity, which is used to shift the content of the frame to the center of the image. The 2nd-order moments correspond to the weighted standard deviation of the distance between the pixels in the frame and the center of gravity. They are used to compute the scaling factors for size and translation normalization. This way, every frame extracted with the sliding window is normalized using the scaling factors. Then, the gray-scale values of all pixels in a normalized frame are used as features and are further reduced by PCA to 20 components. During normalization, the aspect ratio is not kept, because the vertical and horizontal moments are computed and normalized separately. For this reason, four values related to the original moments are added in order to map the class-specific moment information, which was originally distributed over the whole image, to specific components of the feature vector. The final feature vector has 24 dimensions.
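To make this normalization concrete, the following Python sketch computes the first- and second-order intensity moments of a frame and resamples it accordingly; the output size and scaling factor are illustrative assumptions, not necessarily the values of the original work.

```python
import numpy as np

def moment_normalize(frame, out_h=32, out_w=32, std_factor=4.0):
    """Center and scale a gray-scale frame using its first- and second-order
    intensity moments (a sketch; out_h, out_w, std_factor are illustrative)."""
    ys, xs = np.mgrid[0:frame.shape[0], 0:frame.shape[1]]
    w = frame.astype(float)
    total = w.sum() + 1e-9
    # 1st-order moments: the intensity center of gravity.
    cy, cx = (w * ys).sum() / total, (w * xs).sum() / total
    # 2nd-order moments: weighted standard deviations around the center.
    sy = np.sqrt((w * (ys - cy) ** 2).sum() / total) + 1e-9
    sx = np.sqrt((w * (xs - cx) ** 2).sum() / total) + 1e-9
    # Sample the input on a grid centered at (cy, cx) and scaled separately
    # per axis (the aspect ratio is deliberately not kept).
    gy = cy + (np.arange(out_h) - out_h / 2) / (out_h / 2) * std_factor * sy / 2
    gx = cx + (np.arange(out_w) - out_w / 2) / (out_w / 2) * std_factor * sx / 2
    gy = np.clip(np.round(gy).astype(int), 0, frame.shape[0] - 1)
    gx = np.clip(np.round(gx).astype(int), 0, frame.shape[1] - 1)
    # The four raw moments are returned so they can be appended as features.
    return frame[np.ix_(gy, gx)], (cy, cx, sy, sx)
```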
Toselli and Vidal (2015) presented HTR experiments and results on the historical Bentham text image dataset used in the ICFHR 2014 HTRtS competition, adopting the segmentation-free holistic framework and using traditional modeling approaches based on Hidden Markov optical character models (HMM) and an N-gram language model (LM). Starting from the very basic N-gram-HMM baseline system provided in HTRtS, several improvements were made in LM and HMM modeling, including more accurate HMM training through discriminative training, achieving recognition accuracy similar to some of the best performing (single, uncombined) systems based on (recurrent) Neural Networks, using identical training and testing data. For feature extraction, a narrow sliding window was applied horizontally to the preprocessed line image. For each window position i, 1 ≤ i ≤ n, a 60-dimensional feature vector was obtained by computing three kinds of features: normalized vertical gray levels at 20 evenly distributed vertical positions, and horizontal and vertical gray-level derivatives at each of these vertical positions. The features proposed in Kozielski et al. (2013) were also used.
In the thesis work of Bluche (Bluche, 2015), a study of different aspects of optical models based on Deep Neural Networks in a hybrid Neural Network / HMM scheme was conducted, to better understand and evaluate their relative importance. First, it is shown that Deep Neural Networks produce consistent and significant improvements over networks with one or two hidden layers, independently of the kind of neural network, MLP or RNN, and of the input, handcrafted features or pixels. Despite the dominance of LSTM-RNNs in the recent literature on handwriting recognition, deep MLPs achieve comparable results. This work also evaluated different training criteria, reporting improvements for MLP/HMMs with sequence-discriminative training similar to those observed in speech recognition. The proposed approach was validated by taking part in the HTRtS contest in 2014. For feature extraction, the sliding window framework is applied. Two kinds of features are extracted from each window: handcrafted features and pixel intensities. The first are geometrical and statistical features used in the work of Menasri et al. (2011), resulting in a descriptor of size 56. The “pixel features” are calculated from a downscaled frame with its aspect ratio kept constant, and are then transformed to lie in the interval [0,1]. An 800-dimensional feature vector is obtained for the Bentham set. The results of the final systems presented in this thesis, namely MLPs and RNNs with handcrafted-feature or pixel inputs, are comparable to the state of the art on the Rimes and IAM datasets. The combination of these systems outperformed all published results on the considered databases.
This work proposes a different approach for feature extraction, based on the application of the CDF 9/7 Wavelet Transform to the HTR problem. The wavelet transform has been applied in related areas such as handwritten character recognition (Patel et al., 2012; Seijas and Segura, 2012) and speech recognition (Trivedi et al., 2011; Shao and Chang, 2005). Our approach improves data representation by considerably reducing the feature vector size while retaining the basic structure of the pattern, and provides competitive HTR results. Section “HTR Systems Based on HMM/GMM” gives an overview of
HTR systems based on the HMM-GMM model. In Subsection “The Wavelet Transform”
fundamentals of the Discrete Wavelet Transform are introduced while in Subsection “The
proposed WT-descriptors” the wavelet-based descriptors are proposed. Section “Experi-
ments and Results” reports the experiments and results and finally, the conclusions of the
work are presented in Section “Conclusion”.
the simplest, most basic training approach for HMMs (Toselli and Vidal, 2015). Figure 1 shows a prototype of an HMM with a left-to-right topology having six states.
The decoding problem in Equation 1 can be solved by the Viterbi algorithm (Jelinek,
1999). Figure 2 depicts this decoding process and the models involved. More details can
be found in Romero et al. (2012); Young et al. (2009).
Figure 2. HTR decoding. For a text line image that could include different characters (in
the example a handwritten number “36”), a feature vector sequence is produced. Then, this
sequence is decoded into a word sequence using three knowledge sources: optical character
HMMs, a lexicon (word models) and a language model (Toselli and Vidal, 2015).
without relying on heuristics to find character boundaries, and limiting the risk of under-
segmentation. This category of approaches is the most popular nowadays, receiving a lot
of research interest, and achieving the best performance on standard benchmarks (Bluche,
2015). Algorithm 1 shows the basic steps of our feature extraction proposal.
Subsection ”Experimental Setup” describes this process with values extracted from ex-
periments.
where x̄[n] = x[−n], and h and g are the high- and low-frequency filters, respectively. Equations 3 and 4 correspond to the decomposition stage (see Figure 3). At each level, the high-pass filter generates detail information d_j, while the low-pass filter, associated with the scale function, produces approximations a_j, smoothed representations of the signal. Implementing the FWT requires just O(N) operations for signals of size N, providing a non-redundant representation and allowing lossless reconstruction. The filter bank algorithm with perfect reconstruction can also be applied to the Fast Biorthogonal Wavelet Transform. The wavelet coefficients are computed by successive convolutions with the filters h and g, while for reconstruction the dual filters h̃ and g̃ are used. If the signal consists of N non-zero samples, computing its representation in the biorthogonal wavelet basis requires O(N) operations (Mallat, 1999).
So far we have dealt with the DWT in one dimension. Digital image processing requires a bidimensional WT, computed by applying the one-dimensional FWT first to the rows and then to the columns. Let Ψ(x) be the one-dimensional wavelet associated with the one-dimensional scale function ϕ(x); the scale function in two dimensions is then the separable product Φ(x, y) = ϕ(x)ϕ(y).
Figure 5. MCDF 9/7 with filters for signal decomposition (Seijas and Segura, 2012).
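To make the separable construction concrete, the sketch below (assuming the PyWavelets package, whose 'bior4.4' wavelet implements the CDF 9/7 filter pair) applies the one-dimensional transform first to the rows and then to the columns, as described above; the 64x32 frame size matches the descriptors of the next subsection.

```python
import numpy as np
import pywt  # PyWavelets; its 'bior4.4' wavelet is the CDF 9/7 filter pair

def dwt2_separable(image):
    """One decomposition level of the 2D transform, computed by applying
    the 1D FWT first to the rows, then to the columns of each output."""
    a, d = pywt.dwt(image, 'bior4.4', mode='periodization', axis=1)  # rows
    ll, lh = pywt.dwt(a, 'bior4.4', mode='periodization', axis=0)    # columns
    hl, hh = pywt.dwt(d, 'bior4.4', mode='periodization', axis=0)
    return ll, (lh, hl, hh)

frame = np.random.rand(64, 32)     # a gray-scale frame
ll1, details = dwt2_separable(frame)
print(ll1.shape)                   # (32, 16): the LL1 approximation subband
```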
After several preliminary experiments, we concluded that the detail coefficients did not contribute to improving HTR rates. Therefore, descriptors using only the approximation subbands from levels 1 to 3, with a non-thresholded representation, were finally considered, as they gave the best results. We believe this approach retains the basic structure of the pattern, eliminating details that do not improve the classification. The feature vector is subject to a PCA transformation to reduce the size of the descriptor, using the directions that contain most of the data variance while disregarding those with little information.
The following are the descriptors selected. For a frame of 64x32 gray-scale values, the LL subband at level 1 (LL1, descriptor A1) has 32x16 = 512 wavelet coefficients, reducing the size of the representation to a quarter while retaining the structure of the pattern. LL at level 2 (LL2, descriptor A2) has 16x8 = 128 coefficients, and LL at level 3 (LL3, descriptor A3) has 8x4 = 32 values. With this technique, we achieve a strong reduction in the size of the pattern representation, while the image becomes coarser, retaining basic shapes.
For the experiments, we chose the particular dataset and partitions used in the ICFHR 2014 HTRtS competition (Causer and Wallace, 2012), described in Table 1. Figure 8 shows sample lines extracted from the page images and provided for experimentation. A total of 11,473 lines were used: 10,613 for training and tuning the system and 860 for testing. The training set was divided into two subsets with an equal number of lines to evaluate and select the feature vectors with the best performance. Final percentages were obtained by training the recognition system using the training and validation sets (see Table 1) and testing with the 860 test lines, as suggested in the ICFHR 2014 HTRtS Restricted Track.
Table 1. Bentham dataset used in the ICFHR 2014 HTRtS contest (Causer and Wallace, 2012; Toselli and Vidal, 2015)
Figure 8. Sample lines such as they were provided for experimentation (Toselli and Vidal,
2015).
(we adjusted these values experimentally). The application of the CDF 9/7 transform to each window resulted in a feature vector of 512 coefficients in the case of the approximation subband at level 1 of the WT for the A1 descriptor (see Subsection 3.2). The PCA transformation allowed reducing the feature vector to 24 components, retaining 89.24% of the data variance. In the case of A2, 128 wavelet coefficients were obtained and reduced with PCA to 16, retaining 85.35% of the variance, while for A3, 32 coefficients were obtained and reduced to 16 with PCA, retaining 92.87% of the variance. Reducing the descriptor size has a decisive impact on training time and on the processing of large databases, and sometimes allows an improvement in recognition percentages. For this reason, we considered a compromise between size and variance. Figure 10 depicts the feature extraction process for descriptor A2 on a text line image from the Bentham database.
Figure 10. Feature extraction process for descriptor A2 on a preprocessed text line image
from Bentham database. (1) A frame is extracted with the sliding window; (2) WT is
applied; (3) Size is reduced by PCA and the descriptor is obtained.
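As an illustration of the whole chain of Figure 10, the sketch below (assuming PyWavelets and scikit-learn; the window width and shift are hypothetical values, not the chapter's exact ones) extracts frames with a sliding window, keeps the level-2 approximation subband (descriptor A2), and reduces it to 16 components with PCA fitted on the training frames.

```python
import numpy as np
import pywt
from sklearn.decomposition import PCA

def sliding_frames(line_img, width=32, shift=8):
    # Slide a fixed-width window over the preprocessed text-line image;
    # width and shift are illustrative, not the chapter's exact values.
    h, w = line_img.shape
    return [line_img[:, x:x + width] for x in range(0, max(w - width, 0) + 1, shift)]

def a2_descriptors(frames, n_components=16):
    # Level-2 CDF 9/7 decomposition; coeffs[0] is the LL2 subband
    # (16x8 = 128 coefficients for a 64x32 frame), flattened per frame.
    feats = np.array([pywt.wavedec2(f, 'bior4.4', level=2,
                                    mode='periodization')[0].ravel()
                      for f in frames])
    pca = PCA(n_components=n_components).fit(feats)  # fit on training data
    return pca.transform(feats), pca

line = np.random.rand(64, 640)           # a line image normalized to height 64
X, pca = a2_descriptors(sliding_frames(line))
print(X.shape)                           # (n_frames, 16)
```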
The recognition system used was the basic N-gram/HMM-GMM baseline system provided to the entrants of the ICFHR 2014 HTRtS competition (Causer and Wallace, 2012), implemented with the SRILM (Stolcke, 2002) and HTK (Young et al., 2009) toolkits.
We chose a left-to-right HMM topology for all the characters. Each state has one transition to itself and one to the next state (see the example in Figure 1). The best results were obtained by letting the number of states of the HMM associated with each character vary, instead of setting the same number of states for all HMMs. For instance, it can be observed that some punctuation marks, such as the colon, semicolon, and parentheses, are usually narrower than other characters. Therefore, the number of states defined for the HMMs related to these alphabet elements is lower. As an example, we established a 3-state HMM for the colon and semicolon, a 4-state HMM for parentheses, and a 6-state HMM for the majority of letters. These values were set heuristically, and this is a topic to be investigated further in future work (Toselli and Vidal, 2015; Günter and Bunke, 2004). Finally, we built 88 HMMs (the total number of alphabet elements) with 128 Gaussian densities per HMM state. These models were trained with the embedded Baum-Welch algorithm (Young et al., 2009), using the training and validation line images and their corresponding transcripts.
4.3. Results
For preliminary results, we trained the model through 5 iterations using half of the training and validation line images of the Bentham dataset used in the ICFHR 2014 HTRtS contest (see Table 1), reducing the time of the training process considerably. The Word Error Rate (WER) is adopted to assess HTR performance. WER is defined as the minimum number of words that need to be substituted, deleted or inserted to convert a recognized hypothesis into the corresponding reference transcription, divided by the total number of reference words (Pastor et al., 2014). Table 2 shows error percentages for each descriptor proposed in Section 3. It can be observed that the WER values are similar for the three descriptors. However, the size of the feature vectors is lower for A2 and A3, speeding up training.
Results were improved by using the entire training and validation sets for training and applying 20 iterations of the Baum-Welch algorithm, although training time increased considerably. For descriptor A1, a WER of 26.19% was obtained, while for A2 and A3 the WER values were 26.81% and 26.47%, respectively. The error percentages were similar; however, the training time for A2 and A3 was reduced by almost 50% relative to that of A1. This result is of considerable impact when the learning process lasts several days.
Table 2. WER (%) for each descriptor: (A) preliminary training (half of the training and validation sets, 5 iterations); (B) training on the full sets with 20 iterations

Descriptor   Dimension   WER (%) (A)   WER (%) (B)
A1           24          28.45         26.19
A2           16          28.49         26.81
A3           16          28.44         26.47
Table 3 compares our proposal with the HTR results for the Bentham dataset reported in the literature. The WT descriptors with a baseline recognition system outperform the HTR percentages from published work (Bluche, 2015; Toselli and Vidal, 2015), obtaining the lowest WER with A1. Additionally, the data representation is improved by reducing the feature vector size by more than half in the case of A1, and by more than 70% in the case of A2 and A3. Reducing the descriptor size has a decisive impact on the training and processing times for large databases.
Better results than ours were obtained by enhancing the N-gram/HMM-GMM system (tokenization and training algorithm) (Toselli and Vidal, 2015) using a moment-based descriptor of size 24. We believe that further improvements in data representation and recognition rates can be achieved by using WT descriptors with the enhanced N-gram/HMM-GMM system. We plan to apply these strategies in future work.
Conclusion
In this work, descriptors for handwritten text recognition based on multiresolution features, obtained through the CDF 9/7 Wavelet Transform and Principal Component Analysis, were proposed. The approximation subbands from levels 1 to 3 of the Wavelet Transform were considered for data representation because they retain the basic structure of the pattern while achieving a strong reduction of the descriptor dimension. The feature vector is subject to a PCA transformation to obtain a further reduction in size.
The recognition system is based on a segmentation-free approach that tightly integrates an optical character model and a language model, which has yielded the best performance on standard benchmarks. The traditional N-gram/HMM-GMM model was adopted, implementing a baseline system trained with the embedded Baum-Welch algorithm. Experiments were performed on the challenging Bentham dataset used in the ICFHR 2014 HTRtS contest.
Our proposal outperformed the HTR percentages reported in the literature for N-gram/HMM-GMM baseline systems. Additionally, the data representation was improved by reducing the feature vector size by more than 70%. Reducing the descriptor size has a decisive impact on the training and processing times for large databases. Better results than ours were obtained by enhancing the tokenization and training algorithm of the N-gram/HMM-GMM recognizer; in particular, the discriminative training strategy has yielded good results. We plan to apply these strategies to our system in future work.
Acknowledgments
This work was supported by the National Postdoctoral Program PNPD/CAPES of the government of Brazil during 2015. The authors would like to thank CNPq for supporting the development of this work through the research projects granted by “Edital Universal” (Process 444745/2014-9) and “Bolsa de Produtividade DT” (Process 311912/2015-0). The authors would also like to thank Alejandro Toselli, Moisés Pastor, and Enrique Vidal from the Pattern Recognition and Human Language Technology Research Center at the Universidad Politécnica de Valencia for the advice provided.
References
Bazzi, I., Schwartz, R., and Makhoul, J. (1999). An omnifont open-vocabulary OCR system for English and Arabic. IEEE Transactions on Pattern Analysis and Machine Intelligence, 21(6):495–504.
Bluche, T. (2015). Deep Neural Networks for Large Vocabulary Handwritten Text Recog-
nition. Theses, Université Paris Sud - Paris XI.
Bosch, V., Toselli, A. H., and Vidal, E. (2012). Statistical text line analysis in handwritten
documents. In 2012 International Conference on Frontiers in Handwriting Recognition.
Institute of Electrical and Electronics Engineers (IEEE).
Causer, T. and Wallace, V. (2012). Building a volunteer community: results and findings
from Transcribe Bentham. Digital Humanities Quarterly.
Chen, C.-M., Chen, C.-C., and Chen, C.-C. (2006). A comparison of texture features based
on SVM and SOM.
El-Hajj, R., Likforman-Sulem, L., and Mokbel, C. (2005). Arabic handwriting recognition using baseline dependant features and hidden Markov modeling. In Proceedings of the Eighth International Conference on Document Analysis and Recognition, ICDAR ’05, pages 893–897, Washington, DC, USA. IEEE Computer Society.
Gouveia, F. M., Bezerra, B. L. D., Zanchettin, C., and Meneses, J. R. J. (2014). Handwriting recognition system for mobile accessibility to the visually impaired people. In Systems, Man and Cybernetics (SMC), pages 3918–3981.
Günter, S. and Bunke, H. (2004). HMM-based handwritten word recognition: on the optimization of the number of states, training iterations and Gaussian components. Pattern Recognition, 37(10):2069–2079.
Kozielski, M., Forster, J., and Ney, H. (2012). Moment-based image normalization for
handwritten text recognition. In Proceedings of the 2012 International Conference on
Frontiers in Handwriting Recognition, ICFHR ’12, pages 256–261, Washington, DC,
USA. IEEE Computer Society.
Likforman-Sulem, L., Zahour, A., and Taconet, B. (2006). Text line segmentation of his-
torical documents: a survey. volume 9, pages 123–138. Springer Nature.
Marti, U.-V. and Bunke, H. (2002). Hidden markov models. chapter Using a Statistical
Language Model to Improve the Performance of a HMM-based Cursive Handwriting
Recognition Systems, pages 65–90. World Scientific Publishing Co., Inc., River Edge,
NJ, USA.
Kozielski, M., Doetsch, P., and Ney, H. (2013). Improvements in RWTH's system for off-line handwriting recognition. In 2013 12th International Conference on Document Analysis and Recognition. Institute of Electrical and Electronics Engineers (IEEE).
Pastor, M., Sánchez, J., Toselli, A. H., and Vidal, E. (2014). Handwritten Text Recognition:
Word-Graphs, Keyword Spotting and Computer Assisted Transcription.
Pastor, M., Toselli, A., and Vidal, E. (2004). Projection profile based algorithm for slant removal. Pages 183–190.
Pastor, M., Toselli, A. H., and Vidal, E. (2006). Criteria for handwritten off-line text size normalization.
Patel, D. K., Som, T., Yadav, S. K., and Singh, M. K. (2012). Handwritten character recognition using multiresolution technique and Euclidean distance metric. Journal of Signal and Information Processing, 3(2):208–214.
Romero, V., Toselli, A. H. and Vidal, E. (2012). Multimodal Interactive Handwritten Text
Transcription. In Series in Machine Perception and Artificial Intelligence (MPAI), World
Scientific.
Sánchez, J. A., Bosch, V., Romero, V., Depuydt, K., and de Does, J. (2014). Handwritten text recognition for historical documents in the tranScriptorium project. In Proceedings of the First International Conference on Digital Access to Textual Cultural Heritage, DATeCH ’14, pages 111–117, New York, NY, USA. ACM.
Sánchez, J. A., Mühlberger, G., Gatos, B., Schofield, P., Depuydt, K., Davis, R. M., Vidal,
E., and de Does, J. (2013). tranScriptorium. In Proceedings of the 2013 ACM symposium
on Document engineering - DocEng’13. Association for Computing Machinery (ACM).
Sanchez, J. A., Toselli, A. H., Romero, V., and Vidal, E. (2015). ICDAR 2015 competi-
tion HTRtS: Handwritten text recognition on the tranScriptorium dataset. In 2015 13th
International Conference on Document Analysis and Recognition (ICDAR). Institute of
Electrical and Electronics Engineers (IEEE).
Shao, Y. and Chang, C.-H. (2005). Wavelet transform to hybrid support vector machine and hidden Markov model for speech recognition. In 2005 IEEE International Symposium on Circuits and Systems. Institute of Electrical and Electronics Engineers (IEEE).
Skodras, A., Christopoulos, C., and Ebrahimi, T. (2001). JPEG2000: The upcoming still
image compression standard. Pattern Recognition Letters, 22(12):1337–1345.
Toselli, A. H., Juan, A., González, J., Salvador, I., Vidal, E., Casacuberta, F., Keyers, D., and Ney, H. (2004a). Integrated handwriting recognition and interpretation using finite-state models. International Journal of Pattern Recognition and Artificial Intelligence, 18(04):519–539.
Toselli, A. H. and Vidal, E. (2015). Handwritten text recognition results on the Bentham collection with improved classical N-gram-HMM methods. In Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing, HIP ’15, pages 15–22, New York, NY, USA. ACM.
Trivedi, N., Kumar, V., Singh, S., Ahuja, S., and Chadha, R. (2011). Speech recognition by
wavelet analysis. International Journal of Computer Applications, 15(8):27–32.
Young, S., Evermann, G., Gales, M., Hain, T., and Kershaw, D. (2009). The HTK Book:
Hidden Markov Models Toolkit V3.4. Microsoft Corporation Cambridge Research Lab-
oratory Ltd.
In: Handwriting: Recognition, Development and Analysis ISBN: 978-1-53611-937-4
Editors: Byron L. D. Bezerra et al.
© 2017 Nova Science Publishers, Inc.
Chapter 5

How to Design Deep Neural Networks for Handwriting Recognition

Théodore Bluche, Christopher Kermorvant and Hermann Ney
1. Introduction
We live in a digital world, where information is stored, processed, indexed and searched by computer systems, making its retrieval a cheap and quick task. Handwritten documents are no exception to the rule. The stakes of recognizing handwritten documents, and in particular handwritten texts, are manifold, ranging from automatic cheque or mail processing to archive digitization and document understanding. The regions of the image containing handwritten text must be found and converted into ASCII text, a process known as offline handwriting recognition.
This field has benefited from over sixty years of research. Starting with isolated char-
acters and digits, the focus shifted to the recognition of words. The current strategy is to
recognize lines of text directly, and use a language model to constrain the transcription, and
help retrieve the correct sequence of words. One of the most popular approaches nowadays
consists in scanning the image with a sliding window, from which features are extracted.
The sequences of such observations are modeled with character Hidden Markov Models
(HMMs). Word models are obtained by concatenation of character HMMs. The standard
model of observations in HMMs is Gaussian Mixture Models (GMMs). In the nineties, the
theory to replace Gaussian mixtures and other generative models by discriminative models,
such as Neural Networks (NNs), was developed (Bourlard and Morgan, 1994). Discrimina-
tive models are interesting because of their ability to separate different HMM states, which
improves the capacity of HMMs to differentiate the correct sequence of characters.
A drawback of HMMs is their local modeling, which fails to capture the long-term dependencies in the input sequence that are inherent to the considered signal. Recent improvements in this direction, notably with recurrent neural networks, raise several questions, which this chapter addresses:
• Is it still important to design handcrafted features when using deep neural networks, or are pixel values sufficient?
• Can deep neural networks give rise to big improvements over neural networks with one hidden layer for handwriting recognition?
• How do (deep) Multi-Layer Perceptrons compare to the very popular Recurrent Neural Networks, which are now widespread in handwriting recognition and achieve state-of-the-art performance?
• What are the important characteristics of Recurrent Neural Networks that make them so appropriate for handwriting recognition?
• What are good training strategies for neural networks for handwriting recognition? Can the Connectionist Temporal Classification (CTC, (Graves et al., 2006)) paradigm be applied to other neural networks? What improvements can be observed with a discriminative criterion at the sequence level?
In sections “The Impact of Inputs” through “The Impact of Outputs and Training Method”, we present an experimental evaluation of many aspects of neural network optical models. We discuss the type of inputs in section “The Impact of Inputs”, the network architectures in section “The Impact of Architecture”, and we evaluate training methods and the choice of outputs in section “The Impact of Outputs and Training Method”. In section “Final Results”, we select the best MLPs and RNNs, with feature and pixel inputs, resulting from the conducted study. We evaluate the impact of the linguistic constraints (lexicon and language model) and the combination of these models. We compare the final results to previous publications and report state-of-the-art performance. The last section concludes this chapter by answering the proposed questions.
2. Experimental Setup
2.1. Databases
The Rimes database (Augustin et al., 2006) consists of images of handwritten letters
from simulated French mail. We followed the setup of the ICDAR 2011 competition. The
available data are a training set of 1,500 paragraphs, manually extracted from the images,
and an evaluation set of 100 paragraphs. We held out the last 149 paragraphs (approximately
10%) of the training set as a validation set and trained the systems on the remaining 1,391
paragraphs. Table 1 presents the number of words and characters in the different subsets.
There are 460k characters distributed over more than 10k text lines, and 97 different symbols to be modeled (lowercase and capital letters, accented letters, digits and punctuation marks). The average character length, computed from the line widths, is 37.6 pixels at 300 DPI.
The IAM database (Marti and Bunke, 2002) consists of images of handwritten pages.
They correspond to English texts extracted from the LOB corpus (Johansson, 1980), copied
by different writers. The database is split into 747 images for training, 116 for validation,
and 336 for evaluation. Note that this division is not the one presented in the official publi-
cation or on the website1, but the one found in various publications (Bertolami and Bunke,
1 http://www.iam.unibe.ch/fki/databases/iam-handwriting-database
2008; Graves et al., 2009; Kozielski et al., 2013b). We obtained the subdivision from H.
Bunke, one of the creators of the database. Table 1 presents the number of words and char-
acters in the different subsets. There are almost 290k characters distributed in more than 6k
text lines, and 79 different symbols to be modeled (lowercase and capital letters, digits and
punctuation marks). The average character length, computed from the line widths, is 39.1
pixels at 300 DPI.
The Bentham database contains images of personal notes of the British philosopher
Jeremy Bentham, written by himself and his staff in English around the 18th and 19th centuries. The data were prepared by University College London during the tranScriptorium project2 (Sánchez et al., 2013). We followed the setup of the HTRtS competition (Sánchez
et al., 2014). The training set consists of 350 pages. The validation set comprises 50 im-
ages, and the test set 33 pages. Table 1 presents the number of words and characters in
the different subsets. There are 420k characters distributed in almost 10k text lines, and
93 different symbols to be modeled (lowercase and capital letters, digits and punctuation
marks). The average character length, computed from the line widths, is 32.7 pixels at 300
DPI.
Recurrent Neural Networks (RNNs) are networks with a notion of internal state, evolv-
ing through time, achieved by recurrent connections. In its simplest form, an RNN is an
MLP with recurrent layers. A recurrent layer does not only receive inputs from the previous
layers, but also from itself, as depicted on the left-hand side of Figure 1. The activations a_k^t of such a layer evolve through time with the following recurrence:

a_k^t = ∑_{i=1}^{I} w^in_{ki} x_i^t + ∑_{h=1}^{H} w^rec_{kh} z_h^{t−1}    (4)

where the x_i^t are the inputs and w^in_{ki} the corresponding weights, and the z_h^{t−1} are the layer's outputs at the previous timestep and w^rec_{kh} the corresponding weights.
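A minimal Python sketch of this recurrence (assuming given weight matrices and a tanh activation, which the text does not specify) is:

```python
import numpy as np

def rnn_layer(xs, w_in, w_rec, activation=np.tanh):
    """Forward pass of a simple recurrent layer following Equation 4:
    a_k^t = sum_i w_ki^in x_i^t + sum_h w_kh^rec z_h^(t-1)."""
    H = w_rec.shape[0]
    z = np.zeros(H)                  # z^0: initial state
    outputs = []
    for x in xs:                     # one feature vector per timestep
        a = w_in @ x + w_rec @ z     # activations a^t
        z = activation(a)            # layer outputs z^t, fed back at t+1
        outputs.append(z)
    return np.array(outputs)
```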
Bidirectional RNNs (BRNNs, (Schuster and Paliwal, 1997)) process the sequence in
both directions. In these networks, there are two recurrent layers: a forward layer, which
takes inputs from the previous timestep, and a backward layer, connected to the next
timestep. Both layers are connected to the same input and output layers.
Figure 2. Neurons for RNNs: (left) Simple Neuron (right) LSTM unit.
In the hybrid NN/HMM approach, the network's posterior estimates p(s|x_t) are converted into the scaled likelihoods used as HMM emission scores:

p(x_t|s) = (p(s|x_t) / p(s)) · p(x_t)    (5)

where p(x_t) is constant across states and can be ignored during decoding.
H. Bourlard and his colleagues thoroughly studied the theoretical foundations of hybrid NN/HMM systems in (Bourlard and Wellekens, 1989; Bourlard and Morgan, 1994; Renals et al., 1994; Konig et al., 1996; Hennebert et al., 1997). In particular, they show in Konig et al. (1996) how a discriminant formulation of HMMs (Bourlard and Wellekens, 1989), able to compute p(W|x), leads to a particular MLP architecture predicting local conditional transition probabilities p(q_t|q_{t−1}, x_t), which allows estimating global posterior probabilities.
One may train the neural network as a classifier, with a labeled dataset S = {x^(i), s^(i)}. In the hybrid NN/HMM approach, the x^(i) are frames and the s^(i) are HMM states. The targets may be obtained with a uniform segmentation of the observation sequences, or by alignment of the data using a trained system, e.g. a GMM-HMM. One may re-align the observations with HMM states during the training procedure to refine the targets. The neural network is then plugged into the whole system, and its predictions provide scores for the decoding procedure.
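The following sketch illustrates such framewise training with a generic deep-learning toolkit (PyTorch here; the sizes and the two-layer architecture are illustrative, not those of a specific system in this chapter):

```python
import torch
import torch.nn as nn

# A minimal framewise hybrid training step (sketch): frames x with HMM-state
# targets s, obtained e.g. by forced alignment with a GMM-HMM baseline.
n_features, n_states = 56, 500          # illustrative sizes
mlp = nn.Sequential(
    nn.Linear(n_features, 1024), nn.Sigmoid(),
    nn.Linear(1024, n_states),          # one output per HMM state
)
opt = torch.optim.SGD(mlp.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()       # negative log posterior of the state

x = torch.randn(32, n_features)         # a batch of frames
s = torch.randint(0, n_states, (32,))   # aligned state labels
loss = criterion(mlp(x), s)
opt.zero_grad(); loss.backward(); opt.step()
# At decoding time, p(x_t|s) is replaced by the scaled likelihood
# p(s|x_t) / p(s), with p(s) estimated from state frequencies (Eq. 5).
```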
2.3.3. Connectionist Temporal Classification (CTC)
A difficulty arises when two consecutive labels are the same, for example if we want to predict AAB. This is one of the reasons why a blank symbol, written here as −, is introduced in Graves et al. (2006), corresponding to observing no label. Therefore, in the CTC framework, the network has one output for each label in an alphabet L, plus one blank output, i.e. the output alphabet is L′ = L ∪ {−}. A mapping B : L′^T → L^{≤T} is defined, which removes duplicates, then blanks, in the network prediction. For example: B(AA−B) = B(AAA−BB) = AB.
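A direct Python rendering of the mapping B (with the blank written as “−”) is:

```python
from itertools import groupby

BLANK = '-'

def B(path):
    """CTC collapse: remove duplicates, then blanks (Graves et al., 2006)."""
    deduped = [k for k, _ in groupby(path)]   # merge consecutive repeats
    return ''.join(c for c in deduped if c != BLANK)

assert B('AA-B') == B('AAA-BB') == 'AB'
assert B('A-AB') == 'AAB'   # a blank is needed between repeated labels
```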
Provided that the network outputs at different timesteps are independent given the input, the probability of a path π ∈ L′^T for a given x in terms of the RNN outputs y is

p(π|x) = ∏_t y^t_{π_t}(x)    (8)
and the mapping B allows us to calculate the posterior probability of a label (character) sequence l ∈ L^{≤T} by summing over all possible segmentations:

p(l|x) = ∑_{π ∈ B^{−1}(l)} p(π|x)    (9)

Through Equation 9, we can train the network to maximize the probability of the correct labelling of the unsegmented training data S = {(x, z), z ∈ L^{≤|x|}} by minimizing the following cost function:

E_CTC(S) = − ∑_{(x,z)∈S} ln p(z|x)    (10)

The computation of p(z|x) implies a sum over all paths in B^{−1}(z), each of length T = |x|, which is expensive. Graves et al. (2006) propose to use the forward-backward algorithm in a graph representing all possible labelling alternatives.
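In practice, modern toolkits provide this forward-backward computation directly; the sketch below (PyTorch, with illustrative sizes) shows CTC training of a bidirectional recurrent network on one unsegmented sequence:

```python
import torch
import torch.nn as nn

T, n_labels = 100, 80                    # timesteps, alphabet size (no blank)
rnn = nn.LSTM(input_size=56, hidden_size=128, bidirectional=True)
proj = nn.Linear(256, n_labels + 1)      # +1 for the blank output (index 0)
ctc = nn.CTCLoss(blank=0)

x = torch.randn(T, 1, 56)                # one sequence of feature vectors
h, _ = rnn(x)
log_probs = proj(h).log_softmax(dim=-1)  # shape (T, batch, n_labels+1)
target = torch.randint(1, n_labels + 1, (1, 12))   # character indices
loss = ctc(log_probs, target, torch.tensor([T]), torch.tensor([12]))
loss.backward()   # gradients computed via the forward-backward recursions
```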
2.3.4. Sequence-Discriminative
Sequence-discriminative training optimizes criteria to increase the likelihood of the correct
word sequence, while decreasing the likelihood of other sequences. This kind of training is
similar to the discriminative training of GMM-HMMs with the Maximum Mutual Informa-
tion (MMI) or the Minimum Phone Error (MPE) criteria.
The MMI criterion is defined as follows:

E_MMI(S) = ∑_{(x,W_r)∈S} log [ p(W_r|x) / ∑_W p(W|x) ]    (11)
The Minimum Phone Error (MPE, (Povey, 2004)) class of criteria has the following formulation:

E_MBR(S) = ∑_{(x,W_r)∈S} [ ∑_W p(W|x) A(W, W_r) / ∑_{W′} p(W′|x) ]    (12)

where A(W, W_r) measures the accuracy of the hypothesis W against the reference W_r. The sum over all possible word sequences is difficult to compute in practice. Instead, recognition lattices are extracted with the optical and language models, and only the word sequences in these lattices are considered in the cost function.
Sequence-discriminative training is popular in hybrid NN/HMM speech recognition. As already mentioned, (Bengio et al., 1992; Haffner, 1993) applied the MMI criterion to the global training of an NN/HMM system. In the past few years, these training methods have aroused much interest with the advent of deep neural networks (Kingsbury, 2009; Sainath et al., 2013; Veselý et al., 2013; Su et al., 2013). Usually, a neural network is first trained with a framewise criterion. Then, lattices are generated, and the network is further trained with MMI, MPE, or sMBR. Regenerating the lattices during training may be helpful, but the gains are limited beyond the first epoch (Veselý et al., 2013). In speech recognition, experiments with sequence-discriminative training yielded relative WER improvements in the range of 5–15%.
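As a toy illustration of the MMI criterion of Equation 11 (with made-up scores standing in for the lattice posteriors):

```python
import math

# Three candidate word sequences in a tiny "lattice", with hypothetical
# (unnormalized) model scores; only lattice hypotheses enter the sum.
candidates = {'the cat': 0.5, 'the hat': 0.3, 'he cat': 0.2}
reference = 'the cat'

num = candidates[reference]       # score of the correct sequence W_r
den = sum(candidates.values())    # sum over competing sequences W
e_mmi = math.log(num / den)
print(e_mmi)  # maximizing this pushes probability mass toward the reference
```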
2.4. Evaluation
We carried out a thorough evaluation of different aspects of neural networks for handwriting
recognition, with a particular focus on deep neural networks. We tried, as much as possible,
to compare shallow and deep networks on the one hand, and feature and pixel inputs on the
other hand. We evaluated several aspects and design choices for neural networks, including
inputs, output space, training method, depth and architecture. All our experiments were
conducted on the three databases (Rimes, IAM and Bentham).
Unless stated otherwise (see section “The Impact of Outputs and Training Method”), the MLPs are trained with the bootstrapping method, to classify each frame into one of the HMM states. The target states are obtained by forced alignment of the training set with the baseline GMMs. The minimized cost is the one defined in Equation 6. The performance of the MLPs alone is evaluated with the Frame Error Rate (FER%), defined as the ratio of incorrectly classified frames over the total number of frames.
The RNNs are trained with the CTC method to directly predict character sequences, by
minimizing the cost function defined in Equation 10. The performance of the RNNs alone
is evaluated with the Character Error Rate (RNN-CER%), defined as the edit distance
between the ground-truth and predicted character sequences, normalized by the number of
characters in the ground-truth.
Keeping in mind that the networks will be used in a complete pipeline, we also measured the Character (CER%) and Word Error Rates (WER%), defined similarly to the RNN-CER%, after the integration of the language models.
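All three error rates reduce to an edit-distance computation; a minimal Python version is sketched below (the CER uses characters as tokens, the WER uses words):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # one more ref token
                                   d[j - 1] + 1,      # one more hyp token
                                   prev + (r != h))   # substitution / match
    return d[-1]

def wer(ref, hyp):
    """Word Error Rate: edits normalized by the reference length (%)."""
    ref, hyp = ref.split(), hyp.split()
    return 100.0 * edit_distance(ref, hyp) / len(ref)

print(wer('the cat sat', 'the hat sat'))  # 33.3...: one substitution in three
```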
We explore different aspects of the neural networks: their structure, their parameters,
and their training procedure. We present results of the whole recognition systems. Most of
the components of these systems, excluding the neural networks, are kept fixed throughout
the experiments, unless stated otherwise. This section is dedicated to the presentation of the
fixed components: text line image preprocessing, feature extraction, modeling with Hidden
Markov Models (HMMs), and language models.
Figure 3 shows an overview of the recognition system and of its components. The ones
with thick, dashed lines are the fixed ones, presented in this section. In this section, we will
also present a baseline optical model – a Gaussian Mixture Model.
from Roeder (2009). The results reported in Table 2 show that the latter method is better, for two different sliding window sizes.
We also tried to normalize the height of images, either to a fixed value of 72px or with
region-dependent methods (Toselli et al., 2004; Pesch et al., 2012), with fixed height for
each region (ascenders, core, descenders – 22, 33 and 17px, or 24px for each). The regions
are found after deskew and deslant with the algorithm of Vinciarelli and Luettin (2001). We
selected the normalization of each region to 24px, based on the results (WER) of Table 3.
• 3 pixel density measures: in the whole window, and in the regions above and below
the baseline,
• pixel densities in each column of pixels (w_f values, where w_f is the width of the sliding window),
• 2 measures of the center of gravity: relative vertical positions with respect to the
baseline and to the center of gravity in the previous window,
All these features form a (25 + w_f)-dimensional vector, to which deltas are appended, resulting in feature vectors of dimension 56 (w_f = 3px). The parameters of the sliding window (width and shift) for the handcrafted features were tuned using the same method as for the preprocessing. The best parameters were a shift and width of 3px each (no overlap between windows).
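A sketch of a few of these handcrafted features is given below; the helper is illustrative, and the remaining components of the (25 + w_f)-dimensional base vector are built analogously.

```python
import numpy as np

def window_features(win, baseline_row, prev_cog=None):
    """Compute a subset of the handcrafted features described above
    (w_f = 3 columns per window); illustrative, not exhaustive."""
    w = win.astype(float)
    feats = [w.mean(),                        # density, whole window
             w[:baseline_row].mean(),         # density above the baseline
             w[baseline_row:].mean()]         # density below the baseline
    feats += list(w.mean(axis=0))             # one density per column (w_f values)
    ys = np.arange(w.shape[0])
    cog = (w.sum(axis=1) * ys).sum() / (w.sum() + 1e-9)
    feats.append(cog - baseline_row)          # CoG relative to the baseline
    feats.append(0.0 if prev_cog is None else cog - prev_cog)  # CoG delta
    return np.array(feats), cog

# Deltas (finite differences between successive windows) are then appended
# to the base vector, doubling its size: 2 * (25 + w_f) = 56 for w_f = 3.
```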
We estimated the LMs with the n-gram counts from the corpus. The hyphenated word chunks are added to the unigrams with count 1. We generated 4-grams with Kneser-Ney discounting (Kneser and Ney, 1995). Table 4 presents the perplexities of the different n-grams. They are better without hyphenation, but we found that the hyphenated version gave better recognition results.
Table 4. Perplexities of Bentham LMs with different ngram orders and hyphenation,
on the validation set.

Hyphenation   Size     OOV%   1gram   2gram   3gram   4gram
No             7,318   7.1    348.7   129.4   101.7    96.7
Yes           32,692   5.6    656.1   137.6   108.4   103.1
We evaluated the system by comparing the recognition outputs to the ground-truth tran-
scriptions. We decoded with the tools implemented in the Kaldi speech recognition toolkit
(Povey et al., 2011), which consists of a beam search in a Finite-State Transducer (FST),
with a token passing algorithm. The FST is the composition of the representation of each
component (HMM, vocabulary and Language Model) as FSTs (H, L, G). The HMM and
vocabulary conversion into FST is straightforward. The LM generated by SRILM is trans-
formed by Kaldi. The final graph computation not only involves composition, but also other
FST operations. The method proposed in Kaldi is a variation of the technique explained in
Mohri et al. (2002):
F = min(rm(det(H ◦ min(det(L ◦ G))))) (13)
where ◦ denotes FST composition, and min, det and rm are respectively FST minimization,
determinization, and removal of some ε-transitions and potential disambiguation symbols.
Refer to Mohri et al. (2002); Povey et al. (2011) for more details concerning the FST cre-
ation.
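As a rough illustration of Equation 13, the graph construction could be sketched with the OpenFst Python bindings (pywrapfst) as follows. Kaldi's actual recipe additionally handles disambiguation symbols and HMM self-loops, which this simplified sketch omits; it assumes H, L and G are already built as FSTs:

    import pywrapfst as fst

    def build_decoding_graph(H, L, G):
        # F = min(rm(det(H o min(det(L o G)))))   -- Equation 13
        L.arcsort("olabel")                  # composition requires sorted arcs
        LG = fst.determinize(fst.compose(L, G))
        LG.minimize()
        H.arcsort("olabel")
        HLG = fst.determinize(fst.compose(H, LG))
        HLG.rmepsilon()                      # remove epsilon transitions
        HLG.minimize()
        return HLG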
The system of Kozielski et al. (2013b) uses an open-vocabulary approach, able to recognize any word (no OOV).
4.1. Types
In the introduction, we presented two kinds of inputs for the neural networks: handcrafted
features and pixel values. In many pattern recognition problems, the advent of deep neural
networks allowed handcrafted features to be replaced by the raw signal: the relevant features
are learnt by the recognition system. Using raw inputs has some advantages, such as reliev-
ing the designer of the system from implementing feature extraction methods.
Figure 4. Comparison of pixels and handcrafted features as inputs to shallow and deep
MLPs and RNNs.
In Figure 4, we plot the WER% of complete hybrid systems, in which the optical model
is an MLP or an RNN, with one or several hidden layers. We compare the performance with
handcrafted features and pixel values. Although the raw pixel values always seem a little
worse than the handcrafted features, we observe a big reduction of the performance gap
when using deep neural networks. This is especially striking for recurrent neural networks,
which give a high WER% with pixels and only one hidden layer. We will see later in
this chapter that when using better training methods, such as the sequence-discriminative
training and dropout, the gap almost disappears.
4.2. Context
4.2.1. Context through Frame Concatenation
The inputs of the neural networks are sequences of feature vectors, extracted with a slid-
ing window. This window is usually relatively small, often smaller than a character. For
handcrafted features, increasing the size of the window would result in a probable loss of
information, as more pixels will be summarized in the same number of features. A com-
mon alternative approach to providing more context to the neural networks is to concatenate
successive frames.
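A minimal sketch of this frame concatenation, in Python with NumPy (the padding-by-repetition at the sequence edges is an assumption of this illustration):

    import numpy as np

    def add_context(frames, n):
        # frames: (T, d) array of feature vectors; returns (T, (2n+1)*d),
        # where each frame is concatenated with its n left/right neighbors.
        T = frames.shape[0]
        padded = np.pad(frames, ((n, n), (0, 0)), mode="edge")
        return np.concatenate([padded[i:i + T] for i in range(2 * n + 1)],
                              axis=1)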
We report the results of that experiment in Figure 5. On the left-hand side, for MLPs, we
observe that the Frame Error Rate (FER) decreases when more context is provided (from
around 50% without context to less than 30% with a lot of context). The improvements are
not as big in the complete system because the HMM and language model help to correct
many mistakes made by the MLP. Yet we observed up to 20% relative WER% improvement
by choosing the right amount of context. On the other hand, we notice a performance drop
when explicitly adding context in the inputs of RNNs (Figure 5b). Although surprising, it
should be noted that concatenating frames increases the dimension of the input space. While
MLPs classify frames independently and need more context to compensate for small frames,
RNNs process the whole sequence, and can learn to propagate context from adjacent frames,
as we will see in the next section.
For pixel inputs, the sliding windows are overlapping, and the dimension of the feature
vectors is D = wh, where w is the width of the window, and h its height. Therefore, we can
reshape the feature vector to get the shape of the frame, and a sensitivity map with the same
shape as the image. In overlapping regions, the magnitudes of the derivatives for consecutive
windows are summed:

S_{i,j} = ∑_{k=−w/2δ}^{w/2δ} ∂y_τ / ∂x_{i+δk, i+w(j−1)−δk}     (15)
where δ = 3px is the step size of the sliding window. This way, we can see the sensitivity
in the image space.
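Such sensitivity values can be obtained by backpropagation. Here is a minimal PyTorch-style sketch; the model interface and the choice of the top prediction at τ are assumptions of the illustration:

    import torch

    def input_sensitivity(model, x, tau):
        # Magnitude of the derivative of the prediction at time tau with
        # respect to every input value of the sequence x of shape (T, d).
        x = x.clone().requires_grad_(True)
        y = model(x)              # (T, n_outputs) sequence of predictions
        y[tau].max().backward()   # derivative of the top score at t = tau
        return x.grad.abs()       # same shape as the input sequence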
In Figure 6, we display the results for BLSTM-RNNs with 7 hidden layers of 200
units. On each plot, we show on top the preprocessed image, the position τ and the RNN
prediction yτ , as well as the sliding window at this position to put the sensitivity map in the
perspective of the area covered by the window at t = τ.
The step size δ of all sliding windows is 3px, i.e. the size of the sliding window for
features displayed on the top plots. We observe that the input sensitivity goes beyond ±5
frames, as well as beyond the character boundaries in some cases, as if the whole word
could help to disambiguate the characters. It is also an indication that RNNs actually use
their ability to model arbitrarily long dependencies, an ability that MLPs lack.
5.1. Depth
The first experiment consists of adding hidden layers and measuring the effect on the perfor-
mance of the neural network, outside of the complete pipeline. We measured the FER% for
MLPs and the RNN-CER% for RNNs (trained with CTC). The MLPs have 1,024 neurons
in each hidden layer. The RNNs have 200 units in each layer. For RNNs, we actually add
a BLSTM layer, i.e. 200 units in each scanning direction, plus a linear layer with 200 units
to merge the information coming from both directions.
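For concreteness, such a stack could be written as follows in PyTorch. This is an illustrative sketch following the sizes quoted above, not the authors' original implementation:

    import torch.nn as nn

    class DeepBLSTM(nn.Module):
        # Stack of BLSTM layers (200 units per direction), each followed by
        # a 200-unit linear layer merging the two scanning directions.
        def __init__(self, n_inputs, n_outputs, depth=3, width=200):
            super().__init__()
            layers, size = [], n_inputs
            for _ in range(depth):
                layers.append(nn.LSTM(size, width, bidirectional=True,
                                      batch_first=True))
                layers.append(nn.Linear(2 * width, width))
                size = width
            self.layers = nn.ModuleList(layers)
            self.output = nn.Linear(width, n_outputs)

        def forward(self, x):                 # x: (batch, T, n_inputs)
            for layer in self.layers:
                if isinstance(layer, nn.LSTM):
                    x, _ = layer(x)
                else:
                    x = layer(x)
            return self.output(x)             # per-frame scores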
Figure 7. Effect of increasing the number of hidden layers in the performance of MLPs and
RNNs alone.
The results are displayed in Figure 7, for MLPs and RNNs, on all three databases, with
pixel and handcrafted features as inputs. For MLPs, we notice relative FER improvements
up to 20% going from one to several hidden layers. The biggest improvement is observed
from one to two hidden layers, but we still get better results with more layers. Overall, four
or five hidden layers look like a good choice to get optimal FER: the improvements beyond
that number are relatively small.
For RNNs, we observe that almost every time we add layers, the performance of the
RNN is increased. For handcrafted features, adding a second LSTM layer and a feed-
forward one brings around a relative 20-25% CER improvement. Adding a third one yields
another 6-12% relative improvement. For pixels, one hidden layer alone is not sufficient,
and adding a second LSTM layer divides the error rates by more than a factor of two. A third
LSTM layer is also significantly better, bringing another 20-25% relative CER improvement.
In Figure 8, we show the WER% results when shallow (1-hidden-layer) and deep net-
works are included in the complete pipeline. The obtained improvements are not as impres-
sive as those of the networks evaluated alone, but remain significant, especially with pixels
as inputs. This is particularly striking for RNNs, which yield high error rates when there
is only one hidden layer, but which achieve similar results as the feature-based RNNs with
several hidden layers.
Figure 8. Comparison of recognition results (WER%) of the full pipeline with shallow (one
hidden layer) and deep neural networks.
It should be noted that adding hidden layers increases the total number of free param-
eters, hence the global capacity of the network. We may also control the number of free
parameters by varying the number of neurons in each layer. In Figure 9, we show the num-
ber of parameters and error rates when changing the depth on the one hand, and the number
of neurons on the other hand. We see that for a fixed number of parameters, deeper networks
tend to perform better.
From these experiments, we can conclude that depth, and not only the number of free
parameters, plays an important role in the reduction of error rates.

Figure 9. Comparison of performance of neural networks when the number of free param-
eters is adjusted by varying the depth and the number of neurons in hidden layers.
5.2. Recurrence
In this second set of experiments, we measure the impact of recurrence in the neural net-
works. The results are summarized in Figure 10. As one can notice, similar error rates are
achieved by the two kinds of optical models and the two kinds of inputs, making it hard to
draw a definite conclusion about the best choices. Yet, since pixel values yield performance
similar to that of handcrafted features, the need to design and implement features vanishes,
and one may simply use the pixels directly. Moreover, although RNNs are found in all the
best published systems for handwritten text line recognition, they are not the only option,
and MLPs should not be neglected.
Figure 10. Comparison of MLPs and RNNs, for both kinds of inputs, in the complete
pipeline.
Table 7. Effect of recurrence on the character error rate of the RNN alone
(RNN-CER%). Each row is a three-hidden-layer stack, where F denotes a
feed-forward layer and R a recurrent (LSTM) one.

              Features          Pixels
Layers        Rimes    IAM      Rimes    IAM
FFF           44.0     39.6     38.0     32.8
RFF           13.2     13.7     62.2     61.3
FRF           12.3     13.7     20.6     19.2
FFR           13.0     12.5     17.5     17.5
RRF           11.6     23.1     20.8     20.3
RFR           11.6     11.8     23.0     19.6
FRR           11.6     12.0     15.3     17.5
RRR            9.7     11.4     16.7     18.9
When dropout is only applied at one position (top plots of Figure 12):
• it is generally better in the lower layers of the RNNs than in the top LSTMs,
except when it is placed after the LSTM;
• it is almost always better before the LSTM layer than inside or after it, and better
after than inside, except for the bottom layer.
When it is applied in all layers (bottom, middle and top; bottom plot of Figure 12):
• among all positions relative to the LSTM, placing dropout after every LSTM was
the worst choice in all six configurations;
• before the LSTMs seems to be the best choice for Rimes and Bentham, while inside
the LSTMs is better for IAM.
In the complete pipeline, with language models, we studied the results with dropout at
different positions relative to the LSTM layers (Figure 13). We observe that for Rimes, the
best results are achieved with dropout after LSTMs, despite the superior performance of
dropout before for the RNN alone. For IAM, dropout inside LSTMs is only slightly better
for features. With pixel inputs, placing dropout before LSTMs seems to be a good choice.
The main difference between the RNN alone and the complete system is that the former
only considers the best character hypothesis at each timestep, whereas the latter potentially
considers all predictions in the search for the best transcription with lexical constraints.
Therefore, applying dropout after the LSTM in the top layer(s) might be beneficial
for the beam search in the decoding with complete systems. Indeed, dropout after the
last LSTMs forces the classification to rely on more units. Conversely, a given LSTM
unit will contribute to the prediction of more labels. On the other hand, the values of
neighboring pixels are highly correlated. If the model can always access one pixel, it might
be sufficient to infer the values of neighboring ones, and the weights will be used to model
more complicated correlations. With dropout on the inputs, the local correlations are less
visible. With half the pixels missing, the model cannot rely on regularities in the input
signal and should model them to make the most of each pixel. As a result, we decided to
apply dropout before the lower LSTMs and after the topmost LSTM, which consistently
improved the recognition results (rightmost bars in the plots of Figure 13).
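The retained placement can be sketched as follows in PyTorch (layer sizes and the single shared dropout module are illustrative assumptions):

    import torch.nn as nn

    class DropoutBLSTM(nn.Module):
        # Dropout *before* each of the lower LSTM layers and *after* the
        # topmost one, as retained in the experiments above.
        def __init__(self, n_in, n_hid, n_layers, n_out, p=0.5):
            super().__init__()
            self.lstms = nn.ModuleList()
            size = n_in
            for _ in range(n_layers):
                self.lstms.append(nn.LSTM(size, n_hid, bidirectional=True,
                                          batch_first=True))
                size = 2 * n_hid
            self.drop = nn.Dropout(p)
            self.out = nn.Linear(size, n_out)

        def forward(self, x):
            for i, lstm in enumerate(self.lstms):
                if i < len(self.lstms) - 1:
                    x = self.drop(x)   # before the lower LSTMs
                x, _ = lstm(x)
            x = self.drop(x)           # after the topmost LSTM
            return self.out(x)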
Figure 12. Effect of dropout at different positions on RNN-CER%. The relative improve-
ment is represented by the color intensity. The top-left plot shows the result at different
positions in each configuration. The top-right plot is the average. The bottom plot contains
the results when dropout is applied to all layers.
Figure 13. Effect of dropout at different positions in the complete pipeline (WER%).
Figure 15. Effect of sMBR training. The cross-entropy corresponds to framewise training,
as opposed to sMBR, which is a sequence-discriminative criterion.
E_FwdBwd(S) = − ∑_{(x,z)∈S} log ∑_{s↦z} ∏_t [ p(s_t | x_t) / p(s_t) ] p(s_t | s_{t−1})     (18)
We notice that the main difference between framewise and CTC training is the summation
over alternative labelings, which we also find in the forward-backward criterion. On the other
hand, the main difference between CTC and forward-backward is the absence of transition
and prior probabilities in the former. Hence, CTC is quite similar to HMM training without
transition or prior probabilities, and with only one HMM state per character, and a “blank”
state shared by all character models. It raises the question of whether it is interesting to (i)
have this “blank” model, (ii) consider alternative labelings, and (iii) have only one state (or
output of the network) per character.
In this section, we compare the results of framewise and CTC training of neural net-
works. Note that in the literature, the comparison of framewise and CTC training is carried
out with the standard HMM topology with several states and no blank for framewise train-
ing, and with the CTC topology for CTC training (Graves et al., 2006; Morillot et al., 2013).
Maas et al. (2014) compare CTC-trained deep neural networks with and without recurrence,
using the topology defined by the CTC framework, and report considerably better results
with recurrence, which we confirm in these experiments. Here, we go one step further,
comparing framewise and CTC training using the same topology in each case, and observ-
ing the effect of both the training procedure and the output topology, for MLPs and RNNs.
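In PyTorch terms, the two training criteria compared here correspond to the following losses; the shapes and the blank index are illustrative assumptions:

    import torch
    import torch.nn as nn

    # Framewise training: every frame has a fixed target state obtained by
    # forced alignment, and a standard cross-entropy is minimized.
    framewise_loss = nn.CrossEntropyLoss()

    # CTC training: only the character sequence is given; the loss sums over
    # all alignments and uses a special "blank" output (index 0 here).
    ctc = nn.CTCLoss(blank=0)

    T, N, C, S = 100, 4, 80, 12            # frames, batch, outputs, characters
    log_probs = torch.randn(T, N, C).log_softmax(2)
    targets = torch.randint(1, C, (N, S))  # character indices (no blank)
    loss = ctc(log_probs, targets,
               torch.full((N,), T, dtype=torch.long),   # input lengths
               torch.full((N,), S, dtype=torch.long))   # target lengths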
For each topology (1 to 7 states, with and without blank), we trained MLPs and RNNs.
In Figure 16, we plot the WERs of MLPs (left) and RNNs (right), without blank in solid
lines, and with blank in dashed ones, and using framewise training (circles) and CTC (or
summation over alternatives; squares). We observe that systems with blanks are better with
a few states and worse with many states. The summation over alternative labelings does not
seem to have a significant impact. Moreover, all curves but one have a similar shape: the
error decreases when the number of states increases, and starts increasing when there are
too many states. This increase appears sooner when we add a blank model.
Figure 16. Comparison of WER% with CTC and framewise training, with and without
blank ((a), top: MLPs; (b), bottom: RNNs).
The only different case concerns the RNN with CTC training and blank symbol. The
best CER is achieved with one state per character. The CTC framework, including the
single state per character, blank symbol and forward-backward training is especially suited
to RNNs. Moreover, CTC training without blank and with less than 5 states per character
converged to a poor local optimum, for both neural networks, and most of the predictions
were whitespaces. The training algorithm did not manage to find a reasonable alignment,
and the resulting WERs / CERs were above 90%. To obtain the presented results, we had
to initialize the networks with one epoch of framewise training. This problem did not occur
when a blank model was added, suggesting that this symbol plays a role in the success of
the alignment procedure in early stages of CTC training.
7. Final Results
In the previous sections, we have carried out an evaluation of many aspects of the considered
neural networks. This involved training many different neural networks and comparing the
results. In particular, we have measured the impact of several modeling and design choices,
leading to significant improvements. Among these networks, we selected, for each kind of
input, the MLP and the RNN achieving the best results. They are summarized in Table 8.
In this section, we evaluate their performance when we optimize the complete recognition
pipeline, and compare them with published results.
We performed the decoding with different levels of linguistic constraints. The simplest
one is to recognize sequences of characters. In the next level, the lexicon is added, so
that output sequences of characters form sequences of valid words. Finally, the language
model is added, to promote likely sequences of words. The results are reported in Table 9.
For MLPs, when the only constraint is to recognize characters, i.e. valid sequences of
HMM states, the results are not good. The WERs are high, partly because when training
the models, the recognition of a whitespace between words was optional. Therefore, the
missing whitespaces in the predictions induce a high number of word merges in the output,
i.e. a large number of deletions and substitutions. When a vocabulary is added, the error
rates are roughly divided by two. Another reduction by a factor two is achieved when a
language model is present. These results show the importance of the linguistic constraints
to correct the numerous errors of the MLP/HMM system.
For RNNs, we notice that the differences between no constraints and lexicon with LM
are not as dramatic as for MLPs. The WERs are only multiplied by 2 to 2.5 when we
remove the constraints, whereas they were roughly multiplied by 5 for MLPs. As mentioned pre-
viously, a lot of context is used by the network through the recurrent connections, which
seems to enable the network to predict characters with some knowledge about the words.
Yet, both the lexicon and the language model bring significant improvements, and remain
very important to achieve state-of-the-art results. The fact that the RNNs produce reason-
ably good transcriptions by themselves should make them more suited to open-vocabulary
scenarios (e.g. the approaches of (Kozielski et al., 2013b; Messina and Kermorvant, 2014)),
where the language model is either at the character level, or a hybrid between a word and a
character language model.
The final results, comparing different models and input features, and comparing our
proposed systems with other published results, are reported in Tables 10 (Rimes) and 11
(IAM). The error rates are reported on both the validation and evaluation sets. The conclu-
sions of the previous sections about the small differences in performance between MLPs
and RNNs and between features and pixels are still applicable to the evaluation set results.
The systems based on neural networks outperform the GMM-HMM baseline systems:
the relative improvement is about 30%. Moreover, on Rimes, we see that all of our single
systems achieve state of the art performance, competing with the systems of Pham et al.
(2014), which uses the same language model with an MDLSTM-RNN with dropout, trained
directly on the image, and of Doetsch et al. (2014), a hybrid BLSTM-RNN. On IAM, it is
worth noting that the decoders of Kozielski et al. (2013a) and Doetsch et al. (2014) include
an open-vocabulary language model which can potentially recognize any word, whereas the
error of our systems is bound to be higher than the OOV rate of 3.7%. For Kozielski et al.
(2013a), the second result in Table 11 corresponds to the closed vocabulary decoding with
the same system as the first one. Unfortunately, the results on the evaluation set are not
reported with this setup, but from the validation set errors, we may consider that our single
systems achieve similar performance as the best closed-vocabulary systems of Pham et al.
(2014) and Kozielski et al. (2013a).
Table 10. Final results on Rimes: WER% and CER% on the validation (Dev.) and
evaluation (Eval.) sets.

                                    Dev.          Eval.
System                          WER%  CER%    WER%  CER%
GMM-HMM (features)              17.2  5.9     15.8  6.0
MLP (features)                  12.5  3.4     12.7  3.7
MLP (pixels)                    12.6  3.8     12.4  3.9
RNN (features)                  12.8  3.8     12.6  3.9
RNN (pixels)                    12.7  4.0     13.8  4.6
ROVER combination               11.3  3.5     11.3  3.7
Lattice combination             11.2  3.3     11.2  3.5
(Pham et al., 2014)             -     -       12.3  3.3
(Doetsch et al., 2014)          -     -       12.9  4.3
(Messina and Kermorvant, 2014)  -     -       13.3  -
(Kozielski et al., 2013a)       -     -       13.7  4.6
(Messina and Kermorvant, 2014)  -     -       14.6  -
(Menasri et al., 2012)          -     -       15.2  7.2

Table 11. Final results on IAM: WER% and CER% on the validation (Dev.) and
evaluation (Eval.) sets.

                                    Dev.          Eval.
System                          WER%  CER%    WER%  CER%
GMM-HMM (features)              15.2  6.3     19.6  9.0
MLP (features)                  10.9  3.7     13.3  5.4
MLP (pixels)                    11.4  3.9     13.8  5.6
RNN (features)                  11.2  3.8     13.2  5.0
RNN (pixels)                    11.8  4.0     14.4  5.7
ROVER combination                9.6  3.6     11.2  4.7
Lattice combination              9.6  3.3     10.9  4.4
(Doetsch et al., 2014)           8.4  2.5     12.2  4.7
(Kozielski et al., 2013a)        9.5  2.7     13.3  5.1
(Pham et al., 2014)             11.2  3.7     13.6  5.1
(Kozielski et al., 2013a)       11.9  3.2     -     -
(Messina and Kermorvant, 2014)  -     -       19.1  -
(Espana-Boquera et al., 2011)   19.0  -       22.4  9.8

For each database, we have selected four systems, two MLPs and two RNNs, with fea-
ture and pixel inputs. We have seen that their performance was comparable. However, the
differences between these systems probably lead to different errors. Thus, we combined
their outputs, with two methods: ROVER (Fiscus, 1997), which combines the transcription
outputs, and a lattice combination technique (Xu et al., 2011), which extracts the final tran-
script from the combination of lattice outputs. For both methods, we started by computing
the decoding lattices, obtained with the decoder implemented in Kaldi. As one can see in
Tables 10 and 11, both combination methods clearly outperform the best published WERs
on Rimes and IAM, even those obtained with open-vocabulary systems.
Conclusion
In this chapter, we focused on the problem of offline handwritten text recognition, consisting
of transforming images of cursive text into their digital transcription. More specifically, we
concentrated on images of text lines, and we adopted the popular sliding window approach:
a sequence of feature vectors is extracted from the image, processed by an optical model,
and the resulting sequence is modeled by Hidden Markov Models and linguistic knowledge
(a vocabulary and a language model) to obtain the final transcription. In the interest of
gaining a deeper knowledge or understanding of these models, we have carried out thorough
experiments with deep neural network optical models for hybrid NN/HMM handwriting
recognition. We focused on two popular architectures: Multi-Layer Perceptrons, and Long
Short-Term Memory Recurrent Neural Networks. We investigated many aspects of those
models: the type of inputs, the output model, the training procedure, and the architecture
of the networks. We answered the following questions regarding neural network optical
models.
−→ Is it still important to design handcrafted features when using deep neural net-
works, or are pixel values sufficient?
Although we have seen that shallow networks tend to be much better when fed with
handcrafted features, we showed that the discrepancy between the performance of the sys-
tems with handcrafted-feature and pixel inputs is largely decreased with deep neural net-
works. This supports the idea that an automatic extraction of learnt features happens in the
lower layers of the network. Neural networks with pixel inputs require more hidden layers,
but finally achieve performance similar to that of networks operating on handcrafted features.
Designing and implementing good feature extraction may therefore no longer be necessary.
−→ Can deep neural networks give rise to big improvements over neural networks
with one hidden layer for handwriting recognition?
We have trained two kinds of neural networks, namely Multi-Layer Perceptrons and
Recurrent Neural Networks, and we have evaluated the influence of the number of hidden
layers on the performance of the system. We trained neural networks of different depths,
and we have shown that deep neural networks achieve significantly better results than neu-
ral networks with a single hidden layer. With deep neural networks, we recorded relative
improvements of error rates in the range 5-10% for MLPs and 10-15% for RNNs. When
the inputs of the network are pixels, the improvement can be much larger.
−→ What are the important characteristics of Recurrent Neural Networks, which make
them so appropriate for handwriting recognition?
We have seen that explicitly including context in the observation sequences did not
improve the results, as it does for MLPs, and that RNNs could effectively learn the depen-
dencies in the input sequences, and the context necessary to make character predictions.
We have shown that the recurrence was especially useful in the top layers of RNNs, at
least in the CTC framework. We have also shown that RNNs can take advantage of the
CTC framework, which defines an objective function at the sequence level for training, but
also the output classes of the network. These are directly characters, plus a special non-
character symbol, allowing the network to produce transcriptions on its own, without
relying on an HMM or any other elaborate model.
−→ How do (deep) Multi-Layer Perceptrons compare to the very popular Recurrent Neu-
ral Networks, which are now widespread in handwriting recognition and achieve state-of-
the-art performance?
We have shown that deep MLPs can achieve performance similar to RNNs, and that
both kinds of model give results comparable to the state of the art on Rimes and IAM. We
conclude that despite the dominance of RNNs in the literature of handwriting recognition,
MLPs, and possibly other kinds of models, can be a good alternative, and therefore should
not be put aside. However, we have also shown that MLPs are more sensitive to the number
of states in HMM models, and to the amount of input context provided. The RNNs, with
CTC training, model sequences of characters directly, and are much easier to train, coping
with the input sequence and the length estimation automatically.
−→ What are the good training strategies for Neural Networks for handwriting recog-
nition? Can the Connectionist Temporal Classification paradigm be applied to other Neu-
ral Networks? What improvements can be observed with a discriminative criterion at the
sequence level?
The optimized cost is an important feature of the training procedure of models with
machine learning algorithms and it may affect the quality of the system. The most common
approach to training neural networks for hybrid NN/HMM systems consists in first aligning
the frames to HMM states with a bootstrapping system, and then training the network on the
obtained labeled dataset with a framewise classification cost function, such as the cross-
entropy. This strategy amounts to considering the segmentation of the input sequence into
HMM states as fixed, and to having the network predict it. A softer approach, similar to the
Baum-Welch training algorithm, would consist in summing over all possible segmentations
of the input sequences yielding the same final transcription. We have seen that in general,
this approach produces only small improvements.
The CTC framework is such a training procedure but also defines the outputs of the
neural network to correspond to the set of characters, and a special non-character output
(blank label). We have shown that RNNs can achieve good results with the CTC criterion.
MLPs can be trained with CTC but do not benefit from it.
We have studied the effects of applying a discriminative training criterion at the se-
quence level, namely state-level Minimum Bayes Risk (sMBR). We have shown that fine-
tuning the MLPs with sMBR yields significant improvements of 5 to 13% in WER,
which is consistent with the speech recognition literature. Moreover, we investigated a new
regularization technique, dropout, in RNNs, extending the work of (Pham et al., 2014;
Zaremba et al., 2014). We reported significant improvements over the method presented in
Pham et al. (2014) when dropout is applied before LSTM layers rather than after them.
Finally, all our models achieved error rates comparable to the state-of-the-art on Rimes
and IAM, independently of the type of inputs (handcrafted features or pixels), and of the
kind of neural network (MLP or RNN). The lattice combination of our systems, with the
method of Xu et al. (2011), outperformed the best published systems for all three databases,
showing the complementarity of the developed models.
References
Augustin, E., Carré, M., Grosicki, E., Brodin, J.-M., Geoffrois, E., and Preteux, F. (2006).
RIMES evaluation campaign for handwritten mail processing. In Proceedings of the
Bengio, Y., De Mori, R., Flammia, G., and Kompe, R. (1992). Global optimization of a
neural network-hidden Markov model hybrid. Neural Networks, IEEE Transactions on,
3(2):252–259.
Bertolami, R. and Bunke, H. (2008). Hidden Markov Model Based Ensemble Methods for
Offline Handwritten Text Line Recognition. Pattern Recognition, 41(11):3452 – 3460.
Bianne, A.-L., Menasri, F., Al-Hajj, R., Mokbel, C., Kermorvant, C., and Likforman-Sulem,
L. (2011). Dynamic and Contextual Information in HMM modeling for Handwriting
Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 33(10):2066 –
2080.
Bloomberg, D. S., Kopec, G. E., and Lakshmi Dasari (1995). Measuring document image
skew and orientation. Proc. SPIE Document Recognition II, 2422(302):302–316.
Bluche, T., Louradour, J., Knibbe, M., Moysset, B., Benzeghiba, M. F., and Kermorvant, C.
(2014). The A2iA Arabic Handwritten Text Recognition System at the Open HaRT2013
Evaluation. In 11th IAPR International Workshop on Document Analysis Systems (DAS),
pages 161–165. IEEE.
Bourlard, H. and Wellekens, C. J. (1989). Links between Markov models and multi-
layer perceptrons. Pattern Analysis and Machine Intelligence, IEEE Transactions on,
12(12):1167–1178.
Buse, R., Liu, Z. Q., and Caelli, T. (1997). A structural and relational approach to handwrit-
ten word recognition. IEEE Transactions on Systems, Man and Cybernetics, 27(5):847–
61.
Doetsch, P., Kozielski, M., and Ney, H. (2014). Fast and robust training of recurrent neural
networks for offline handwriting recognition. In 14th International Conference on Frontiers
in Handwriting Recognition (ICFHR2014).
El-Hajj, R., Likforman-Sulem, L., and Mokbel, C. (2005). Arabic handwriting recognition
using baseline dependant features and hidden markov modeling. In Document Analysis
and Recognition, 2005. Proceedings. Eighth International Conference on, pages 893–
897. IEEE.
Fiscus, J. G. (1997). A post-processing system to yield reduced word error rates: Recog-
nizer output voting error reduction (ROVER). In IEEE Workshop on Automatic Speech
Recognition and Understanding (ASRU1997), pages 347–354. IEEE.
Gers, F. (2001). Long Short-Term Memory in Recurrent Neural Networks. Ph.D. thesis
no. 2366, École Polytechnique Fédérale de Lausanne (EPFL).
Graves, A., Fernández, S., Gomez, F., and Schmidhuber, J. (2006). Connectionist temporal
classification: labelling unsegmented sequence data with recurrent neural networks. In
International Conference on Machine learning, pages 369–376.
Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., and Schmidhuber, J.
(2009). A novel connectionist system for unconstrained handwriting recognition. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–68.
Graves, A., Mohamed, A.-R., and Hinton, G. (2013). Speech Recognition with Deep Re-
current Neural Networks. In IEEE International Conference on Acoustics, Speech and
Signal Processing (ICASSP2013).
Hennebert, J., Ris, C., Bourlard, H., Renals, S., and Morgan, N. (1997). Estimation of
global posteriors and forward-backward training of hybrid HMM/ANN systems.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. R.
(2012). Improving neural networks by preventing co-adaptation of feature detectors.
arXiv preprint arXiv:1207.0580.
Holmes, J., Vine, B., and Johnson, G. (1998). Guide to the Wellington Corpus of Spoken
New Zealand English.
Johansson, S. (1980). The LOB corpus of British English texts: presentation and comments.
ALLC journal, 1(1):25–36.
Kneser, R. and Ney, H. (1995). Improved backing-off for m-gram language modeling. In
Acoustics, Speech, and Signal Processing, 1995. ICASSP-95., 1995 International Con-
ference on, volume 1, pages 181–184. IEEE.
Konig, Y., Bourlard, H., and Morgan, N. (1996). Remap: Recursive estimation and maxi-
mization of a posteriori probabilities-application to transition-based connectionist speech
recognition. Advances in Neural Information Processing Systems, pages 388–394.
Kozielski, M., Doetsch, P., Hamdani, M., and Ney, H. (2014). Multilingual Off-line Hand-
writing Recognition in Real-world Images.
Kozielski, M., Doetsch, P., Ney, H., et al. (2013a). Improvements in RWTH’s System for
Off-Line Handwriting Recognition. In Document Analysis and Recognition (ICDAR),
2013 12th International Conference on, pages 935–939. IEEE.
Kozielski, M., Rybach, D., Hahn, S., Schluter, R., and Ney, H. (2013b). Open vocabulary
handwriting recognition using combined word-level and character-level language mod-
els. In 38th IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP2013), pages 8257–8261. IEEE.
Maas, A. L., Hannun, A. Y., Jurafsky, D., and Ng, A. Y. (2014). First-Pass Large Vocabulary
Continuous Speech Recognition using Bi-Directional Recurrent DNNs. arXiv preprint
arXiv:1408.2873.
Marti, U.-V. and Bunke, H. (2002). The IAM-database: an English sentence database
for offline handwriting recognition. International Journal on Document Analysis and
Recognition, 5(1):39–46.
Menasri, F., Louradour, J., Bianne-Bernard, A.-L., and Kermorvant, C. (2012). The A2iA
French handwriting recognition system at the Rimes-ICDAR2011 competition. In Doc-
ument Recognition and Retrieval Conference, volume 8297.
Messina, R. and Kermorvant, C. (2014). Surgenerative Finite State Transducer n-gram for
Out-Of-Vocabulary Word Recognition. In 11th IAPR Workshop on Document Analysis
Systems (DAS2014), pages 212–216.
Mohri, M., Pereira, F., and Riley, M. (2002). Weighted finite-state transducers in speech
recognition. Computer Speech & Language, 16(1):69–88.
Morillot, O., Likforman-Sulem, L., and Grosicki, E. (2013). Comparative study of HMM
and BLSTM segmentation-free approaches for the recognition of handwritten text-lines.
In Document Analysis and Recognition (ICDAR), 2013 12th International Conference
on, pages 783–787. IEEE.
Moysset, B., Bluche, T., Knibbe, M., Benzeghiba, M. F., Messina, R., Louradour, J., and
Kermorvant, C. (2014). The A2iA Multi-lingual Text Recognition System at the sec-
ond Maurdor Evaluation. In 14th International Conference on Frontiers in Handwriting
Recognition (ICFHR2014), pages 297–302.
Otsu, N. (1979). A Threshold Selection Method from Grey-Level Histograms. IEEE Trans-
actions on Systems, Man and Cybernetics, 9(1):62–66.
Pesch, H., Hamdani, M., Forster, J., and Ney, H. (2012). Analysis of Preprocessing Tech-
niques for Latin Handwriting Recognition. ICFHR, 12:18–20.
Pham, V., Bluche, T., Kermorvant, C., and Louradour, J. (2014). Dropout improves recur-
rent neural networks for handwriting recognition. In 14th International Conference on
Frontiers in Handwriting Recognition (ICFHR2014), pages 285–290.
Povey, D. (2004). Discriminative training for large vocabulary speech recognition. Ph.D.
thesis, Cambridge University.
Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M.,
Motlicek, P., Qian, Y., Schwarz, P., et al. (2011). The Kaldi speech recognition toolkit.
In Workshop on Automatic Speech Recognition and Understanding (ASRU2011), pages
1–4.
Renals, S., Morgan, N., Bourlard, H., Cohen, M., and Franco, H. (1994). Connectionist
probability estimators in HMM speech recognition. Speech and Audio Processing, IEEE
Transactions on, 2(1):161–174.
Roeder, P. (2009). Adapting the RWTH-OCR handwriting recognition system to French
handwriting. Master's thesis, Human Language Technology and Pattern Recognition
Group, RWTH Aachen University, Aachen, Germany.
Rosenblatt, F. (1958). The perceptron: a probabilistic model for information storage and
organization in the brain. Psychological review, 65(6):386.
Sainath, T. N., Mohamed, A.-r., Kingsbury, B., and Ramabhadran, B. (2013). Deep convolu-
tional neural networks for LVCSR. In 38th IEEE International Conference on Acoustics,
Speech and Signal Processing (ICASSP2013), pages 8614–8618. IEEE.
Sánchez, J. A., Mühlberger, G., Gatos, B., Schofield, P., Depuydt, K., Davis, R. M., Vidal,
E., and de Does, J. (2013). tranScriptorium: a European project on handwritten text
recognition. In Proceedings of the 2013 ACM Symposium on Document Engineering,
pages 227–228. ACM.
Sánchez, J. A., Romero, V., Toselli, A., and Vidal, E. (2014). ICFHR 2014 HTRtS: Hand-
written Text Recognition on tranScriptorium Datasets. In International Conference on
Frontiers in Handwriting Recognition (ICFHR).
Su, H., Li, G., Yu, D., and Seide, F. (2013). Error back propagation for sequence
training of Context-Dependent Deep Networks for conversational speech transcription.
In 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
(ICASSP2013), pages 6664–6668.
Toselli, A. H., Juan, A., González, J., Salvador, I., Vidal, E., Casacuberta, F., Keysers, D.,
and Ney, H. (2004). Integrated handwriting recognition and interpretation using finite-
state models. International Journal of Pattern Recognition and Artificial Intelligence,
18(04):519–539.
Toselli, A. H., Romero, V., Pastor, M., and Vidal, E. (2010). Multimodal interactive tran-
scription of text images. Pattern Recognition, 43(5):1814–1825.
Veselý, K., Ghoshal, A., Burget, L., and Povey, D. (2013). Sequence-discriminative train-
ing of deep neural networks. In 14th Annual Conference of the International Speech
Communication Association (INTERSPEECH2013), pages 2345–2349.
Vinciarelli, A. and Luettin, J. (2001). A new normalisation technique for cursive handwrit-
ten words. Pattern Recognition Letters, 22:1043–1050.
Xu, H., Povey, D., Mangu, L., and Zhu, J. (2011). Minimum Bayes Risk decoding and sys-
tem combination based on a recursion for edit distance. Computer Speech & Language,
25(4):802–828.
Yan, Y., Fanty, M., and Cole, R. (1997). Speech recognition using neural networks with
forward-backward probability generated targets. In Acoustics, Speech, and Signal Pro-
cessing, IEEE International Conference on, volume 4, pages 3241–3241. IEEE Com-
puter Society.
Zaremba, W., Sutskever, I., and Vinyals, O. (2014). Recurrent neural network regulariza-
tion. arXiv preprint arXiv:1409.2329.
In: Handwriting: Recognition, Development and Analysis ISBN: 978-1-53611-937-4
Editors: Byron L. D. Bezerra et al.
© 2017 Nova Science Publishers, Inc.
Chapter 6
1. Introduction
The use of databases is of fundamental importance for pattern recognition processes,
supervised training and computing in general (Duda et al., 2012). Problems related to
cursive handwriting and optical character recognition are no different. Over the years, the
scientific community has put great effort into developing such datasets (Yalniz and
Manmatha, 2011; Lazzara and Géraud, 2014; Padmanabhan et al., 2009; Fischer et al.,
2012, 2011a; Marti and Bunke, 1999; Shahab et al., 2010). The construction process,
however, is not very automated, and involves a great effort by the researcher (Marti and
Bunke, 1999). Usually, there are two approaches to building datasets, distinguished by
how the documents are captured: natural datasets and artificial datasets.
Natural datasets are the ones built from the scanning and processing of actual real docu-
ments. These datasets normally pose a more realistic challenge but are more difficult to process
due to the variety of the nature of the documents. Below we discuss some existing natural
datasets.
On the other hand, artificial datasets are built using forms or applications with the
assistance of third parties, so that the texts and elements of the datasets are not derived from
actual real documents. In this category we have datasets like IamDB (Marti and Bunke,
1999), Rimes (Menasri et al., 2012) and CVL (Kleber et al., 2013).
In this chapter, we review and analyze the most used handwritten and printed text
datasets developed during the last few decades. We not only discuss the dataset format, but
also its structure, category, statistics and how they are used in the literature. The datasets
are extremely important for machine learning algorithms and in a review of the most ad-
vances in the field some researchers suggest a provocative explanation: perhaps many major
machine learning breakthroughs have actually been constrained by the availability of high-
quality training datasets, and not by algorithmic advances1 . Considering this importance
to machine learning and consequently to handwriting recognition advances, we discuss the
complexity of building datasets and present two techniques to generate datasets for both
handwritten and printed texts.
2. Dataset Review
There is a large number of handwritten and printed text datasets that have been used for
different purposes over the last years. These datasets could be categorized by different
aspects, for example, data acquisition type (online or offline), text type (handwritten or
printed), size, format, tasks supported, language and others. This section presents a detailed
discussion and review about the handwritten and printed text datasets developed in each of
those aspects.
MNIST Database. The MNIST database of handwritten digits, proposed by LeCun et al.
(2012), has a training set of 60,000 examples and a test set of 10,000 examples
(http://yann.lecun.com/exdb/mnist/). It is a subset of a larger set available from NIST. The
digits have been size-normalized and centered in a fixed-size image.
IAM Handwriting Database. The IAM Handwriting Database (Marti and Bunke, 1999)
was published in 1999 and is formed by English text forms. The dataset is mainly used for
handwritten recognition, writer identification and verification tasks. 657 writers contributed
with samples of their writing, resulting in 1,539 pages of scanned text, 5,685 sentences,
13,353 text lines and 115,320 words. Each form was scanned at a resolution of 300 dpi and
the images were stored in EPS format using 256 gray levels, each one linked to an XML file
containing meta information (e.g. the labels).
IAM On-Line Handwriting Database. The IAM On-Line Handwriting Database (IAM-
OnDB) (Liwicki and Bunke, 2005b) was published at the ICDAR 2005 and contains forms
of handwritten English sentences acquired on a whiteboard. The dataset is stored in XML
files containing the writer id, gender, native language and the transcription. It contains
86,272 words in 13,049 text lines written by 221 writers. The transcriptions are available
in TIFF format, and the database has been used for recognition purposes (Liwicki and
Bunke, 2005a, 2006) and writer identification (Schlapbach et al., 2008).
IAM Online Document Database. The IAM Online Document Database (Indermühle
et al., 2010) contains 941 online handwritten and printed documents (diagrams, drawings,
tables, lists, text blocks, formulas and markings) acquired with a digital pen. The dataset
provides the collected metadata in XML format and has been used for different tasks, such
as handwritten text recognition and document layout analysis. The IAM Online Document
Database consists of approximately 70,000 words and more than 7,500 text lines.
Saint Gall Database. The Saint Gall Database (Fischer et al., 2011a) is a historical
dataset from a single writer using ink on parchment. Written in Latin in the 9th century,
the dataset includes 60 pages corresponding to 1,410 text lines, 11,597 words, 4,890 word
labels and 49 unique letters. The original manuscript from which the database was scanned
is housed at the Abbey Library of Saint Gall, Switzerland. The manuscript was scanned at
300 dpi in JPEG format and was pre-processed using binarization and normalization oper-
ations. The Saint Gall Database has been employed for text line and word segmentation
as well as handwritten recognition (Fischer et al., 2010a).
Parzival Database. The Parzival Database (Fischer et al., 2009) is a historical dataset
published in 2009, containing a 13th century manuscript in medieval German produced
by three writers. The original manuscript was written using ink on parchment and,
like the Saint Gall Database, is housed at the Abbey Library of Saint Gall, Switzerland. The
dataset has 47 pages including 4,477 lines, 23,478 words and 93 unique letters. It is provided
as page images (JPEG, 300 dpi) after binarization and normalization operations. The
Parzival Database has been used for text line segmentation (Fischer et al., 2012, 2011b) and
word recognition (Fischer et al., 2009, 2010b).
Washington Database. The Washington Database (Rath and Manmatha, 2007a) was cre-
ated from the George Washington Papers at the Library of Congress. The dataset contains
18th century English words alongside their transcriptions. It includes 20 pages with 656
text lines, 4,894 words and 82 unique letters. All images were binarized and normalized.
This dataset is mainly used for word-level recognition (Fischer et al., 2012; Frinken et al.,
2012) and keyword spotting (Fischer et al., 2013; Rath and Manmatha, 2007b).
2.3. Bentham
The Bentham dataset (Gatos et al., 2014) is a large set of scanned documents written in the
18th century by the philosopher Jeremy Bentham. The dataset was built using a crowd-
sourcing web platform where volunteers help transcribe the documents. There are more
than 6,000 documents and it’s provided in two parts: the images and the ground-truth. The
latter has information about the layout and the transcription for each line of the documents.
This dataset was used for a competition in ICFHR 2014 for handwritten text recognition.
2.4. RIMES
The RIMES dataset was designed with a focus on the recognition of handwritten letters
sent by customers to companies via postal mail. To build the database, 1,300 people partic-
ipated, each writing up to 5 mails. Each mail contains two to three pages, resulting in 12,723
pages and a total of 5,605 mails. The RIMES database has been used for several compe-
titions (ICFHR 2008, ICDAR 2009, ICDAR 2011) and is used for different tasks such as
handwritten recognition and mail classification (Kermorvant and Louradour, 2010).
2.5. Maurdor
The Maurdor database (Brunessaux et al., 2014) was published in 2013 by the French
National Metrology and Testing Laboratory with both handwritten and printed text. The
dataset contains a total of 2,500 documents in English, French and Arabic and in differ-
ent types (forms, business documents, letters, correspondences and newspaper articles).
This database has been used for a competition organized by the Laboratoire National de
Métrologie et d'Essais (LNE), with tasks in zone area segmentation, writing type
identification, optical character recognition and logical structure extraction.
2.6. PRImA
The PRImA database (Antonacopoulos et al., 2009) is a printed-text dataset containing re-
alistic documents with a variety of layouts, types, structures and fonts. The database was
built by scanning magazines, technology publications and technical articles resulting in
305 ground-truthed images. In addition to the images, the dataset provides searchable
document-level metadata and a web interface for navigation. Since the PRImA dataset
has documents with a variety of layouts, it was initially used for layout analysis tasks, but
it has been used for optical character recognition (Diem et al., 2011) as well.
3. Handwriting Synthesis
Cursive handwriting synthesis aims at generating either a handwritten text image or
pen-path information describing online handwriting trajectories. Either output (image or
pen-path information) is characterized by trying to look as much as possible like the
handwriting of real people. One of the main motivations for developing these techniques
is to help expand already-existing datasets for handwriting recognition, overcoming the
difficulty and time it takes to create one from scratch (Elarian et al., 2015; Marti and Bunke,
1999).
Handwriting synthesis is an old research area, with works dating from 1996 (Guyon,
1996), and the most recent advances in handwriting recognition models, such as Graves
(2012) and Toselli and Vidal (2015), along with the use of deep learning techniques (Pham
et al., 2014), have increased the necessity of expanding the existing datasets, thereby
leading to a renewed interest in handwriting synthesis (Ahmad and Fink, 2015; Dinges et al.,
2015; Chen et al., 2015). These works and their contributions to the field are examined
below, classified into three main areas: symbols and connections, statistical and
machine learning.
ten text recognition based on Hidden Markov Models (HMMs). The final system evaluation
was carried out for different numbers of generated word samples: 1 to 12. Obtained results
are reported in Table 1.
3.2. Statistical
Some models of synthesis to generate artificial text are based on statistical knowledge about
how the actual text is produced. To collect such statistical information, it is more common
to use datasets based on online handwriting, as in Martín-Albo et al. (2014) and Plamondon
et al. (2014).
In “Training of On-line Handwriting Text Recognizers with Synthetic Text Generated
Using the Kinematic Theory of Rapid Human Movements”, Martín-Albo et al. (2014)
describe a synthesis model based on human kinematic motion. In this paper, the rationale
behind the proposed approach is that the response to a given gesture is modeled by two
log-normal distributions: one modeling the agonist acting in the same direction as the
gesture, and another modeling the antagonist acting in the opposite direction. For complex
movements, such as writing, the movement can be modeled by a vector sum of log-normals.
The synthesis model consists of three phases: first, the signal parameters are extracted
from some online data; second, noise generation is carried out upon this data; and third,
the speed is adjusted and a new sequence of coordinate pairs (x, y) is generated. For the
experiments, the single handwriting word dataset “Unipen-ICROW-03” was employed,
together with an on-line handwritten text recognizer based on left-to-right HMMs with a
variable number of states and 8 Gaussians per state. In this case, for each word and author
available in the dataset, s new synthetic words were generated. The final results are
summarized in Table 2, which shows the classification error rate for each combination of
the number of author-specific samples and the number of synthetically generated samples (s).
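To make the log-normal idea concrete, here is a minimal NumPy sketch of a single-stroke velocity profile under the Kinematic Theory; the parameter values are illustrative assumptions, not fitted to any data:

    import numpy as np

    def lognormal_velocity(t, D, t0, mu, sigma):
        # Velocity magnitude of one stroke: a lognormal impulse response
        # scaled by the commanded distance D, starting at time t0.
        v = np.zeros_like(t)
        m = t > t0
        x = t[m] - t0
        v[m] = (D / (sigma * x * np.sqrt(2 * np.pi))
                * np.exp(-(np.log(x) - mu) ** 2 / (2 * sigma ** 2)))
        return v

    t = np.linspace(0.0, 1.5, 300)
    # A complex movement (e.g., writing) is modeled as a sum of lognormals:
    profile = (lognormal_velocity(t, 5.0, 0.05, -1.6, 0.30)
               + lognormal_velocity(t, 3.0, 0.35, -1.8, 0.25))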
The results reported by Martín-Albo et al. are very promising, both for the technical
simplicity and for the ease of use of the approach. However, this approach depends strongly
on the dataset vocabulary that serves as the basis for training and, therefore, it is not very
effective in cases where an extension of the vocabulary is necessary.
As can be seen, the use of typefaces for handwriting synthesis to build a classifier for a
real scenario is very interesting, mainly because of the ease of generating such a dataset.
However, a more careful analysis is required of how system performance would behave
when using the Latin or Cyrillic alphabets, where the differences between handwritten and
printed letters are more considerable.
(Portuguese) using an already known alphabet learned from a different language (English).
[Figure: synthesized Portuguese sentences — (a) “O velho que é forte perdura”; (b) “Das cinzas um fogo a de vir”.]
4.1. Pre-Processing
In this step, the input images are prepared for processing. The main goal of the pre-
processing is to enhance the visual quality of the images (by removing or reducing noise,
for instance) and to ease the manipulation of the datasets. Some OCR systems, for example,
only deal with the TIFF format; others only accept or work better with binarized images.
It’s very important to point out that this step modifies the image and the created dataset
has to be formed by the original image. For this reason, after the pre-processing step, there
will be two versions of each image: the original, without any modifications and the pre-
processed, which is the enhanced image.
Most OCR systems automatically do various image processing operations, but depend-
ing on the input images, there will inevitably be cases where the results aren’t good, directly
impacting the quality of the system. The most common image processing operations are:
Format conversion. Some OCR systems only accept specific image formats. Transym
OCR, for instance, can read bitmap (.bmp) and TIFF (.tiff) images. If you wish to process
any other formats, you need to convert them to the input formats. This is what the format
conversion operation is responsible for.
Rescaling. Rescaling refers to changing the size of an image in terms of both resolution
and dots per inch (dpi). Regarding text size, for example, Tesseract suggests that accuracy
drops off below 10pt × 300dpi. Rescaling is also important when the input images are on a
higher scale than necessary and you want to speed up the processing. Therefore, to achieve
better results, you probably need to rescale the input images. To ensure image quality,
we suggest the use of algorithms based on bicubic interpolation (Bourke, 2001).
of black) and locally (Niblack, Sauvola, White and Bernsen) algorithms. For a general view
of all these methods and much more, we suggest reading the survey by Sezgin and Sankur
(Sezgin et al., 2004). Although on average local binarization methods perform slightly
better than global ones, there is a large performance variance. In some cases, some
global methods perform very well and some local ones are close to the worst options
(Stathis et al., 2008). The classical Sauvola algorithm is a very stable method for
general-purpose documents (Sauvola and Pietikäinen, 2000).
Noise removal. Noise usually originates from the image acquisition process and often
results in unrealistic intensity of pixels that can make the text of an image more difficult to
read. There are several techniques for this purpose. Low-pass, high-pass, band-pass spatial
filtering, mean filtering and median filtering are a few examples. For a general view of all
these methods, we suggest this survey (Chakraborty and Blumenstein, 2016).
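A minimal sketch of such a pre-processing chain, using OpenCV in Python (the scale factor, filter size and the choice of Otsu's method are illustrative assumptions):

    import cv2

    def preprocess(path, scale=2.0):
        # Grayscale load, bicubic rescaling, median-filter noise removal
        # and Otsu binarization; the original image is kept untouched.
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        img = cv2.resize(img, None, fx=scale, fy=scale,
                         interpolation=cv2.INTER_CUBIC)
        img = cv2.medianBlur(img, 3)
        _, img = cv2.threshold(img, 0, 255,
                               cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        return img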
4.2. Processing
After the pre-processing, there are two versions of each image: the original and the pro-
cessed. In the processing step, the processed images are presented to each OCR system
in order to recognize the words in the image.
This step is also responsible for parsing the OCR engines' output. If you plan to test different settings, you can optionally store the output in a structured text format (for example, XML) containing, for each image, the image path and the OCR output. Storing the OCR output is important to save time by avoiding rerunning the engines. The output usually contains the word or character positions, the text orientation, the text itself and the confidence. It is also important to store this information for the next step.
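A minimal sketch of such storage, using Python's standard xml.etree module; the element names and the results structure are illustrative, not a standard OCR schema.

import xml.etree.ElementTree as ET

# One entry per recognized word, as parsed from some engine's raw output.
results = [
    {"image": "forms/0001.tiff", "word": "house", "conf": 0.93,
     "bbox": "120,45,210,80", "orientation": "0"},
]

root = ET.Element("ocr-output", engine="EngineA")
for r in results:
    image = ET.SubElement(root, "image", path=r["image"])
    word = ET.SubElement(image, "word", conf=str(r["conf"]),
                         bbox=r["bbox"], orientation=r["orientation"])
    word.text = r["word"]

# Persist so the engine does not have to be rerun for later experiments.
ET.ElementTree(root).write("engineA_output.xml", encoding="utf-8",
                           xml_declaration=True)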
In order to speed up this step, you can optionally run the OCR systems in parallel. Note that, depending on the product, there may be licensing restrictions that make this impossible (i.e., there are licenses that only allow you to run one OCR engine on one CPU core at a time). To remove these restrictions you will probably need to purchase extra licenses.
4.3. Decision
Using the OCR output from the last step, the Decision step is responsible for generating the dataset containing images of words and their respective labels. Its heuristics are mapped onto two operators: the Matching Operator and the Confidence Operator.
Confidence Operator. Most OCR engines provide a confidence per character (a number between 0 and 1). The purpose of this operator is to evaluate the word confidence based on the character confidences. This value can be calculated using the median, average or minimum of the character confidences. We suggest using the minimum score of each word as the confidence.
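In code, the operator reduces to a one-line reduction over the character scores; a minimal sketch:

def word_confidence(char_confidences):
    # Collapse per-character confidences (values in [0, 1]) into a single
    # word confidence; as suggested above, we take the minimum.
    return min(char_confidences)

word_confidence([0.99, 0.84, 0.97])  # -> 0.84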
Matching Operator. The matching operator decides which words will be used in the final generated dataset. For example, let's say both Engine A and Engine B recognize the word “house” in a given image, but Engine C does not. The matching operator decides whether this word will be selected for the final dataset or not.
The Confidence Operator and the Matching Operator work together for the final image selection. While the former evaluates the word confidence, the latter applies a threshold and ignores words whose confidence falls below it.
At the same time, the Matching Operator uses a few more heuristics for its decision, called Match All and Keep All. Under the Keep All heuristic, the final dataset will contain all non-ignored words from all engines, only respecting the threshold. For example, suppose Engine A recognized the words “victory” and “house” in one image and Engine B recognized the words “victory” and “ouse” in the same image; the words “victory”, “house” and “ouse” will all go into the final dataset, regardless of the conflicting results. When using Keep All, the final dataset has more images than with the Match All heuristic. This heuristic is also important for understanding the behavior of each OCR engine.
In contrast, the Match All heuristic will only use the non-ignored words that were recognized by all engines. In the previous example, the final dataset will only contain the word “victory”, because that was the only non-ignored word recognized by both engines.
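The two heuristics can be sketched as set operations over the per-engine outputs; here each engine's output is assumed to be a dict mapping words to confidences for a single image, and the threshold value is illustrative:

def keep_all(engine_outputs, threshold=0.8):
    # Union of all non-ignored words across engines.
    return {w for out in engine_outputs.values()
            for w, conf in out.items() if conf >= threshold}

def match_all(engine_outputs, threshold=0.8):
    # Only the non-ignored words recognized by every engine.
    kept = [{w for w, conf in out.items() if conf >= threshold}
            for out in engine_outputs.values()]
    return set.intersection(*kept) if kept else set()

outputs = {"A": {"victory": 0.95, "house": 0.91},
           "B": {"victory": 0.97, "ouse": 0.88}}
print(keep_all(outputs))   # {'victory', 'house', 'ouse'}
print(match_all(outputs))  # {'victory'}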
Conclusion
We presented an analysis of the current state of datasets, showing how they are built and their main features. As we pointed out, those datasets suffer from a limited amount of data and, in some cases, from limited diversity or a lack of multiple versions of the data (many samples of the same sentence or word). To overcome this problem we presented two paths to be followed. The first, focused on handwriting synthesis, reviews the current state of the art of the techniques and shows how to use them to push past the limitations of diversity and variety. The second, focused on Optical Character Recognition, combines several systems to produce a better solution based on a tiebreak decision using the confidence and matching operators. With these two paths, we believe that we can overcome the current shortage of data and improve the current datasets and, consequently, the recognition models.
Acknowledgment
The authors would like to thank CNPq for supporting the development of this chapter through the research projects granted by “Edital Universal” (Process 444745/2014-9) and “Bolsa de Produtividade DT” (Process 311912/2015-0). In addition, the authors also acknowledge Document Solutions for providing several real image samples useful for the development of this research.
References
Ahmad, I. and Fink, G. A. (2015). Training an Arabic handwriting recognizer without a handwritten training data set. In Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pages 476–480. IEEE.
Ahmad, I., Fink, G. A., and Mahmoud, S. A. (2014). Improvements in sub-character HMM model based Arabic text recognition. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, pages 537–542. IEEE.
Al-Muhtaseb, H., Elarian, Y., and Ghouti, L. (2011). Arabic handwriting synthesis. In First
International Workshop on Frontiers in Arabic Handwriting Recognition, 2010.
Antonacopoulos, A., Bridson, D., Papadopoulos, C., and Pletschacher, S. (2009). A re-
alistic dataset for performance evaluation of document layout analysis. In 2009 10th
International Conference on Document Analysis and Recognition, pages 296–300.
Bahdanau, D., Cho, K., and Bengio, Y. (2014). Neural machine translation by jointly
learning to align and translate. arXiv preprint arXiv:1409.0473.
Brunessaux, S., Giroux, P., Grilhères, B., Manta, M., Bodin, M., Choukri, K., Galibert, O., and Kahn, J. (2014). The Maurdor project: Improving automatic processing of digital documents. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 349–354.
Chen, H.-I., Lin, T.-J., Jian, X.-F., Shen, I., Chen, B.-Y., et al. (2015). Data-driven handwrit-
ing synthesis in a conjoined manner. In Computer Graphics Forum, volume 34, pages
235–244. Wiley Online Library.
Diem, M., Kleber, F., and Sablatnig, R. (2011). Text classification and document layout
analysis of paper fragments. In 2011 International Conference on Document Analysis
and Recognition, pages 854–858.
Dinges, L., Al-Hamadi, A., Elzobi, M., El-etriby, S., and Ghoneim, A. (2015). ASM based synthesis of handwritten Arabic text pages. The Scientific World Journal, 2015.
Duda, R. O., Hart, P. E., and Stork, D. G. (2012). Pattern classification. John Wiley &
Sons.
Elarian, Y., Ahmad, I., Awaida, S., Al-Khatib, W. G., and Zidouri, A. (2015). An Arabic handwriting synthesis system. Pattern Recognition, 48(3):849–861.
Fischer, A., Frinken, V., Bunke, H., and Suen, C. Y. (2013). Improving HMM-based keyword spotting with character language models. In 2013 12th International Conference on Document Analysis and Recognition, pages 506–510. IEEE.
Fischer, A., Frinken, V., Fornés, A., and Bunke, H. (2011a). Transcription alignment of Latin manuscripts using hidden Markov models. In Proceedings of the 2011 Workshop on Historical Document Imaging and Processing, pages 29–36. ACM.
Fischer, A., Indermühle, E., Bunke, H., Viehhauser, G., and Stolz, M. (2010a). Ground
truth creation for handwriting recognition in historical documents. In Proceedings of the
9th IAPR International Workshop on Document Analysis Systems, DAS ’10, pages 3–10,
New York, NY, USA. ACM.
Fischer, A., Indermühle, E., Frinken, V., and Bunke, H. (2011b). HMM-based alignment of inaccurate transcriptions for historical documents. In 2011 International Conference on Document Analysis and Recognition, pages 53–57.
Fischer, A., Keller, A., Frinken, V., and Bunke, H. (2012). Lexicon-free handwritten word spotting using character HMMs. Pattern Recognition Letters, 33(7):934–942.
Fischer, A., Riesen, K., and Bunke, H. (2010b). Graph similarity features for HMM-based handwriting recognition in historical documents. In Frontiers in Handwriting Recognition (ICFHR), 2010 International Conference on, pages 253–258.
Fischer, A., Wuthrich, M., Liwicki, M., Frinken, V., Bunke, H., Viehhauser, G., and Stolz,
M. (2009). Automatic transcription of handwritten medieval documents. In Virtual Sys-
tems and Multimedia, 2009. VSMM’09. 15th International Conference on, pages 137–
142. IEEE.
Frinken, V., Fischer, A., Manmatha, R., and Bunke, H. (2012). A novel word spotting
method based on recurrent neural networks. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 34(2):211–224.
Garris, M. D., Blue, J. L., Candela, G. T., et al. (1997). NIST form-based handprint recognition system (release 2.0).
Gatos, B., Louloudis, G., Causer, T., Grint, K., Romero, V., Sánchez, J. A., Toselli, A. H., and Vidal, E. (2014). Ground-truth production in the tranScriptorium project. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 237–241.
Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint
arXiv:1308.0850.
Indermühle, E., Liwicki, M., and Bunke, H. (2010). IAMonDo-database: an online handwritten document database with non-uniform contents. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pages 97–104. ACM.
Kleber, F., Fiel, S., Diem, M., and Sablatnig, R. (2013). CVL-database: An off-line database for writer retrieval, writer identification and word spotting. In 2013 12th International Conference on Document Analysis and Recognition, pages 560–564. IEEE.
LeCun, Y., Cortes, C., and Burges, C. J. (2012). The MNIST database of handwritten digits, 1998. Available electronically at http://yann.lecun.com/exdb/mnist.
Liwicki, M. and Bunke, H. (2005b). IAM-OnDB - an on-line English sentence database acquired from handwritten text on a whiteboard. In Eighth International Conference on Document Analysis and Recognition (ICDAR'05), pages 956–961. IEEE.
Marcus, M. P., Marcinkiewicz, M. A., and Santorini, B. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.
Marti, U.-V. and Bunke, H. (1999). A full English sentence database for off-line handwriting recognition. In Document Analysis and Recognition, 1999. ICDAR'99. Proceedings of the Fifth International Conference on, pages 705–708. IEEE.
Martín-Albo, D., Plamondon, R., and Vidal, E. (2014). Training of on-line handwriting text recognizers with synthetic text generated using the kinematic theory of rapid human movements. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, pages 543–548. IEEE.
Menasri, F., Louradour, J., Bianne-Bernard, A.-L., and Kermorvant, C. (2012). The A2iA French handwriting recognition system at the Rimes-ICDAR2011 competition. In IS&T/SPIE Electronic Imaging, pages 82970Y–82970Y. International Society for Optics and Photonics.
Padmanabhan, R. K., Jandhyala, R. C., Krishnamoorthy, M., Nagy, G., Seth, S., and Sil-
versmith, W. (2009). Interactive conversion of web tables. In International Workshop on
Graphics Recognition, pages 25–36. Springer.
Pechwitz, M., Maddouri, S. S., Märgner, V., Ellouze, N., Amiri, H., et al. (2002). IFN/ENIT-database of handwritten Arabic words. In Proc. of CIFED, volume 2, pages 127–136. Citeseer.
Pham, V., Bluche, T., Kermorvant, C., and Louradour, J. (2014). Dropout improves recur-
rent neural networks for handwriting recognition. In Frontiers in Handwriting Recogni-
tion (ICFHR), 2014 14th International Conference on, pages 285–290. IEEE.
Plamondon, R., O’Reilly, C., Galbally, J., Almaksour, A., and Anquetil, É. (2014). Recent
developments in the study of rapid human movements with the kinematic theory: Appli-
cations to handwriting and signature synthesis. Pattern Recognition Letters, 35:225–235.
Rath, T. M. and Manmatha, R. (2007a). Word spotting for historical documents. Interna-
tional Journal of Document Analysis and Recognition (IJDAR), 9(2):139–152.
Rath, T. M. and Manmatha, R. (2007b). Word spotting for historical documents. Interna-
tional Journal of Document Analysis and Recognition (IJDAR), 9(2-4):139–152.
Saleem, S., Cao, H., Subramanian, K., Kamali, M., Prasad, R., and Natarajan, P. (2009). Improvements in BBN's HMM-based offline Arabic handwriting recognition system. In 2009 10th International Conference on Document Analysis and Recognition, pages 773–777. IEEE.
Schlapbach, A., Liwicki, M., and Bunke, H. (2008). A writer identification system for
on-line whiteboard data. Pattern Recogn., 41(7):2381–2397.
Sezgin, M. and Sankur, B. (2004). Survey over image thresholding techniques and quantitative performance evaluation. Journal of Electronic Imaging, 13(1):146–168.
Shahab, A., Shafait, F., Kieninger, T., and Dengel, A. (2010). An open approach towards
the benchmarking of table structure recognition systems. In Proceedings of the 9th IAPR
International Workshop on Document Analysis Systems, pages 113–120. ACM.
Stathis, P., Kavallieratou, E., and Papamarkos, N. (2008). An evaluation technique for
binarization algorithms. J. UCS, 14(18):3011–3030.
Toselli, A. H. and Vidal, E. (2015). Handwritten text recognition results on the Bentham collection with improved classical n-gram-HMM methods. In Proceedings of the 3rd International Workshop on Historical Document Imaging and Processing, pages 15–22. ACM.
Yalniz, I. Z. and Manmatha, R. (2011). A fast alignment scheme for automatic OCR evaluation of books. In 2011 International Conference on Document Analysis and Recognition, pages 754–758. IEEE.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J., Ollason, D., Povey, D., et al. (2002). The HTK book. Cambridge University Engineering Department, 3:175.
PART II.
ANALYSIS AND APPLICATIONS
In: Handwriting: Recognition, Development and Analysis ISBN: 978-1-53611-937-4
Editors: Byron L. D. Bezerra et al.
© 2017 Nova Science Publishers, Inc.
Chapter 7
Mathematical Expression Recognition
Francisco Álvaro, Joan Andreu Sánchez and José Miguel Benedí
1. Introduction
Mathematical notation is a well-known language that has been used all over the world for
hundreds of years. Despite the great number of cultures, languages and even different writ-
ing scripts, mathematical expressions constitute a universal language in many fields. During
the last century and in particular since the development of the Internet, digital information
represents the best resource for accessing and sharing data. Therefore, it is necessary to
digitize documents and to input mathematical expressions directly into computers.
Although most people know how to read or write mathematical expressions, introducing
them into a computer device usually requires learning specific notations or knowledge of
how to use a certain editor. Mathematical expression recognition intends to fill this gap
between the knowledge of a person and the language computers understand.
In printed mathematical expressions, the shapes of the symbols and their spatial arrangement are more regular. Thus, individual elements and the relations between them can be determined more consistently. Handwriting introduces more variability in the shape of the symbols and in the relationships between them. Also, there are many different writers and writing styles; thus, handwritten mathematical expression recognition is more challenging. Figures 1 and 2 show an example of the printed and handwritten versions of the same formula.
Regarding the input representation, we consider the problem to be off-line if the math-
ematical expression is represented as an image, i.e. a matrix of pixels. On the other hand,
a mathematical expression is considered to be on-line when it has been acquired using a
device which provides us with the temporal information of the writing, i.e. the input is a
time sequence of points.
The representation of mathematical expressions is based on different primitives depend-
ing on the type of expression. Off-line expressions are usually based on connected compo-
nents.
The primitives for representing on-line mathematical expressions are usually strokes.
Definition 2. A stroke is the sequence of points drawn from when a pen touches the surface
until the user lifts the pen from the surface.
These definitions of primitives can be seen in the examples of Figures 1 and 2. In the
printed expression of Figure 1, symbols π and + are made up of one connected component,
and symbol = is composed of two connected components. If the handwritten expression of Figure 2 were on-line, symbol π would be composed of three strokes, symbol + would be composed of two strokes and the number 0 would be drawn using just one stroke. But if the handwritten expression of Figure 2 were off-line, the instance of symbol π would be composed of two connected components, and the instances of symbol + and the number 0 would each be made up of one connected component.
Ambiguity is inherent to handwriting production, in that the same expression could have several valid interpretations with alternative segmentations: 1 − 1 < x, 1 − kx or H < x.
(Figure: two alternative structural interpretations of the symbols of the expression x² + 1, relating them through Right and Superscript relations.)
The structure of an expression can be obtained by computing the minimum spanning tree. Many approaches of this group have been proposed and we briefly summarize some of them below.
Eto and Suzuki (2001) developed a model for printed math expression recognition that
computed the minimum spanning tree of a network representation of the expression. Tapia
and Rojas (2004) presented a proposal for online recognition also based on constructing the
minimum spanning tree and using symbol dominance. Zanibbi et al. (2002) recognized an
expression as a tree, and proposed a system based on a sequence of tree transformations.
Lehmberg et al. (1996) defined a net so that the sequence of symbols within the handwritten
expression was represented by a path through the graph. Shi et al. (2007) presented a similar
system where symbol segmentation and recognition were tackled simultaneously based on
graphs. They then generated several symbol candidates for the best segmentation, and the
recognized expression was computed in the final structural analysis (Shi and Soong, 2008).
This group of approaches generally results in efficient algorithms for recognizing for-
mulas, and trees and graphs are proper models for representing mathematical expressions.
However, context-free dependencies are not naturally modeled in most of these structures.
Also, some approaches require a one-dimensional order, but mathematical notation is 2D.
Therefore, the order is often achieved by detecting baselines and exploiting the left-to-right
reading order. But errors in the baseline detection cannot be solved in further steps. Another
option to obtain a one-dimensional order in online recognition is to assume that symbols
are written with consecutive strokes, which limits the set of accepted inputs.
MacLean and Labahn (2013) developed an approach using relational grammars and fuzzy sets. Although the previous approaches use different types of grammars, the methodology is based on the same overall process.
Grammars allow us to model complex structural relationships by means of the rules,
which combine sub-problems to construct larger hypotheses (see Figure 10). In this chapter
we will focus on solutions based on PCFG since we will detail an approach based on this
formalism in the next section.
Proposals based on PCFG use grammars to model the structure of the expression, but
the recognition systems are different. Garain and Chaudhuri (2004) proposed a system that
combines online and offline information in the structural analysis. First, they created on-
line hypotheses based on determining baselines in the input expression, and then offline
hypotheses using recursive horizontal and vertical splits. Finally they used a context-free
grammar to guide the process of merging the hypotheses. Yamamoto et al. (2006) presented
a version of the CYK algorithm for parsing 2D-PCFGs with the restriction that symbols and
relations must follow the writing order. They defined probability functions based on a re-
gion representation called “hidden writing area”. Průša and Hlaváč (2007) described a system
for offline recognition using 2D context-free grammars. Their proposal was penalty-based
so that weights were associated with regions and syntactic rules. The model proposed by
Awal et al. (2014) considers several segmentation hypotheses based on spatial information,
and the symbol classifier has a rejection class in order to avoid incorrect segmentations.
Álvaro et al. (2016) developed an integrated model based on parsing 2D-PCFG where the
recognition process globally optimizes the most likely expression according to several prob-
abilistic sources. In the following section, we will further detail this proposal as an example
of a solution for math expression recognition.
(Figure content: the handwritten expression (x + y)², written with eight strokes labeled o1 . . . o8.)
Figure 11. Example of input for an on-line handwritten math expression. The order of the
input sequence of strokes is labeled (o = o1 o2 . . . o8 ).
The order of the input sequence of strokes does not necessarily correspond to the sequence of symbols that it represents. For example, we can see that the user first wrote the sub-expression x − y, then added the parentheses and the superscript, (x − y)², and finally converted the subtraction into an addition, (x + y)². This example shows that some symbols might not be made up of
consecutive strokes (e.g. the + symbol in Figure 11). This means that the mathematical
expression would not be correctly recognized if it were parsed monotonically with the input, i.e. processing the strokes in the order in which they were written. Likewise, the sequence
of symbols that make up a sub-expression does not have to respect the writing order (e.g.
the parentheses and the sub-expression they contain in Figure 11).
Given a sequence of input strokes, the output of a mathematical expression recognizer
is usually a sequence of symbols (Shi et al., 2007). However, we consider that a signifi-
cant element of the output is the structure that defines the relationship between the symbols
which make up the final mathematical expression. As mentioned above, we propose model-
ing the structural relationships of a mathematical expression using a statistical grammatical
model. By doing so, we define the problem of mathematical expression recognition as ob-
taining the most likely parse tree given a sequence of strokes. Figure 12 shows a possible
parse tree for the expression given in Figure 11, where we can observe that a (context-
free) structural model would be appropriate due to, for instance, structural dependencies
in bracketed expressions. The output parse tree represents the structure that relates all the
symbols and sub-expressions that make up the input expression. The parse tree derivation
produces the sequence of pre-terminals that represent the recognized mathematical sym-
bols. Furthermore, to generate this sequence of pre-terminals, we must take into account all
stroke combinations in order to form the possible mathematical symbols.
Taking these considerations into account, two main problems have been observed. First,
the segmentation and recognition of symbols is closely related to the alignment of mathe-
matical symbols to strokes. Second, the structural analysis of a mathematical expression ad-
dresses the problem of finding the parse tree that best accounts for the relationships between
different mathematical symbols (pre-terminals). Obviously, these two problems are closely related.
(Figure content: a parse tree whose non-terminals Exp, ParExp, Sym, LeftPar, RightPar, OpBin and OpSym derive the terminals ( x + y ) ∧ 2, aligned with the strokes o1 . . . o8.)
Figure 12. Parse tree of expression (x + y)2 given the input sequence of strokes described
in Figure 11. The parse tree represents the structure of the mathematical expression and it
produces the 6 recognized symbols that account for the 8 input strokes.
If we approximate the previous probability by the maximum probability parse tree, and assume that the structural part of the equation depends only on the sequence of pre-terminals, we obtain

t̂ ≈ arg max_t max_s p(s | o) · p(t | s)   (1)

such that p(s | o) represents the observation (symbol) likelihood and p(t | s) represents the structural probability.
This problem can be solved in two steps. First, by calculating the segmentation of the
input into mathematical symbols and, second, by computing the structure that relates all
recognized symbols (Zanibbi et al., 2002).
However, we propose here a fully integrated strategy for computing Equation (1) where
symbol segmentation, symbol recognition and structural analysis of the input expression are
globally determined. This way, all the information is taken into account in order to obtain
the most likely mathematical expression.
In the section below, we define the observation model that accounts for the probability of
recognition and segmentation of symbols, p(s|o). The probability that accounts for the
structure of the mathematical expression p(t|s) is described in the Structural Probability
Section.
Definition 4. We define BK as the set of all possible segmentations of the input strokes o in
K parts. Similarly, we define the set of all segmentations B as:
B = ⋃_{1 ≤ K ≤ N} BK
Then, in Equation (1), we can define a generative model p(s, o), rather than p(s|o),
because, given that the term p(o) does not depend on the maximization variables s and
t, we can drop it. The next step is to replace the sequence of N input strokes o by its
previously defined set of segmentations, b = b(o, K) ∈ BK where 1 ≤ K ≤ N . Finally,
given K, we define a hidden variable that limits the number of strokes for each of the K
pre-terminals (symbols) that make up the segmentation, l : l1 . . . lK . Each li falls within the
range 1 ≤ li ≤ min(N, Lmax ), where Lmax is a parameter that constrains the maximum
number of strokes that a symbol can have.
p(s, o) = Σ_{1≤K≤N} Σ_{b∈BK} Σ_{l} p(s, b, l)
In order to develop this expression, we factor it with respect to the number of pre-terminals
(symbols) and assume the following constraints: 1) we approximate the summations by
maximizations, 2) the probability of a possible segmentation depends only on the spatial
constraints of the strokes it is made up of, 3) the probability of a symbol depends only on
the set of strokes associated with it, and 4) the number of strokes for a pre-terminal depends
only on the symbol it represents:
p(s, o) ≈ max_{K} max_{b∈BK} max_{l} ∏_{i=1}^{K} p(bi) p(si | bi) p(li | si)   (2)
From Equation (2) we can conclude that we need to define three models: a symbol segmentation model, p(bi), a symbol classification model, p(si | bi), and a symbol duration model, p(li | si).
Definition 5. The distance between two strokes oi and oj can be defined as the Euclidean
distance between their closest points.
Definition 6. A stroke oi is considered visible from oj if the straight line between the closest
points of both strokes does not cross any other stroke ok .
If a stroke oi is not visible from oj we consider that their distance is infinite. For
example, given the expression in Figure 11, the strokes visible from o4 are o3 , o6 and o8 .
Furthermore, we know that a multi-stroke symbol is composed of strokes that are spa-
tially close. For this reason, we only consider segmentation hypotheses bi where strokes are
close to each other.
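Definition 5 translates directly into code; a minimal sketch using SciPy, where each stroke is given as an (n, 2) array of (x, y) points (the visibility test of Definition 6, which needs segment-intersection checks against every other stroke, is omitted here):

import numpy as np
from scipy.spatial.distance import cdist

def stroke_distance(a, b):
    # Minimum Euclidean distance between the points of two strokes.
    return cdist(np.asarray(a, float), np.asarray(b, float)).min()

stroke_distance([(0, 0), (1, 0)], [(3, 4), (5, 5)])  # -> 4.472...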
Using these definitions, we can characterize the set of possible segmentation hypotheses.
Definition 8. Let G be an undirected graph such that each stroke is a node and edges only
connect strokes that are visible and close. Then, a segmentation hypothesis bi is admissible
if the strokes it contains form a connected subgraph in G.
A Gaussian mixture model (GMM) is trained to discard the hypotheses that cannot form a mathematical symbol from the set of all admissible segmentations B. A segmentation hypothesis bi is represented by the 4-dimensional normalized feature vector g(bi) = [d, σ, δ, θ], and the probability p(bi) that a hypothesis bi forms a mathematical symbol is obtained as

p(bi) = pGMM(c = 1 | g(bi))   (3)
For symbol classification, each point of a hypothesis is represented by the following set of on-line features:
• Normalized coordinates: (x, y) values normalized such that y ∈ [0, 100] and the aspect ratio of the sample is preserved.
• Normalized first derivatives: (x′, y′).
• Normalized second derivatives: (x″, y″).
• Curvature: κ, the inverse of the radius of the curve at each point.
It should be noted that no resampling is required prior to the feature extraction process be-
cause first derivatives implicitly perform writing speed normalization (Toselli et al., 2007).
Furthermore, the combination of on-line and off-line information has been proven to
improve recognition accuracy (Winkler, 1996; Keshari and Watt, 2007; Álvaro et al., 2014).
For this reason, we also rendered the image representing the symbol hypothesis bi and
extracted off-line features to train another BLSTM-RNN classifier.
Following (Álvaro et al., 2014, 2016), for a segmentation hypothesis bi , we generated
the image representation as follows. We set the image height to H pixels and kept the aspect
ratio (up to 5H, in order to prevent creating images that were too wide). Then we rendered
the image representation by using linear interpolation between each two consecutive points
in a stroke. The final image was produced after smoothing it with a mean filter using a 3 × 3 pixel window, and binarizing every pixel that differs from the background (white).
Given a binary image of height H and W columns, for each column we computed 9 off-line features (Marti and Bunke, 2001; Álvaro et al., 2014).
Finally, given a segmentation hypothesis bi and using Equation (4), we obtained the
posterior probability of a BLSTM-RNN with on-line features and the posterior probability
of a BLSTM-RNN with off-line features. We combined the probabilities of both classifiers
using linear interpolation and a weight parameter (α). The final probability of the symbol classification model is calculated as

p(si | bi) = α pon(si | bi) + (1 − α) poff(si | bi)   (5)

where pon and poff are the posterior probabilities of the on-line and off-line BLSTM-RNN classifiers, respectively.
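A minimal sketch of this interpolation, assuming the two classifiers' outputs are available as dicts mapping symbol classes to posterior probabilities (the dict representation and the default α are illustrative; α would be tuned on validation data):

def combine_posteriors(p_on, p_off, alpha=0.5):
    # Linear interpolation of on-line and off-line posteriors, as in Eq. (5).
    classes = set(p_on) | set(p_off)
    return {c: alpha * p_on.get(c, 0.0) + (1 - alpha) * p_off.get(c, 0.0)
            for c in classes}

combine_posteriors({"x": 0.7, "y": 0.3}, {"x": 0.9, "y": 0.1})  # x: 0.8, y: 0.2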
In a PCFG, the probability of a parse tree t is the product of the probabilities of the rules used in its derivation,

p(t) = ∏_{(A→α, t)} p(α | A)

where p(α|A) is the probability of the rule A → α and represents the probability that α is derived from A. Moreover, (A → α, t) denotes all rules (A → α) contained in the
parse tree t. In the defined 2D extension of PCFG, the composition of subproblems has
an additional constraint according to a spatial relationship r. Let the spatial relationship r
between two regions be a hidden variable. Then, the probability of a binary rule is written
as:

p(BC | A) = Σ_r p(BC, r | A)
When the inner probability in the previous addition is estimated from samples, the mode
is the dominant term. Therefore, by approximating summations by maximizations, and
assuming that the probability of a spatial relationship depends only on the subproblems B
and C involved, the structural probability of a mathematical expression becomes:
p(t, s) ≈ ∏_{(A→a, t)} p(a | A)   (7)
× ∏_{(A→BC, t)} max_r p(BC | A) p(r | BC)   (8)
where p(a|A) and p(BC|A) are the probabilities of the rules of the grammar, and p(r|BC)
is the probability that regions encoded by non-terminals B and C are arranged according to
spatial relationship r.
(Figure content: the geometric features dx, dy, dx1, dy1, dx2, dy2, dhc and the vertical centroid difference D, measured between the bounding boxes of regions B and C.)
Figure 13. Geometric features for classifying the spatial relationship between regions B
and C.
The most difficult relationships to distinguish are right, subscript and superscript (Álvaro et al., 2014; Álvaro and Zanibbi, 2013). An important feature for distinguishing between these three relationships is the difference between vertical centroids (D). Some symbols have ascenders, descenders or certain shapes where the vertical centroid is not the best placement for the symbol center.
With a view to improving the placement of vertical centroids, we divided symbols into
four typographic categories: ascendant (e.g. d or λ), descendant (p, µ), normal (x, +) and
middle (7, Π). For normal symbols the centroid is set to the vertical centroid. For ascendant
symbols the centroid is shifted downward to (centroid + bottom)/2. Likewise, for descen-
dant symbols the centroid is shifted upward to (centroid + top)/2. Finally, for middle
symbols the vertical centroid is defined as (top + bottom)/2.
Once the feature vector representing a spatial relationship has been defined, we can train a GMM using labeled samples, so that the probability of the spatial relationship model is computed as the posterior probability provided by the GMM for class r:

p(r | BC) = pGMM(r | h(B, C))
This model is able to provide a probability for every spatial relationship r between any two given regions. However, there are several situations where we would not want the statistical model to assign as high a probability as it otherwise would. Considering the expression in Figure 14, the GMMs might yield a high probability for the superscript relationship ‘3x’, for the below relationship ‘π2’, and for the right relationship ‘2 3’; though we might expect a lower probability, since these are not the true relationships in the correct mathematical expression.
Intuitively, those symbols or subexpressions that are closer together should be combined
first. Furthermore, two symbols or subexpressions that are not visible from each other
should not be combined. These ideas are introduced into the spatial relationship model as a
penalty based on the distance between strokes.
Specifically, given the combination of two hypotheses B and C, we computed a penalty function based on the minimum distance between the strokes of B and C, scaled so that it lies in the range [0, 1]. It should be noted that, although it is a penalty function, since it multiplies the probability of a hypothesis, the lower the penalty value is, the more the probability is penalized.
This function is based on the single-linkage hierarchical clustering algorithm (Sibson,
1973) where, at each step, the two clusters separated by the shortest distance are combined.
We defined a penalty function in order to avoid making hard decisions, because it is not
always the case that the two closest strokes must be combined first.
The final statistical spatial relationship probability is computed as the product of the probability provided by the GMM and the penalty function based on hierarchical clustering:

p(r | BC) = pGMM(r | h(B, C)) · pen(B, C)   (9)
An interesting property of the application of the penalty function is that, given that the distance between non-visible strokes is considered infinite, this function prunes many hypotheses. Furthermore, it favors the combination of closer strokes over strokes that are further apart. For example, in the superscript relationship between symbols 3 and x in Figure 14, although that relation itself may be likely, the penalty will favor combining the 3 with the fraction bar first, and later the fraction bar (and the entire fraction) with the x.
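A sketch of such a penalty is shown below. The exact functional form used by the authors is not reproduced here; we assume a simple inverse decay with distance, which preserves the two stated properties (values in [0, 1], and a zero penalty, i.e. pruning, for non-visible pairs at infinite distance):

import math

def relation_penalty(strokes_B, strokes_C, stroke_distance):
    # Minimum distance between any stroke of B and any stroke of C,
    # in the spirit of single-linkage hierarchical clustering.
    d = min(stroke_distance(b, c) for b in strokes_B for c in strokes_C)
    return 0.0 if math.isinf(d) else 1.0 / (1.0 + d)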
We now describe the parsing algorithm for the statistical framework described previously. Using this algorithm, we compute the most likely parse tree according to the proposed model.
The parsing algorithm is essentially a dynamic programming method. First, the ini-
tialization step computes the probability of several mathematical symbols for each possible
segmentation hypothesis. Second, the general case computes the probability of combining
different hypotheses so that it builds the structure of the mathematical expression.
The dynamic programming algorithm computes a probabilistic parse table γ. Following a notation similar to (Goodman, 1999), each element of γ is a probabilistic non-terminal vector whose components are defined as:

γ(A, b, l) = p̂(A ⇒* b),  l = |b|

where γ(A, b, l) denotes the probability of the best derivation in which the non-terminal A generates a set of strokes b of size l.
Initialization: In this step the parsing algorithm computes the probability of every
admissible segmentation b ∈ B as described in the Symbol Segmentation Model Section.
The probability of each segmentation hypothesis is computed according to Eqs. (1) and (2) as

γ(A, b, l) = max_s p(s | A) p(b) p(s | b) p(l | s),  1 ≤ l ≤ Lmax   (10)

where Lmax is a parameter that constrains the maximum number of strokes that a symbol can have.
This probability is the product of several factors, maximized over every mathematical symbol class s: the probability of the terminal rule, p(s|A) (Equation (7)), the probability of the segmentation model, p(b) (Equation (3)), the probability of the mathematical symbol classifier, p(s|b) (Equation (5)), and the probability of the duration model, p(l|s) (Equation (6)).
General case: In this step the parsing algorithm computes a new hypothesis γ(A, b, l)
by merging previously computed hypotheses from the parsing table until all N strokes are
parsed. The probability of each new hypothesis is calculated according to Eqs. (1) and (8)
as:

γ(A, b, l) = max p(BC | A) p(r | BC) γ(B, bB, lB) γ(C, bC, lC)

where b = bB ∪ bC, bB ∩ bC = ∅ and l = lB + lC.
This expression shows how a new hypothesis γ(A, b, l) is built by combining two sub-
problems γ(B, bB , lB ) and γ(C, bC , lC ), considering both syntactic and spatial informa-
tion: probability of binary grammar rule p(BC|A) (Equation (8)) and probability of spatial
relationship classifier p(r|BC) (Equation (9)). It should be noted that both distributions
significantly reduce the number of hypotheses that are merged. Also, the probability is
maximized taking into account that a probability might already have been set by Equation (10) during the initialization step.
Finally, the most likely hypothesis and its associated derivation tree t̂ that accounts
for the input expression can be retrieved in γ(S, o, N) (where S is the start symbol of the
grammar).
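The scheme can be sketched as follows. This is a minimal illustration, not the authors' implementation: the model callables (p_rule, p_bin, p_seg, p_sym, p_dur, p_rel) stand in for the models of the previous sections, and the exhaustive enumeration of stroke subsets ignores the admissibility and spatial-region pruning that makes the real parser practical.

from itertools import combinations
from collections import defaultdict

def parse(strokes, term_rules, bin_rules, m, L_max=4):
    # term_rules: [(A, s)] for rules A -> s; bin_rules: [(A, B, C)] for
    # A -> B C; m: dict of model callables. Returns the parse table gamma.
    n = len(strokes)
    gamma = {}                  # (nonterminal, frozenset of strokes) -> prob
    by_size = defaultdict(set)  # subset size -> keys of that size

    def update(A, b, p):
        if p > gamma.get((A, b), 0.0):
            gamma[(A, b)] = p
            by_size[len(b)].add((A, b))

    # Initialization: symbol hypotheses of up to L_max strokes (Eq. (10)).
    for l in range(1, min(n, L_max) + 1):
        for b in combinations(strokes, l):
            b = frozenset(b)
            for A, s in term_rules:
                update(A, b, m["p_rule"](A, s) * m["p_seg"](b)
                       * m["p_sym"](s, b) * m["p_dur"](len(b), s))

    # General case: merge disjoint subproblems, smallest sizes first.
    for size in range(2, n + 1):
        for lB in range(1, size):
            for (B, bB) in list(by_size[lB]):
                for (C, bC) in list(by_size[size - lB]):
                    if not bB.isdisjoint(bC):
                        continue
                    for (A, B2, C2) in bin_rules:
                        if B2 == B and C2 == C:
                            update(A, bB | bC,
                                   m["p_bin"](A, B, C)
                                   * m["p_rel"](B, C, bB, bC)
                                   * gamma[(B, bB)] * gamma[(C, bC)])

    # The best complete parse is at (S, frozenset(strokes)), S the start symbol.
    return gamma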
(Figure content: for each relation (Right and Sub/Superscript, Below, Inside, Mroot), a rectangular search region with coordinates (x, y, s, t) is defined as offsets of the region r(bB) in units of Rw and Rh.)
Figure 15. Spatial regions defined to retrieve hypotheses relative to hypothesis bB according to different relations.
For each relation, the parser retrieves the hypotheses γ(C, bC, lC) falling within the corresponding area by performing a binary search over that set in O(log N). Although the regions are arranged in two dimensions and they are sorted only in one dimension, this approach is reasonable since mathematical expressions grow mainly from left to right.
Assuming that this binary search will retrieve a small constant number of hypotheses, the final complexity achieved is O(N³ log N |P|). Furthermore, many unlikely hypotheses
are pruned during the parsing process.
(Figure content. LaTeX: x_a^2 + 1;  x_a^{2} + 1;  x_{a}^2 + 1;  x_{a}^{2} + 1;  x^2_a + 1;  x^2_{a} + 1;  x^{2}_a + 1;  x^{2}_{a} + 1. MathML: two equivalent trees, both encoding x with subscript a and superscript 2 through <msubsup>, but grouping the trailing + 1 under different <mrow> elements.)
Figure 16. Some examples of different valid representations for the math expression x_a^2 + 1 in LaTeX and MathML format.
4.2. EMERS
A mathematical expression can be naturally represented as a tree (see Figure 7). The tree representation, commonly in MathML format, simultaneously contains the symbols and the structure of a given mathematical expression. For this reason, computing an edit distance between trees is an appropriate method for computing the error between a recognized expression and its ground-truth tree.
Sain et al. (2010) proposed EMERS,1 a tree matching-based performance evaluation metric for mathematical expression recognition. Using the tree representation of two expressions in MathML (which can also be easily obtained from LaTeX), they defined a method for computing the edit distance between them. Since the matching of trees is a hard problem, they proposed matching ordered trees represented by their corresponding Euler strings. Given two trees encoded by two Euler strings A and B, the overall complexity of the EMERS algorithm is O(|A|²|B|²), or more generally O(n⁴).
EMERS computes the set of edit operations that transform the recognized tree into the ground-truth tree. Accordingly, EMERS is not a normalized metric but an edit distance, which equals zero if both trees are identical. The edit distance between trees is a well-defined metric, but the representation ambiguity of MathML can mean that correct recognition results are considered errors. In Álvaro et al. (2012b), an experiment using two equivalent ground-truths showed that the expression recognition rate, computed as the percentage of expressions with EMERS equal to zero, differed by almost 8% depending on the ground-truth used. A canonical form for representing math expressions in MathML is required in order to avoid this problem. Sain et al. (2010) tried to overcome it by converting the MathML to LaTeX and then converting the LaTeX back to MathML.
As with global metrics, the computed error value accounts for the entire expression, but the source of the errors is not explicitly known. The set of edit operations is provided, and we could determine whether they were related to symbols or tags, but segmentation mistakes could not be detected and would show up as symbol and tag errors.
Finally, the authors propose two options for computing the error: either every edit operation has the same cost, or the cost depends on the baseline (using the concept of level defined in previous sections) in which the edit operations are done. The default EMERS value is computed using the weighted version, and this results in a non-symmetrical distance in some cases.
for i = 1 to I do
    for j = 1 to J do
        i′ = i · X/I ;  j′ = j · Y/J ;  z = c/2
        map(i, j) = min_{x ∈ S1, y ∈ S2} Σ_{m=−z}^{z} Σ_{n=−z}^{z} (Av_{i+n,j+m} − Bv_{x+n,y+m})² + (Ah_{i+n,j+m} − Bh_{x+n,y+m})²
    end
end
return |cp| / |fg|    // correct pixels ratio
end
Algorithm 1: Binary IDM (BIDM) evaluation algorithm.
The harmonic mean f1 = 2(p · r)/(p + r) is then computed, and we obtain the final error value. Figure 17 illustrates an example of this process.
(Figure content: img1 and img2 are the rendered images of the recognition result and the ground-truth; computing BIDM in both directions (img2 → img1 and img1 → img2) gives precision = 1489/2197 = 0.6777 and recall = 1429/2338 = 0.6112.)
Figure 17. Example of the procedure for computing the IMEGE measure given a math expression recognition result and its ground-truth in LaTeX.
Rendering the image of a math expression encoding copes with the problem of representation ambiguity. IMEGE provides a normalized value in the range [0, 100] that can be interpreted as a visual error (as a human being would perceive it) and is not as pessimistic as the expression recognition rate. IMEGE cannot distinguish the source of the errors, although it can identify the misrecognized zones of the math expression. As a visual error, misrecognitions involving larger symbols affect more pixels than errors produced by smaller symbols. Given that this measure takes the global recognition information into account, it can be very helpful for complementing the expression recognition rate and the symbol-related metrics in order to assess the performance of a system.
Each stroke keeps the spatial relationship of its associated symbol, and the nodes inherit the
spatial relationships of their ancestors in the layout tree.
Figure 18 shows an example of on-line handwritten math expression and two label
graphs: a label graph for its ground-truth, and a label graph for a recognition result con-
taining errors. Each label graph is displayed so that the dashed edges show the inherited
relationships. The adjacency matrix representation is also provided, where the diagonal of
the matrix represents the symbol class of each stroke and other cells provide primitive pair
labels. These pairs encode the spatial relationships (right, superscript, etc.), where an underscore (_) identifies unlabeled strokes or no relationship, and an asterisk (∗) represents two strokes in the same symbol (Zanibbi et al., 2013).
Since the label graph representation contains the information of a mathematical expres-
sion at all levels (symbols, segmentation and structure), several metrics can be computed.
(Figure content: an expression written with five strokes o1 . . . o5; the recognition result 2k^x and the ground-truth 21 < x are each shown as a label graph and its adjacency matrix, with stroke labels on the diagonal and edge labels R (right), Sup (superscript), ∗ (same symbol) and _ (no relationship) off the diagonal.)
Figure 18. Example of label graph representation of an on-line handwritten math expression
recognition and its ground-truth. The dashed edges are inherited relationships.
Given a math expression composed of n strokes, its ground-truth label graph, and the
label graph of a recognition result, Zanibbi et al. (2011) defined the following set of metrics.
First, metrics for specific errors:
• Classification error (∆C): the number of strokes that have different symbol classes
(elements of the diagonal of the adjacency matrix) in the label graphs.
• Layout error (∆L): the number of disagreeing edge labels in the label graphs (off-diagonal elements of the adjacency matrix). This error can be decomposed as the sum of the segmentation error (∆S) and the relationships error (∆R), depending on the type of edge: disagreements on same-symbol (∗) edges count as segmentation errors, while the remaining disagreements count as relationship errors.
In the recognition example of Figure 18, we can see that the symbols 1 and < have been
incorrectly grouped as a letter k, and the relationship with the letter x has been incorrectly
detected as superscript. The error metrics previously described for this example are:
• ∆C = 2; {k → 1, k → <}
• ∆S = 2; {∗ → , ∗ → R}
• ∆R = 4; {S → R, S → R, S → R, S → R}
• ∆L = ∆S + ∆R = 6
• ∆Bn = (2 + 6)/5² = 8/25 = 0.32
• ∆E = (1/3)(2/5 + √(2/20) + √(6/20)) = 0.4213
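The stroke-level counts ∆C, ∆S and ∆R can be computed directly from two adjacency matrices; a minimal sketch, using the convention of the example above ('*' for strokes of the same symbol, '_' for no relationship) and attributing each disagreeing edge to ∆S when a same-symbol label is involved and to ∆R otherwise:

def label_graph_errors(gt, rec):
    # gt, rec: n x n label matrices (lists of lists of strings).
    n = len(gt)
    dC = sum(1 for i in range(n) if gt[i][i] != rec[i][i])
    dS = dR = 0
    for i in range(n):
        for j in range(n):
            if i != j and gt[i][j] != rec[i][j]:
                if "*" in (gt[i][j], rec[i][j]):
                    dS += 1   # segmentation disagreement
                else:
                    dR += 1   # relationship disagreement
    return dC, dS, dR, dS + dR  # corresponds to (deltaC, deltaS, deltaR, deltaL)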
All these label graph Hamming distances are proper metrics (Zanibbi et al., 2011), and the time complexity of the expression-level metrics is O(n²), although in practice only the labeled edges have to be compared, which is much faster for sparse graphs. Furthermore, label graph metrics precisely determine the different types of errors at all levels, which is very useful information. Also, the representation ambiguity of formats like LaTeX or MathML is tackled by the label graph representation and the inheritance of relationships.
Finally, precision and recall at the object level (symbols) can also be computed from this representation (Zanibbi et al., 2013). The latest editions of the CROHME competition use these metrics for assessing the performance of the systems (Mouchère et al., 2013; Mouchère et al., 2014); thus these metrics are becoming the current standard in the state-of-the-art literature on mathematical expression recognition, thanks to the authors, who released a great set of open-source tools for computing them3.
3 Label Graph Evaluation Library at http://www.cs.rit.edu/~dprl/Software.html
It should be noted that this set of metrics is based on strokes, i.e. for on-line handwrit-
ten mathematical expressions. However, the authors pointed out that they can be applied to
images using pixels or connected components as primitives, as well as to other related prob-
lems like chemical diagrams, flowchart recognition or tables (Zanibbi et al., 2011, 2013). In
order to do so, the ground-truth has to be provided at primitive level. In the CROHME com-
petitions, the InkML format contains all the required information to build the label graphs,
but using only LATEX or MathML annotation would not be enough.
5. Experimentation
We developed a system4 that implements the parsing algorithm of the approach proposed in
this chapter5 . In order to evaluate the performance of this proposal, we carried out several
experiments using a large public dataset from a recent international competition (Mouchère et al., 2013).
4 https://github.com/falvaro/seshat
5 Demo available at http://cat.prhlt.upv.es/mer
5.1. Dataset
The CROHME 2013 competition (Mouchère et al., 2013) released a large resource for math-
ematical expression recognition as a result of combining and normalizing several datasets.
Currently, it represents the largest public dataset of on-line handwritten mathematical ex-
pressions. This dataset has 8, 836 mathematical expressions for training and a test set of
671 math expressions. The number of mathematical symbol classes is 101.
We reported results on the competition test set in order to provide comparable results.
The metrics reported are based on label graphs as described in the Label Graphs Section.
We needed a validation set for adjusting the models and parameters used in the parsing
algorithm. For this reason, we set 500 mathematical expressions apart from the training
set, so that both sets had the same distribution of symbols. Therefore, the sets used in the
experimentation described in this study were a training set made up of 8,336 expressions
(81K symbols), a validation set with 500 expressions (5K symbols) and the CROHME 2013
test set containing 671 mathematical expressions (6K symbols).
were incorrect segmentations. From the validation set we extracted 35K samples: 4.4%
correct and 95.6% incorrect segmentations.
We trained the GMMs using the training samples and the Expectation-Maximization
algorithm. The parameters of the model were chosen by minimizing the error when clas-
sifying the segmentation hypotheses of the validation set. The number of mixtures in the
final GMMs was 5.
The probabilities of the grammar rules were estimated from the training set by relative frequency,

p(A → α) = c(A → α) / c(A)

where c(A → α) is the number of times that the rule A → α was used when recognizing the training set, and c(A) is the total number of productions used that have A as the left-hand side. In order to account for unseen events, we smoothed the probabilities using add-one smoothing (Jurafsky and Martin, 2008).
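A minimal sketch of this estimation; the rule representation (A, alpha) is illustrative:

from collections import Counter

def estimate_rule_probs(used_rules, all_rules):
    # used_rules: every rule application (A, alpha) observed on the train set.
    # all_rules: the complete rule set, so unseen rules still receive mass.
    rule_count = Counter(used_rules)
    lhs_count = Counter()
    for (A, _), c in rule_count.items():
        lhs_count[A] += c
    rules_per_lhs = Counter(A for A, _ in all_rules)  # outcomes per LHS
    return {(A, a): (rule_count[(A, a)] + 1) / (lhs_count[A] + rules_per_lhs[A])
            for (A, a) in all_rules}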
(Figure content: the train set and an initial system with an equiprobable 2D-PCFG undergo parameter tuning on the validation set; the tuned system performs constrained parsing of the training expressions, yielding a Viterbi estimation of the 2D-PCFG; a final round of parameter tuning produces the final system.)
Figure 19. Diagram of the process for training the final mathematical expression recognition
system.
For these reasons, following (Luo et al., 2008), we assigned different exponential
weights to each model probability, and we also added an insertion penalty in the initial-
ization step (Equation (10)). These parameters alleviate the scaling differences of the prob-
abilities. The weights help to adjust the contribution of each model to the final probability,
since some sources of information are more relevant than the others.
The parameters of the parsing algorithm are: insertion penalty, exponential weights,
segmentation distance threshold and maximum number of strokes per symbol (Lmax ). We
set Lmax = 4 because it accounts for 99.81% of the symbols in the dataset. The remaining
parameters were set initially to 1.0 and we tuned them using the downhill simplex algo-
rithm (Nelder and Mead, 1965) minimizing the ∆E metric (Zanibbi et al., 2011) when
recognizing the validation set (Figure 19).
It was declared the strongest system in the competition. However, as it used a large private dataset, we were not able to fairly compare its performance to that of the other systems. System IV was a preliminary version of seshat and was declared the best
system trained using only the CROHME training dataset. The main differences between
System IV and seshat are as follows: seshat includes off-line information in the sym-
bol classifier; symbol segmentation and spatial relationships classification are carried out
by GMMs in seshat and by SVMs in system IV; seshat uses visibility between strokes
and a clustering-based penalty; and the probabilities of the grammar were not estimated in
System IV.
The system presented in this study significantly outperformed, at all levels, the other systems that were trained using the CROHME dataset. At the symbol level, for the classification of correctly segmented symbols, seshat obtained a recall and precision of about 82.2% and 81.0%, while the next best system (System VIII) obtained 73.8% and 71.0%. The absolute difference was more than 8% in recall and 10% in precision. Regarding spatial relationships, recall and precision for seshat stood at 88% and 82%, while System VIII gave 73% and 77.7%. This translates into an absolute difference of about 14% in recall and 4.3% in precision. Results at the stroke level were also better than those of the systems trained only on the CROHME dataset. The systems were ranked according to the global error metric ∆E, where seshat had 13.2%, some 6.1% less than the next best system (19.3%). Table 3 shows that the confidence interval (Bisani and Ney, 2004) was 13.2% ± 0.9 at 95% confidence. We were not able to obtain confidence intervals for the other systems because their final outputs were not freely available.
In addition to the experimentation comparing our proposed model to other systems, it is interesting to see how each of the stochastic sources contributes to overall system performance. For this reason we also carried out an experiment to observe this behavior. Some
models are mandatory for recognizing mathematical expressions: the symbol classifier, the
spatial relationships classifier and the grammar. We performed an experiment using only
these models (base system), then added the remaining models one by one. Also, the gram-
mar initially had equiprobable productions and then we compared the performance when
the probabilities of the rules were estimated.
Table 3 shows the changes in system performance when each source of information
is added. Global error metrics ∆Bn and ∆E consistently decreased with each added fea-
ture. Confidence intervals computed for ∆E showed that the improvements were significant
from the first row to the last row. Symbol segmentation and symbol recognition also im-
proved with each addition. It is interesting that, when no segmentation model was used,
symbol segmentation still gave good results. This was the case because the parameters of
the integrated approach converged to high values of the insertion penalty and low values of
the segmentation distance threshold. In this way, the parameters of the system itself could
alleviate the lack of a segmentation model. In any case, when the segmentation model was
included, system performance improved significantly. Furthermore, we would like to under-
line that, when the relations penalty was included in the model, the number of hypotheses
explored was reduced by 56.7%.
The structural analysis is harder to evaluate. Prior to estimating the grammar probabili-
ties, the results at object level seem to worsen when the segmentation model was included,
although at the stroke level the errors in spatial relationships decreased from about 11,000 to 9,500 stroke pairs. Because of the inheritance of spatial relations in label graphs (Zanibbi
et al., 2011) some types of structural errors can produce more stroke-pair errors than oth-
ers (Zanibbi et al., 2013). Specifically, when the segmentation model was used, the seg-
mentation distance threshold was approximately twice as high as the value of the threshold
in the base system. This had two effects. First, that the system was able to account for more
symbol segmentations, as shown by the corresponding metrics. Second, that a bad decision
in symbol segmentation can lead to worse structural errors. Nevertheless, the estimation of
the probabilities of the grammar led to great improvements in the detection of the structure
of the expression with barely any changes in symbol recognition performance.
References
Álvaro, F., Sánchez, J. A., and Benedí, J. M. (2012a). Unbiased evaluation of handwrit-
ten mathematical expression recognition. In International Conference on Frontiers in
Handwriting Recognition, pages 181–186.
Álvaro, F., Sánchez, J. A., and Benedı́, J. M. (2013). Classification of On-line Mathemat-
ical Symbols with Hybrid Features and Recurrent Neural Networks. In International
Conference on Document Analysis and Recognition, pages 1012–1016.
Álvaro, F., Sánchez, J.-A., and Benedı́, J.-M. (2013). An image-based measure for evalua-
tion of mathematical expression recognition. In Iberian Conference on Pattern Recogni-
tion and Image Analysis, pages 682–690. Springer Berlin Heidelberg.
Álvaro, F., Sánchez, J. A., and Benedı́, J. M. (2014). Offline Features for Classifying Hand-
written Math Symbols with Recurrent Neural Networks. In International Conference on
Pattern Recognition, pages 2944–2949.
Álvaro, F., Sánchez, J.-A., and Benedí, J.-M. (2014). Recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models. Pattern Recognition Letters, 35(0):58–67.
Álvaro, F., Sánchez, J.-A., and Benedı́, J.-M. (2016). An integrated grammar-based ap-
proach for mathematical expression recognition. Pattern Recognition, 51:135 – 147.
Álvaro, F., Sánchez, J.-A., and Benedí, J.-M. (2012b). Unbiased evaluation of handwrit-
ten mathematical expression recognition. In International Conference on Frontiers in
Handwriting Recognition (ICFHR), pages 181–186.
Álvaro, F. and Zanibbi, R. (2013). A Shape-Based Layout Descriptor for Classifying Spatial
Relationships in Handwritten Math. In ACM Symposium on Document Engineering,
pages 123–126.
Awal, A.-M., Mouchère, H., and Viard-Gaudin, C. (2009). Towards handwritten mathematical expression recognition. In International Conference on Document Analysis and Recognition, pages 1046–1050.
Awal, A.-M., Mouchère, H., and Viard-Gaudin, C. (2010). The problem of handwritten mathematical expression recognition evaluation. In International Conference on Frontiers in Handwriting Recognition, pages 646–651.
Awal, A.-M., Mouchère, H., and Viard-Gaudin, C. (2014). A global learning approach for an online handwritten mathematical expression recognition system. Pattern Recognition Letters, 35:68–77.
Bisani, M. and Ney, H. (2004). Bootstrap estimates for confidence intervals in ASR perfor-
mance evaluation. In IEEE International Conference on Acoustics, Speech, and Signal
Processing, volume 1, pages 409–412, Montreal.
Chan, K.-F. and Yeung, D.-Y. (2000). Mathematical expression recognition: a survey.
International Journal on Document Analysis and Recognition, 3:3–15.
Chan, K.-F. and Yeung, D.-Y. (2001). Error detection, error correction and performance
evaluation in on-line mathematical expression recognition. Pattern Recognition, 34:1671
– 1684.
Lavirotte, S. and Pottier, L. (1998). Mathematical formula recognition using graph gram-
mar. In Proceedings of the SPIE, volume 3305, pages 44–52.
Lehmberg, S., Winkler, H.-J., and Lang, M. (1996). A soft-decision approach for symbol
segmentation within handwritten mathematical expressions. In International Conference
on Acoustics, Speech, and Signal Processing, volume 6, pages 3434–3437.
Luo, Z., Shi, Y., and Soong, F. (2008). Symbol graph based discriminative training and
rescoring for improved math symbol recognition. In IEEE International Conference on
Acoustics, Speech, and Signal Processing, pages 1953–1956.
MacLean, S. and Labahn, G. (2010). Elastic matching in linear time and constant space.
IAPR International Workshop on Document Analysis Systems, pages 551–554.
MacLean, S. and Labahn, G. (2013). A new approach for recognizing handwritten math-
ematics using relational grammars and fuzzy sets. International Journal on Document
Analysis and Recognition, 16(2):139–163.
Marti, U.-V. and Bunke, H. (2001). Using a statistical language model to improve the
performance of an HMM-based cursive handwriting recognition system. International
Journal of Pattern Recognition and Artificial Intelligence, 15(01):65–90.
Mouchère, H., Viard-Gaudin, C., Zanibbi, R., and Garain, U. (2014). ICFHR 2014 Com-
petition on Recognition of On-Line Handwritten Mathematical Expressions (CROHME
2014). In Frontiers in Handwriting Recognition (ICFHR), International Conference on,
pages 791–796.
Mouchère, H., Viard-Gaudin, C., Zanibbi, R., Garain, U., and Kim, D. H. (2013). ICDAR
2013 CROHME: Third International Competition on Recognition of Online Handwrit-
ten Mathematical Expressions. In International Conference on Document Analysis and
Recognition, pages 1428–1432.
Nelder, J. A. and Mead, R. (1965). A simplex method for function minimization. Computer
Journal, 7:308–313.
Ney, H. (1992). Stochastic Grammars and Pattern Recognition. In Speech Recognition and
Understanding, volume 75, pages 319–344.
Okamoto, M., Imai, H., and Takagi, K. (2001). Performance evaluation of a robust method
for mathematical expression recognition. In Proc. 6th International Conference on Doc-
ument Analysis and Recognition (ICDAR’01), pages 121–128.
Otsu, N. (1979). A Threshold Selection Method from Gray-level Histograms. IEEE Trans-
actions on Systems, Man and Cybernetics, 9(1):62–66.
Pavan Kumar, P., Agarwal, A., and Bhagvati, C. (2014). A string matching based algorithm
for performance evaluation of mathematical expression recognition. Sadhana, 39(1):63–
79.
Průša, D. and Hlaváč, V. (2007). Mathematical formulae recognition using 2d grammars. International Conference on Document Analysis and Recognition, 2:849–853.
Sain, K., Dasgupta, A., and Garain, U. (2010). EMERS: a tree matching based performance
evaluation of mathematical expression recognition systems. IJDAR, pages 1–11.
Schuster, M. and Paliwal, K. (1997). Bidirectional recurrent neural networks. Signal Pro-
cessing, IEEE Transactions on, 45(11):2673–2681.
Shi, Y., Li, H., and Soong, F. K. (2007). A Unified Framework for Symbol Segmentation
and Recognition of Handwritten Mathematical Expressions. In International Conference
on Document Analysis and Recognition, pages 854–858.
Shi, Y. and Soong, F. (2008). A symbol graph based handwritten math expression recogni-
tion. In International Conference on Pattern Recognition, pages 1–4.
Sibson, R. (1973). SLINK: an optimally efficient algorithm for the single-link cluster
method. The Computer Journal, 16(1):30–34.
Thammano, A. and Rugkunchon, S. (2006). A neural network model for online handwritten
mathematical symbol recognition. In Intelligent Computing, volume 4113, pages 292–
298.
Toselli, A., Juan, A., and Vidal, E. (2004). Spontaneous handwriting recognition and clas-
sification. In Proc. of the 17th International Conference on Pattern Recognition, pages
433–436, Cambridge, UK.
Toselli, A., Pastor, M., and Vidal, E. (2007). On-line handwriting recognition system for
tamil handwritten characters. In Pattern Recognition and Image Analysis, volume 4477,
pages 370–377.
Winkler, H.-J. (1996). HMM-based handwritten symbol recognition using on-line and off-
line features. In IEEE Int. Conference on Acoustics, Speech, and Signal Processing,
volume 6, pages 3438–3441.
Yamamoto, R., Sako, S., Nishimoto, T., and Sagayama, S. (2006). On-line recognition
of handwritten mathematical expressions based on stroke-based stochastic context-free
grammar. IEICE Technical Report.
Zanibbi, R., Blostein, D., and Cordy, J. (2002). Recognizing mathematical expressions
using tree transformation. IEEE Trans. on Pattern Analysis and Machine Intelligence,
24(11):1–13.
Zanibbi, R., Mouchère, H., and Viard-Gaudin, C. (2013). Evaluating structural pattern
recognition for handwritten math via primitive label graphs. In Document Recognition
and Retrieval (DRR).
Zanibbi, R., Pillay, A., Mouchère, H., Viard-Gaudin, C., and Blostein, D. (2011). Stroke-
based performance metrics for handwritten mathematical expressions. In International
Conference on Document Analysis and Recognition, pages 334–338.
Zhang, L., Blostein, D., and Zanibbi, R. (2005). Using fuzzy logic to analyze superscript
and subscript relations in handwritten mathematical expressions. In International Con-
ference on Document Analysis and Recognition, volume 2, pages 972–976.
In: Handwriting: Recognition, Development and Analysis ISBN: 978-1-53611-937-4
Editors: Byron L. D. Bezerra et al.
© 2017 Nova Science Publishers, Inc.
Chapter 8
Online Handwriting Recognition of Indian Scripts
Umapada Pal and Nilanjana Bhattacharya
1. Introduction
Handwriting recognition of unconstrained Indian text is difficult because of the presence of
many complex-shaped compound characters as well as the variability in the writing
styles of different individuals. Though there are many published works on Latin, Chinese
or Japanese handwriting recognition, recognition of Indian scripts is yet to be extensively
investigated, and a reliable recognition system with high accuracy is yet to be obtained. A
number of studies (Pal et al., 2012; Jayadevan et al., 2011) have addressed offline
recognition of Indian scripts like Bangla, Devanagari, Gurmukhi, Tamil, Telugu, Oriya, etc.
Only a few works have been reported on online recognition of cursive Indian scripts. In
the following subsection, we present the state of the art of online handwriting recognition
studies of Indian scripts.
For feature extraction, no standard set of features has emerged, and usually each research
group uses its own set of mostly hand-crafted features. This is quite remarkable since in
related fields, such as speech recognition, a standard set of features, e.g., the
Mel-frequency cepstral coefficients (MFCC), is widely accepted (Greenberg et al., 2004).
Some works on online isolated Bangla character recognition are available in Roy et al.
(2007) and Mondal et al. (2010). A few works are available on online Bangla cursive text
recognition. In Bhattacharya U. et al. (2008), online handwritten words were segmented by
estimating the position of the headline of the word. Preprocessing operations such as
smoothing and re-sampling of points were done before feature extraction. They used 77
features considering 9 chain-code directions. A modified quadratic discriminant function
(MQDF) classifier was used for recognition. They did not consider any word involving
compound characters. In Bhattacharya and Pal (2012), a system for segmentation and
recognition of Bangla online handwritten text was described. The authors identified 85
stroke classes used in cursive Bangla handwriting. Cursive words were explicitly segmented
into primitives using an automatic segmentation algorithm. The segmentation procedure made
use of rules discovered by analyzing different joining patterns of Bangla characters. Next,
the primitives obtained from the input word were recognized using 64 chain-code histogram
features and an SVM classifier (Vapnik, 1995; Burges, 1998). A similar method for
segmentation and recognition of Bangla online handwritten text containing compound
characters was described in Bhattacharya N. et al. (2013). In Fink et al. (2010), the
authors divided each stroke of the pre-processed word sample into several sub-strokes using
the angle incurred while writing. The features used were length, angle, width, and
normalized x and y coordinates. An HMM was then used for recognition. A work on holistic
word recognition can be found in Samanta et al. (2014). The authors tried to explicitly
segment the word such that segmentation points represent approximate character boundaries.
They used 24 angles with respect to neighboring points as features, along with some of the
NPen++ features reported in Jaeger et al. (2001). An HMM-based classifier was used,
considering both forward and reverse sequences of features from handwritten words.
In Swethalakshmi (2007), representations of strokes based on spatio-temporal, spectral
and spatio-structural features were proposed for the Devanagari and Tamil scripts. Studies on
stroke classification were performed using support vector machines for the three proposed
representations. A rule-based approach was proposed for recognition of characters from
recognized strokes. Though this work did not handle cursive text, the performance of a
hidden Markov model-based classifier was compared for the case of the spatio-temporal
representation of strokes. The spatio-temporal feature vector contained the x and y
coordinates of the online points. The work in Bharath and Madhvanath (2012) addressed
lexicon-free and lexicon-driven recognition problems for Type-1 (discretely written),
Type-2 (cursively written) and Type-3 (written with delayed strokes) words of the Devanagari
and Tamil scripts. They used the features introduced in the NPen++ recognizer (Jaeger et al.,
2001) and implemented an HMM-based classifier for recognition.
Naz et al. (2014) surveyed the optical character recognition (OCR) and online character
recognition literature with reference to Urdu-like cursive scripts. In particular, the Urdu,
Pushto, and Sindhi languages are discussed, with the emphasis being on the Nasta'liq and
Naskh scripts. The works are analyzed against a typical OCR pipeline, with emphasis on
pre-processing, segmentation, feature extraction, classification, and recognition.
Different systems for handwriting recognition use different features to represent the input
text. Even after decades of research, no consensus on best-practice features exists, and
many features are carefully hand-crafted. To facilitate the design phase for on-line
handwriting systems, Frinken et al. (2014b) proposed an unsupervised feature generation
approach based on dissimilarity space embedding (DSE) of local neighbourhoods around
the points along the trajectory. DSE has a high capability for discriminative representation
and is hence beneficial for classification (Pekalska and Duin, 2005). They compared the
approach with a state-of-the-art feature extraction method and demonstrated its superiority.
The fundamental idea of the feature extraction process, inspired by previously mentioned
works on DSE (Fischer et al., 2010; Pekalska and Duin, 2005; Riesen and Bunke, 2009),
was to first create a fixed set of reference prototypes and then use the list of
dissimilarities of the current element to the reference prototypes as features. To find the
reference prototypes, two unsupervised approaches were investigated: clustering-based and
random prototype selection. While a wide variety of prototype selection strategies are
known, these two approaches show robust performance over different types of datasets (Bunke
and Riesen, 2012). A bidirectional long short-term memory (BLSTM) neural network (Graves
et al., 2009) was used for Bangla word recognition.
We see that the most successful approaches to classifying temporal patterns involve hidden
Markov models (Bishop, 1992; Sun and Jelinek, 1999; Rabiner, 1989) or recurrent neural
networks. HMMs have proved successful at classifying temporal patterns (Cho et al., 1995),
while long-term dependencies within sequences have been successfully addressed using BLSTM
neural networks for handwriting recognition of the Latin script (Graves et al., 2009). In
Frinken et al. (2014a), the authors investigated different encoding schemes for Bangla
compound characters and compared the recognition accuracies. They proposed not to model
complex characters as unique symbols represented by individual nodes in the output layer.
Instead, they exploited the BLSTM NN's capacity for long-distance-dependent classification:
only basic strokes were classified, and special output nodes were used to react to semantic
changes in the writing, i.e., to distinguish inter-character spaces from intra-character
spaces. This was the first work, to the knowledge of the authors, to explore the
applicability of BLSTM neural networks to Bangla words containing compound characters.
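As an illustration of the kind of network discussed above, below is a minimal BLSTM sequence labeller, assuming PyTorch; the feature dimensionality, hidden size, stroke class count, and the two extra "space" output nodes are placeholder assumptions, not the exact configuration of the cited work.

```python
# Sketch of a BLSTM per-frame sequence labeller with two extra output nodes
# for inter-character and intra-character spaces.
import torch
import torch.nn as nn

class BLSTMLabeller(nn.Module):
    def __init__(self, n_features=64, n_hidden=100, n_stroke_classes=85):
        super().__init__()
        self.blstm = nn.LSTM(n_features, n_hidden, batch_first=True,
                             bidirectional=True)
        # +2 nodes: inter-character space and intra-character space
        self.out = nn.Linear(2 * n_hidden, n_stroke_classes + 2)

    def forward(self, x):        # x: (batch, time, n_features)
        h, _ = self.blstm(x)     # h: (batch, time, 2 * n_hidden)
        return self.out(h)       # per-frame class scores

# Usage: score two sequences of 120 feature frames each.
model = BLSTMLabeller()
scores = model(torch.randn(2, 120, 64))   # -> shape (2, 120, 87)
```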
Texts containing crossing out, repeated writing of the same stroke several times, and
over-writing are very common in practice. These three types of writing (crossing out,
repetition and over-writing) can be called "noise". A practical problem in data acquisition
and recognition is the detection and removal of noise from handwriting. The only work on
detection of noisy regions in online words was done in Bhattacharya et al. (2015). The
authors claimed that the method was able to detect noise in both online and offline words.
Different density-based features were proposed to distinguish between "relevant" and
"unwanted" (or noisy) parts of writing, and a two-class HMM-based offline classifier was
used to classify the clean and noisy parts of a word.
In this chapter, we describe an approach for the recognition of Bangla online words.
Bangla is an important script, and it is very difficult to recognize.
2. Properties of Bangla
Bangla is the second most popular language in India and the fifth most popular language
in the world. More than 200 million people speak Bangla, and the Bangla script is also used
for the Assamese and Manipuri languages. The set of basic characters of Bangla consists of
11 vowels and 39 consonants, so there are 50 different shapes in the Bangla basic character
set. The concept of upper/lower case is absent in this script. Figure 1 shows the ideal
(printed) forms of these 50 basic character shapes.
In Bangla, a vowel (except for the first vowel) can take a modified form, which we call a
vowel modifier. The ideal (printed) shapes of the vowel modifiers corresponding to 10
vowels, attached to the basic character KA, are shown in Figure 2. Similarly, consonants
can also take modified forms. Figure 3 shows consonant modifiers with the basic character BA.
A consonant or a vowel following a consonant sometimes takes a compound orthographic
shape, which we call a compound character. Compound characters can be combinations of two
or three characters. Modifiers can be used with compound characters, which may result in
even more complicated shapes. Examples of some Bangla compound character formations are
shown in Figure 4(a). Occasionally, the combination of two basic characters forms a
completely new shape, as shown in the first two rows of Figure 4(a). In the third and fourth
rows of Figure 4(a), one of the constituent characters of the compound character retains its
shape while the other constituent character is reduced in size. In the compound character
shown in the fifth row, the two characters sit side by side, but the size of the first
character is slightly reduced. The compound characters depicted in the sixth and seventh
rows are formed from three basic characters, and the shape of none of the constituent basic
characters can be found in the result. Since compound characters are formed in different
ways and people write them in many styles, it is very difficult to recognize Bangla
compound characters.
The main difficulty of any character recognition system is shape similarity. Because of
handwriting style, two different Bangla characters may look very similar. For example,
Figure 4(b) shows some similar-shaped Bangla compound character pairs. Such shape
similarity makes the recognition system more complex.
Unconstrained Bangla handwriting is usually cursive. In one stroke, the writer can write
a part of a character or one or more characters; a single stroke may contain up to 6
characters and modifiers. Also, in Bangla most of the touching between characters in a word
occurs in the region of the word's headline, or sirorekha (Pal et al., 2012), in contrast to
English handwriting, where the touching occurs in the lower part of the word shape. For
characters forming a compound character, joining occurs between the end part of the first
character (which may be in the upper, middle or lower region of the word) and the beginning
of the next character.
On the other hand, a single character may be written in a variety of ways: in a single
stroke or in more than one stroke. Statistical analysis shows that the number of strokes
used to write a Bangla character ranges from 1 to 6. Hence, online recognition of Bangla
is a difficult task.
Figure 1. Bangla basic characters (vowels in green, consonants in brown) and their
respective codes for future reference.
Figure 2. Vowel modifiers of Bangla and their respective codes with basic character KA.
Figure 3. Consonant modifiers of Bangla and their respective codes with basic character
BA.
Figure 4. (a) Bangla compound character formation from basic characters. (b) Similar
shaped compound characters.
Cursive Bangla handwriting involves various types of stroke connections as well as shape
variations. For the segmentation of strokes, rules are discovered by analyzing the
different joining patterns of Bangla characters. By manually analyzing different strokes
after segmentation, 251 distinct primitive classes are obtained. Directional features of 64
dimensions are extracted to recognize the segmented primitives using an SVM classifier.
Two reference lines of the word are estimated: TOP LINE and BOTTOM LINE. For segmentation
purposes, up and down zones are defined, as depicted in Figure 5(a). The region from the
topmost row of the word to the (TOP LINE + t1) row is the up zone, and the region from the
(TOP LINE + t2) row to the bottommost row of the word is the down zone. Here, t1 = height
of busy zone / 3, t2 = height of busy zone / 2, and the height of the busy zone =
BOTTOM LINE − TOP LINE.
Online points are labeled as up, down or don't-know points according to whether they
belong to the up zone, the down zone or neither. If the pen tip goes from the down zone to
the up zone and then comes back to the down zone, two characters or modifiers may be
touching in the up zone, and hence the stroke should be segmented (Figure 5(b)). Because of
this, for each stroke, stroke movement patterns like "down -> up -> down" are searched for,
i.e., any number of down points followed by any number of up points followed by any number
of down points within the stroke (don't-know points are simply ignored). For such a
pattern, segmentation is done at the highest point of the up-zone portion of the touching.
Such a segmentation point is called a candidate segmentation point. For a
"down -> up -> down" stroke, the lowest point of the first "down" run and the lowest point
of the second "down" run are found; the higher of these two points (nearer to the up
points) is called "HIGHER DOWN".
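The zone labelling and pattern test described above can be summarized in a short sketch; points are assumed to be (row, column) pairs with rows growing downward, as in image coordinates, and all names here are ours rather than the chapter's.

```python
# Sketch of zone labelling and the "down -> up -> down" segmentation test.
def label_points(stroke, top_line, bottom_line):
    busy = bottom_line - top_line
    up_limit = top_line + busy / 3.0     # t1: lower boundary of the up zone
    down_limit = top_line + busy / 2.0   # t2: upper boundary of the down zone
    labels = []
    for row, _col in stroke:
        if row <= up_limit:
            labels.append("up")
        elif row >= down_limit:
            labels.append("down")
        else:
            labels.append("dontknow")    # outside both zones; ignored below
    return labels

def candidate_segmentation_point(stroke, labels):
    """Index of the highest up-zone point if the stroke moves
    down -> up -> down (don't-know points ignored), else None."""
    runs = []                            # consecutive runs of equal labels
    for i, lab in enumerate(labels):
        if lab == "dontknow":
            continue
        if runs and runs[-1][0] == lab:
            runs[-1][1].append(i)
        else:
            runs.append((lab, [i]))
    for a, b, c in zip(runs, runs[1:], runs[2:]):
        if (a[0], b[0], c[0]) == ("down", "up", "down"):
            # highest point = smallest row index among the up-run points
            return min(b[1], key=lambda i: stroke[i][0])
    return None
```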
Figure 5. (a) TOP LINE, BOTTOM LINE, up zone and down zone in a word. (b) Touching
of BA and KA (stroke movement form: up->down->up->down->up).
Now, the candidate points are validated to avoid over-segmentation. Using positional
information and stroke patterns, two levels of validation are performed as follows (a
sketch of these checks follows the list):

a. The end point of a connected stroke should be to the right of the start point of the
stroke, i.e. c(end point) > c(start point), where c(x) denotes the column value of x.
Otherwise, the candidate segmentation point is cancelled.

b. The end point of a connected stroke should be to the right of the previous validated
segmentation point of the stroke, i.e. c(end point) > c(previous segmentation point).
Otherwise, the candidate segmentation point is cancelled.
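A hedged sketch of these two checks, reusing the (row, column) point convention of the previous sketch; prev_seg_col is the column of the previously validated segmentation point, or None if there is none yet.

```python
# Sketch of the two-level validation of a candidate segmentation point.
def validate(stroke, prev_seg_col=None):
    start_col, end_col = stroke[0][1], stroke[-1][1]
    if end_col <= start_col:            # Rule (a): end right of start
        return False
    if prev_seg_col is not None and end_col <= prev_seg_col:
        return False                    # Rule (b): end right of previous point
    return True
```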
Examples of some of the results obtained before and after Level-2 validation are shown in
Figure 6. Different strokes of the input word are depicted in different colors, and the
segmentation points are shown in red on the strokes.
Figure 6. Candidate segmentation points are shown by small solid red squares. (i) Before
applying Rule-(a): E is over-segmented. (ii) After applying Rule-(a). (iii) Before applying
Rule-(b): NGA is over-segmented. (iv) After applying Rule-(b).
Figure 7 (ii) and Figure 7 (iv) show two GAs written by different writers. The left stroke
of the first GA (Figure 7 (ii)) is similar to the right stroke of KA (Figure 7 (i)). Also,
the left stroke of the second GA (Figure 7 (iv)) is similar to the left stroke of SA
(Figure 7 (iii)). Hence, in the ground truth file, their codes are also considered similar.
Next, the stroke classes are analyzed with respect to the segmentation algorithm. There
are 11 additional stroke classes that arise from over-segmentation. If all types of joining
between characters and modifiers are considered, it is found that some characters can be
joined with vowel modifiers like U, UU and R and consonant modifiers like R and RR within a
single stroke. As these modifiers cannot be segmented from the characters, the joined
strokes are considered separate stroke classes. Thus 30 additional primitives are obtained
for GA+UU, DA+U, etc. Some new shapes are obtained for combinations of character and
modifier (for example, HA+U, BHA+RR, etc.).
Now we come to compound characters. As mentioned in the proposed segmentation approach, if
the first character of a compound character ends at its right side and in the upper region
of the word, the compound character will be segmented by the algorithm. Some compound
characters cannot be segmented because the joining occurs in the lower part of the first
character; these compounds are considered new classes. Occasionally, the constituent
characters of a compound character form a completely new shape, for example HA+MA, KA+SSA,
etc. There are 11 such compounds, which are new classes. Some additional classes are also
obtained for the joining of compound characters with modifiers. For 3-character compounds,
segmentation may occur differently depending on the length of each of the three characters;
all the possibilities of segmentation are considered to obtain all possible primitive
classes. Finally, considering all the above cases, a set of 251 distinct primitive classes
is found. Table 1 shows a few examples of primitive classes and the characters in which
these primitives are used.
Figure 7. (i) KA, (ii) GA, (iii) SA, (iv) GA. Right (black) stroke of KA and left (black)
stroke of GA in (ii) are the same. Left (green) stroke of SA and left (black) stroke of GA in
(iv) are the same.
Opposite directions are merged into a single direction code, because the same shape may be
traced with different orders of pen points within the stroke, making the directions exactly
opposite. Thus, for each cell, we get four integer values representing the histograms of
the four direction codes, so 16 × 4 = 64 features are found for each primitive. These
features are normalized by dividing by the maximum value.
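The feature computation can be sketched as follows; the 4 × 4 grid over the primitive's bounding box and the exact mapping of angles to the four direction codes are our assumptions where the text leaves the details open.

```python
# Sketch of the 64-dimensional directional (chain-code histogram) features.
import numpy as np

def directional_features(points):
    pts = np.asarray(points, dtype=float)       # (n, 2) online points
    mins, maxs = pts.min(axis=0), pts.max(axis=0)
    span = np.maximum(maxs - mins, 1e-9)
    hist = np.zeros((4, 4, 4))                  # 4x4 cells, 4 direction codes
    for p, q in zip(pts, pts[1:]):
        dx, dy = q - p
        angle = np.arctan2(dy, dx) % np.pi      # merge opposite directions
        code = int(angle / (np.pi / 4)) % 4     # one of 4 direction codes
        cell = np.minimum(((p - mins) / span * 4).astype(int), 3)
        hist[cell[0], cell[1], code] += 1
    feat = hist.ravel()                         # 16 cells x 4 codes = 64
    return feat / max(feat.max(), 1.0)          # normalize by the maximum
```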
In this experiment, a Support Vector Machine (SVM) classifier is used for primitive
recognition. The SVM is originally defined for two-class problems; it looks for the optimal
hyperplane that maximizes the margin between the nearest examples of the two classes,
called support vectors (SVs). Given a training database of M samples {x_m | m = 1, ..., M},
the linear SVM classifier is defined as

f(x) = Σ_j α_j (x_j · x) + b,

where {x_j} is the set of support vectors and the parameters α_j and b are determined by
solving a quadratic optimization problem (Vapnik, 1995). The linear SVM can be extended to
various non-linear variants (Vapnik, 1995). In this experiment, the Gaussian kernel SVM
outperformed the other non-linear SVM kernels.
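A minimal sketch of this classification setup, assuming scikit-learn; the hyperparameter values here are placeholders rather than the values tuned in the experiment.

```python
# Sketch of primitive classification with a Gaussian (RBF) kernel SVM.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X = np.random.rand(1000, 64)               # 64-dim directional features
y = np.random.randint(0, 251, size=1000)   # 251 primitive classes
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = SVC(kernel="rbf", C=10.0, gamma="scale")   # Gaussian kernel SVM
clf.fit(X_tr, y_tr)
print("test accuracy:", clf.score(X_te, y_te))
```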
A total of 27,344 primitive samples are obtained after segmentation; 50% of these samples
are used for training and the rest for testing. A word is recognized using a table look-up
approach by matching its sequence of recognized primitives. If an exact entry is not found,
the nearest entry is considered.
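The look-up step can be sketched as below, taking edit distance over primitive labels as one plausible notion of "nearest entry"; the chapter does not specify the matching criterion, so this choice is an assumption.

```python
# Sketch of table look-up word recognition over primitive sequences.
def edit_distance(a, b):
    d = [[i + j if 0 in (i, j) else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[len(a)][len(b)]

def lookup_word(primitives, table):
    """table maps tuples of primitive labels to word labels."""
    key = tuple(primitives)
    if key in table:
        return table[key]                                  # exact match
    nearest = min(table, key=lambda k: edit_distance(key, k))
    return table[nearest]                                  # nearest entry
```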
Figure 10. Examples of words that are not segmented correctly (the first two words are
under-segmented, the next two are over-segmented). Arrows indicate the positions where
under-segmentation and over-segmentation have occurred.
Conclusion
Both segmentation and recognition of online Indian scripts are yet to receive full
attention from researchers. Because of the complex nature of character formation and the
presence of many complex-shaped compound characters, handwriting recognition of Indian
scripts is very challenging. This chapter discusses the state of the art of online
handwriting recognition of the main Indian scripts and also presents a work on rigorous
primitive analysis and recognition, taking into account both Bangla (Bengali) basic and
compound characters. We noted that the number of character classes in Bangla is larger than
the number of exhaustive primitive classes. First, a rule-based scheme is used to segment
online handwritten Bangla cursive words into primitives. Primitives are then recognized
using directional features with an SVM classifier, and a word is recognized from its
sequence of primitives. Finally, the results obtained with this method, as well as other
published results, are discussed, and the causes of errors are studied.
References
Bishop C. (1992). Pattern Recognition & Machine Learning. Elsevier BV.
Bhattacharya, N., Frinken, V., Pal, U., and Roy, P. P. (2015). Overwriting repetition and
crossing-out detection in online handwritten text. In 2015 3rd IAPR Asian Conference
on Pattern Recognition (ACPR), pages 680–684. IEEE.
Bhattacharya, N. and Pal, U. (2012). Stroke segmentation and recognition from Bangla online handwritten text. In 2012 International Conference on Frontiers in Handwriting Recognition, pages 736–741. IEEE.
Bhattacharya, N., Pal, U., and Kimura, F. (2013). A system for Bangla online handwritten text. In 2013 12th International Conference on Document Analysis and Recognition, pages 1367–1371. IEEE.
Bhattacharya, U., Nigam, A., Rawat, Y. S., and Parui, S. K. (2008). An analytic scheme for online handwritten Bangla cursive word recognition. In Proceedings of the 2008 10th International Conference on Frontiers in Handwriting Recognition, ICFHR '08, pages 320–325.
Bunke, H. and Riesen, K. (2012). Towards the unification of structural and statistical pattern
recognition. Pattern Recognition Letters, 33(7):811–825.
Burges, C. J. (1998). A tutorial on support vector machines for pattern recognition. Data
Mining and Knowledge Discovery, 2(2):121–167.
Cho, W., Lee, S.-W., and Kim, J. H. (1995). Modeling and recognition of cursive words with hidden Markov models. Pattern Recognition, 28(12):1941–1953.
Fink, G. A., Vajda, S., Bhattacharya, U., Parui, S. K., and Chaudhuri, B. B. (2010). Online Bangla word recognition using sub-stroke level features and hidden Markov models. In 2010 12th International Conference on Frontiers in Handwriting Recognition, pages 393–398. IEEE.
Fischer, A., Riesen, K., and Bunke, H. (2010). Graph similarity features for HMM-based
handwriting recognition in historical documents. In 2010 12th International Conference
on Frontiers in Handwriting Recognition, pages 253–258. IEEE.
Frinken, V., Bhattacharya, N., and Pal, U. (2014). Design of unsupervised feature extraction system for on-line Bangla handwriting recognition. In 2014 11th IAPR International Workshop on Document Analysis Systems, pages 355–359. IEEE.
Frinken, V., Bhattacharya, N., Uchida, S., and Pal, U. (2014). Improved BLSTM Neural Networks for Recognition of On-Line Bangla Complex Words. In IAPR Joint International Workshops on Statistical Techniques in Pattern Recognition + Structural and Syntactic Pattern Recognition, Lecture Notes in Computer Science, pages 404–413. Springer.
Graves, A., Liwicki, M., Fernandez, S., Bertolami, R., Bunke, H., and Schmidhuber, J.
(2009). A novel connectionist system for unconstrained handwriting recognition. IEEE
Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868.
Greenberg, S., Popper, A. N., Ainsworth, W. A., and Fay, R. R. (2004). Speech Processing
in the Auditory System. Springer-Verlag New York Inc.
Jaeger, S., Manke, S., Reichert, J., and Waibel, A. (2001). Online handwriting recognition:
the NPen++ recognizer. International Journal on Document Analysis and Recognition,
3(3):169–180.
Jayadevan, R., Kolhe, S. R., Patil, P. M., and Pal, U. (2011). Offline recognition of Devanagari script: A survey. IEEE Transactions on Systems, Man, and Cybernetics, Part C, 41(6):782–796.
Mondal, T., Bhattacharya, U., Parui, S. K., Das, K., and Mandalapu, D. (2010). On-line handwriting recognition of Indian scripts - the first benchmark. In Proceedings of the 2010 12th International Conference on Frontiers in Handwriting Recognition, ICFHR '10, pages 200–205, Washington, DC, USA. IEEE.
Naz, S., Hayat, K., Razzak, M. I., Anwar, M. W., Madani, S. A., and Khan, S. U. (2014). The optical character recognition of Urdu-like cursive scripts. Pattern Recognition, 47(3):1229–1248.
Pal, U., Jayadevan, R., and Sharma, N. (2012). Handwriting recognition in Indian regional scripts: A survey of offline techniques. ACM Transactions on Asian Language Information Processing, 11(1):1–35.
Pekalska, E. and Duin, R. P. W. (2005). The Dissimilarity Representation for Pattern Recog-
nition - Foundations and Applications. World Scientific Publishing Co. Pte. Ltd.
Riesen, K. and Bunke, H. (2009). Graph classification based on vector space embedding.
International Journal of Pattern Recognition and Artificial Intelligence, 23(06):1053–
1081.
Roy, K., Sharma, N., Pal, T., and Pal, U. (2007). Online Bangla handwriting recognition system. In International Conference on Advances in Pattern Recognition, pages 121–126.
Samanta, O., Bhattacharya, U., and Parui, S. (2014). Smoothing of HMM parameters for
efficient recognition of online handwriting. Pattern Recognition, 47(11):3614–3629.
Sun, D. X. and Jelinek, F. (1999). Statistical methods for speech recognition. Journal of
the American Statistical Association, 94(446):650.
Chapter 9
1. Introduction
Ancient manuscripts record many pieces of important knowledge about the history of world
civilizations. In Southeast Asia, most ancient manuscripts are written on palm leaves.
Ancient palm leaf manuscripts are among the most valuable cultural heritage items, storing
various forms of knowledge and historical records of social life in Southeast Asia. Many
palm leaf manuscripts contain information on important issues such as medicines and village
regulations that were used as daily guidance. They attract historians, philologists and
archaeologists seeking to discover more about ancient ways of life. The ancient palm leaf
manuscripts of Southeast Asia are very important both in terms of quantity and in the
variety of their historical content.
For example, in Bali, Indonesia, the island's literary works were mostly recorded on
dried and treated palm leaves (Figure 1). The dried and treated palm leaf manuscripts in
Bali are called lontar. A lontar is inscribed on a dried palm leaf with a special
knife-like tool called a pengerupak. It is made of iron, with its tip sharpened in a
triangular shape so it can make both thick and thin inscriptions. The manuscripts were then
scrubbed with natural dyes to leave a black color on the scratched parts as text (Figure 2).
The writings were incised on one (or both) sides of the leaf, and
the script is then blackened with soot. The leaves are held and linked together by a string
that passes through the central holes and is knotted at the outer ends.
The Balinese palm leaf manuscripts were written in the Balinese script and the Balinese
language, or as ancient literary texts composed in the Old Javanese language of Kawi and in
Sanskrit. The content of lontar ranges from ordinary texts to Bali's most sacred writings
(Figure 3). Many of those epics are based on the famous Indian epics of the Ramayana and
Mahabharata. They include texts on religion, holy formulae, rituals, family genealogies,
law codes, treatises on medicine (usadha), arts and architecture, calendars, prose, poems
and even magic.
Unfortunately, the majority of Balinese people have never read a lontar, because of
language obstacles as well as a tradition that perceives handling them as sacrilege. There
is only limited access to the content of the manuscripts, because of the linguistic
difficulties and the fragility of the documents. Balinese script is considered one of the
most complex scripts of Southeast Asia. The alphabet of the Balinese script is composed of
±100 character classes, including consonants, vowels, diacritics and some other special
compound characters. In Balinese manuscripts, there is no space between the words in a text
line, and some characters are written above or below the baseline of the text line.
Figure 1. Palm tree (left), the dried and treated palm leaves (right).
The physical condition of the natural palm leaf material certainly cannot withstand time.
Usually, palm leaf manuscripts are of poor quality, since the documents have degraded over
time due to storage conditions. Many discovered lontars are now part of the collections of
museums and private families. They are in a state of disrepair due to age and inadequate
storage conditions. Equipment that can protect palm leaves and prevent rapid deterioration
is still relatively scarce. Therefore, digitization and indexing projects for palm leaf
manuscripts were proposed (Kesiman et al., 2015a,b, 2016b; Burie et al., 2016; Kesiman et
al., 2016a,c, 2017).
In the last five years, the collections of palm leaf manuscripts in Southeast Asia have
attracted the attention of researchers in document image analysis. For example, digitization
projects have been carried out for palm leaf manuscripts from Indonesia (Kesiman et al.,
2015a,b, 2016b; Burie et al., 2016; Kesiman et al., 2016a,c, 2017), under the scheme of the
AMADI (Ancient Manuscripts Digitization and Indexation) Project, from Cambodia
(http://www.khmermanuscripts.org/) and from Thailand (Chamchong et al., 2010; Fung and
Chamchong, 2010). The AMADI Project works not only to digitize the palm leaf manuscripts,
but also to develop an automatic analysis, transcription and indexing system for the
manuscripts. Our objectives are to bring added value to digitized palm leaf manuscripts by
developing tools to analyze, index and quickly and efficiently access their content, and to
make palm leaf manuscripts more accessible, readable and understandable to a wider
audience, including scholars and students all over the world. Due to the specific
characteristics of the physical support of the manuscripts, the development of document
analysis methods for palm leaf manuscripts in order to extract relevant information is
considered a new research problem in handwritten document analysis (Kesiman et al.,
2015a,b, 2016b; Burie et al., 2016; Kesiman et al., 2016a,c, 2017; Chamchong et al., 2010;
Chamchong and Fung, 2011, 2012). It ranges widely from binarization (Kesiman et al.,
2015a,b; Burie et al., 2016) and text line segmentation (Kesiman et al., 2017) to character
and text recognition (Burie et al., 2016; Kesiman et al., 2016c) and word spotting methods.
Ancient palm leaf manuscripts contain artefacts due to aging, foxing, yellowing, marks of
strain, local shading effects, low-intensity variations or poor contrast, random noise,
discolouration and fading (Figure 4). Written on a dried palm leaf with a sharp, knife-like
pen and colored with natural dyes, the text is hard to separate from the background in the
binarization process.
In the OCR task, several deformations of the character shapes are visible, due to merges
and fractures and to the use of nonstandard character forms. The similarity of distinct
character shapes and the overlaps and interconnections of neighboring characters further
complicate the OCR problem (Arica and Yarman-Vural, 2002) (Figure 5). One of the main
problems faced when dealing with segmented handwritten character recognition is the
ambiguity and illegibility of the characters (Blumenstein et al., 2003). These
characteristics provide a suitable condition to test and evaluate the robustness of feature
extraction methods that have already been proposed for character recognition. A character
recognition system will help to transcribe these ancient documents and translate them into
current language, giving access to the important information and knowledge in palm leaf
manuscripts. An OCR system is thus one of the most needed systems to be developed for
collections of palm leaf manuscript images.
This chapter is organized as follows: the following section describes the binarization of
palm leaf manuscript images, the construction of ground truth binarized images, and the
analysis of ground truth binarized image variability. The section "Isolated Character
Recognition" presents some of the most commonly used feature extraction methods and
describes our proposed combination of features for isolated character recognition. A
segmentation-free and training-free word spotting method for our palm leaf manuscript
images is presented in the section "Word Spotting". The palm leaf manuscript image dataset
used in our experiments and the experimental results are presented in the sections "Corpus
and Dataset" and "Experiments", respectively. Conclusions, with some prospects for future
work, are given in the last section.
Figure 6. Original image (upper left) and binarized images (top to bottom, left to right)
(Kesiman et al., 2015a) using the methods of Otsu (Pratikakis et al., 2013; Messaoud et al.,
2011), Niblack (Khurshid et al., 2009; Rais et al., 2004; Gupta et al., 2007; He et al., 2005;
Feng and Tan, 2004), Sauvola (Sauvola and Pietikäinen, 2000), Wolf (Khurshid et al., 2009;
Rais et al., 2004), Rais (Rais et al., 2004), NICK (Khurshid et al., 2009), and Howe (Howe,
2013).
Since no ground truth binarized images of palm leaf manuscripts are available yet, such
ground truth images have to be created in order to quantitatively measure and compare the
performance of the binarization methods. Therefore, in order to evaluate and select an
optimal binarization method, creating new ground truth binarized images of palm leaf
manuscripts is a necessary step (Kesiman et al., 2015a).
Manual creation of ground truth binarized images (e.g. with the PixLabeler application
(Saund et al., 2009)) is a time-consuming task. Therefore, several semi-automatic
frameworks for the construction of ground truth binarized images have been presented
(Ntirogiannis et al., 2013, 2008; Nafchi et al., 2013; Bal et al., 2008) to reduce the time
of the ground-truthing process. Human intervention is required only for some necessary but
limited tasks. Previous works on the construction of ground truth binarized images were
mainly based on the method used for the DIBCO competition series (Pratikakis et al., 2013;
Gatos et al., 2011). Whether a specific scheme that adapts to palm leaf manuscripts and
performs better in constructing their ground truth binarized images is needed should be
analyzed, in order to achieve better ground truth for low-quality palm leaf manuscripts.
For the DIBCO competition series (Pratikakis et al., 2013), the ground truth binarized
images are constructed using a semi-automatic procedure described in (Ntirogiannis et al.,
2013). This procedure has been adapted and improved by other works on the construction of
ground truth binarized images. For instance, in (Messaoud et al., 2011), a similar method
is used to create the ground truth of a large document database. In (Nafchi et al., 2013),
in order to save the time an expert spends in the manual modification process, two phase
congruency features are used to pre-process Persian heritage images and generate a rough
initial binarized image. In (Bal et al., 2008), the ground truth binarized image of a
machine-printed document is constructed by segmenting and clustering the characters during
the foreground enhancement step; the user can manually add and remove character model
assignments to degraded character instances. Unfortunately, it is impossible to validate
that a ground truth construction methodology creates a perfect ground truth image from a
real image: the ground truth images are normally accepted based on visual observation.
The construction of ground truth binarized images proposed in (Ntirogiannis et al., 2008)
consists of several steps: an initial binarization process, skeletonization of the
characters, manual correction of the skeleton, and a second skeletonization after the
manual correction process. The estimated ground truth image is then constructed by dilating
the corrected skeleton image, constrained by the character edges (detected using the Canny
algorithm (Canny, 1986)) and the binarized image under evaluation. The skeleton is dilated
until half of the Canny edges intersect each binarized component. The detailed algorithm in
pseudocode can be found in (Ntirogiannis et al., 2008). In this method, poor quality of the
initial binarized image directly affects the estimated ground truth: the constructed ground
truth image strongly depends on the binarized image used as a constraint during the
dilation process of the skeleton. The ground truth binarized images used for the DIBCO
competition series are constructed with a modified procedure (Ntirogiannis et al., 2013),
as illustrated in Figure 7. In this procedure, the conditional dilation step of the
skeleton is constrained only by the Canny edge image, without any initial binarized image.
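A rough sketch of this conditional dilation, assuming OpenCV and 0/255 binary images; the per-component stopping test is simplified here to a single global ratio over the Canny edge pixels, so this is an approximation of the cited procedure, not a reimplementation.

```python
# Sketch of skeleton dilation constrained by a Canny edge image.
import cv2
import numpy as np

def estimate_ground_truth(skeleton, edges, ratio=0.5, max_iter=20):
    """Dilate the corrected skeleton until enough edge pixels are reached."""
    kernel = np.ones((3, 3), np.uint8)
    gt = skeleton.copy()
    for _ in range(max_iter):
        hits = np.count_nonzero(gt & edges)
        if hits >= ratio * np.count_nonzero(edges):
            break                        # enough Canny edges intersected
        gt = cv2.dilate(gt, kernel, iterations=1)
    return gt
```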
Based on preliminary experiments, a good initial binarized image is needed as the input to
the next step of ground truth creation (Kesiman et al., 2015a). The initial binarization
method used in the construction of the skeletonized ground truth image should be able to
generate an optimal and acceptable, 'good enough' skeleton that detects and keeps the form
of the characters. The skeleton image generated in this step will facilitate the manual
correction process: the more correct the skeleton, the easier and faster the manual
correction. For a nondegraded palm leaf manuscript, a simple global thresholding
binarization method is sufficient to generate an acceptable binarized image and an optimal
skeleton image. However, this method is not suited to degraded palm leaf manuscripts.
Figure 8 shows some examples of the skeletonized images generated with the Matlab standard
function bwmorph (http://fr.mathworks.com/help/images/ref/bwmorph.html) from different
binarized images obtained with different binarization methods. Influenced by the dried palm
leaf texture, the strokes of the characters in palm leaf manuscripts are thickened and
widened. As a consequence, many small, short, useless branches are generated on the
skeleton. Because of the poor quality of the binarized and skeletonized images, the manual
correction of the skeleton is very time-consuming: it takes almost 8 hours for a single
palm leaf manuscript image. Therefore, in the case of degraded and low-quality palm leaf
manuscript images, the study focused on the development of an initial binarization process
for the construction of ground truth binarized images. Another important remark: superposing
the skeleton image on the original image to guide the manual correction process is not
enough. A priori knowledge of the form of the ancient characters is mandatory to guarantee
that an incomplete character skeleton can be completed with a natural trace, following the
way the characters were originally written. The manual correction process should be done by
a philologist, or at least by a person who knows well how to write the ancient characters,
with a guide for the transcription of the manuscript provided by a philologist.
In order to overcome the binarization problem for degraded and low-quality palm leaf
manuscript images, the study of (Kesiman et al., 2015a) proposed a 'semi-local' concept.
The idea of this method is to apply a powerful global binarization method only on a precise
local character area. The binarization scheme consists of several steps, as illustrated in
Figure 9. First, edge detection with the Prewitt operator is applied to get the initial
surrounding area of the line strokes of each character. Based on our visual observations,
Prewitt gives a high edge response on the inner part of the characters and provides a good
approximate area for the skeleton, whereas Canny gives a high edge response on the outer
side of the text strokes and detects the textural part of the palm leaf background
over-sensitively. The grayscale edge image is then binarized with Otsu's method to get the
first binarized image of the palm leaf manuscript. A median filter is then applied to this
binarized image in order to remove noise. After noise reduction, some characters might be
affected and broken, so a dilation process is applied to recover and reform the broken
parts of the characters. The method then constructs the approximated character area using
the Run Length Smearing (RLS) method (Wahl et al., 1982). The smearing must be done
optimally, so that the missing/undetected character areas are detected completely: row-wise
RLS covers the missing areas along horizontal character strokes, while column-wise RLS
covers the missing areas along vertical character strokes. The output of these steps is a
binarized image with the approximated character area in black and the background area in
white. The next step is the main concept of this scheme: Otsu's binarization method is
applied a second time, but locally, only within a limited character area defined by each
connected component of the first binarized image (Figure 10).
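The whole semi-local scheme can be roughly sketched as follows, assuming OpenCV and NumPy with black text on a white background; the parameter values follow the empirical settings reported later in this chapter, but details such as the Prewitt kernels, the RLS implementation, and the per-component bounding boxes are our own simplifications.

```python
# Rough sketch of the 'semi-local' binarization scheme.
import cv2
import numpy as np

def run_length_smear(binary, max_gap, axis):
    """Fill short white gaps between black pixels along rows or columns."""
    out = binary.copy()
    lines = out if axis == 0 else out.T          # transpose is a view
    for line in lines:
        black = np.flatnonzero(line == 0)
        for a, b in zip(black, black[1:]):
            if 0 < b - a <= max_gap:
                line[a:b] = 0
    return out

def semi_local_binarize(gray):
    # 1. Prewitt edges approximate the inner area of the character strokes.
    kx = np.array([[1, 0, -1]] * 3, dtype=np.float32)
    g = gray.astype(np.float32)
    edges = np.abs(cv2.filter2D(g, -1, kx)) + np.abs(cv2.filter2D(g, -1, kx.T))
    edges = cv2.normalize(edges, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # 2. Global Otsu on the edge image, then median filter and dilation.
    _, bw = cv2.threshold(edges, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
    bw = cv2.medianBlur(bw, 3)
    bw = cv2.erode(bw, np.ones((3, 3), np.uint8))   # erosion grows black text
    # 3. Row- and column-wise RLS to complete the character area.
    bw = run_length_smear(bw, 3, axis=0)
    bw = run_length_smear(bw, 3, axis=1)
    # 4. Local Otsu inside (the bounding box of) each connected component.
    result = np.full_like(gray, 255)
    n, labels = cv2.connectedComponents((bw == 0).astype(np.uint8))
    for c in range(1, n):
        ys, xs = np.where(labels == c)
        y0, y1, x0, x1 = ys.min(), ys.max() + 1, xs.min(), xs.max() + 1
        _, local = cv2.threshold(gray[y0:y1, x0:x1], 0, 255,
                                 cv2.THRESH_BINARY + cv2.THRESH_OTSU)
        result[y0:y1, x0:x1] = local
    return result
```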
After the initial binarization process, the method finally performs a morphological
thinning method to obtain the skeleton of the characters. The thinned image normally still
has unwanted branches, so a morphological pruning method is applied to the thinned
character image. Pruning the skeleton is effective at removing spurious, unwanted parts,
and it makes the manual correction of the skeleton faster. Figure 11 shows a sample image
sequence resulting from our specific scheme.
Figure 7. Ground truth construction procedure used for DIBCO series (Ntirogiannis et al.,
2013).
Figure 8. Examples of skeleton images (left to right and top to bottom) (Kesiman et al.,
2015a) generated from binarized image of Otsu (Pratikakis et al., 2013; Messaoud et al.,
2011), Niblack (Khurshid et al., 2009; Rais et al., 2004; Gupta et al., 2007; He et al., 2005;
Feng and Tan, 2004), Rais (Rais et al., 2004), and NICK (Khurshid et al., 2009).
The quality of the results can only be estimated qualitatively, by visual examination.
Based on visual criteria, the proposed scheme provides a good initial skeleton image with
respect to image quality and the preservation of meaningful textual character information.
We experimentally tested the framework for the construction of ground truth binarized
images on nondegraded and degraded low-quality palm leaf manuscript images (Kesiman et al.,
2015a). For this initial experimental study, we used only the available sample scanned
images from Museum Bali, Museum Gedong Kertya, and a private family collection. The
manuscripts were written on both sides, but no back-to-front interference was observed.
Figure 10. Examples of extracted character area (on the left) and their semi-local binariza-
tion result (on the right) (Kesiman et al., 2015a).
Figure 11. Original sample image, and sequence sample image of Prewitt, Otsu, Median
Filter, Dilation, RLS Row, RLS Col, Local Otsu, Thinning, Pruning, Superposed Skeleton
on Original Image (Kesiman et al., 2015a).
For nondegraded palm leaf manuscripts, we used the simplest and most conventional global
thresholding method, with a proper threshold selected manually, to obtain the initial
binarized image. This initial binarized image is already sufficient to obtain an acceptable
skeletonized image. We performed the manual correction of the skeleton, guided by the
transcription of the manuscript provided by a philologist, to finally obtain the skeleton
ground truth of the manuscript. Figure 12 shows a snapshot of a simple prototype with a
user-friendly interface that we developed and used to facilitate the manual correction
process. We finally constructed the ground truth image by dilating the corrected skeleton
image, constrained by the Canny edge image and the initial binarized image from Otsu's
global method. We use Otsu's global method, instead of the same global fixed thresholding
method used in our skeleton ground truth construction, because we need complete connected
components for all characters detected in the binarized image. Other binarization methods
can also be used, for example Niblack's method or the multi-resolution version of Otsu's
method (Gupta et al., 2007); they also provide a satisfactory preliminary binarized image.
Figure 13 shows an example of a final ground truth image from a nondegraded palm leaf
manuscript image. It is visually an acceptable estimated ground truth image for the
manuscript.
Figure 12. Snapshot of prototype interface used for manual correction of skeleton (Kesiman
et al., 2015a).
Figure 13. Estimated ground truth of a nondegraded palm leaf manuscript image (Kesiman
et al., 2015a).
For degraded low-quality palm leaf manuscript images, we applied our proposed specific
binarization scheme, defining the optimal parameter values based on our empirical
experiments as follows: a 3×3 filter for the median filter, a 3×3 square structuring
element for the dilation, smearing of 3 pixels in rows and 3 pixels in columns for the RLS
method, and pruning of branches up to 2 pixels. We performed the manual correction of the
skeleton, guided by the transcription of the manuscript provided by a philologist, to
obtain the skeleton ground truth image of the manuscript. Figure 14 shows an example of a
low-quality palm leaf manuscript and its skeleton ground truth image. We first experimented
with the construction of the estimated ground truth image by applying the constraints of a
Canny edge image and an initial binarized image; for example, we used the binarized image
from Niblack's method or from the multi-resolution version of Otsu's method as the
constraint. The estimated ground truth image strongly depends on the initial binarized
image used as a constraint. We then experimented with the construction of the ground truth
image without any initial binarized image as a constraint. The result is shown in
Figure 15. Based on visual criteria, the proposed algorithm seems to achieve a better
estimated ground truth image with respect to image quality and the preservation of
meaningful textual character information. Some other results of ground truth binarized
images for degraded low-quality palm leaf manuscript images are shown in Figure 16.
Figure 14. Original Image and the skeleton ground truth (Kesiman et al., 2015a).
Figure 15. Ground truth image constructed with an initial binarized image of Niblack’s
method, Multi Resolution Otsu’s method, and without any constraint of initial binarized
image (Kesiman et al., 2015a).
The evaluation of binarization methods depends on the choice of ground truth. In this
section, we present an experiment in real conditions to analyze the subjectivity of human
intervention in the construction of ground truth binarized images and to quantitatively
measure the ground truth variability of palm leaf manuscript images with different
binarization evaluation metrics (Kesiman et al., 2015b). This experiment measures the
difference between two ground truth binarized images produced by two different ground
truthers.
The sample images used in this experiment are 47 images randomly selected from the palm
leaf manuscript corpus of the AMADI Project (see the section "The palm leaf manuscript
corpus and the digitization process"). In this experiment, we adopted the semi-automatic
framework for the construction of ground truth binarized images described in the section
"The construction of ground truth binarized images". However, in order to measure the
variability of human subjectivity in our ground truth creation, we did not apply any
initial binarization and skeletonization methods in this experiment: the skeletonization
process is performed entirely by humans. The skeleton drawn manually by a user is dilated
until the Canny edges intersect each binarized component of the dilated skeleton at a ratio
of 0.1. This minimal ratio between the number of pixels in the intersection with the Canny
edge image and the number of pixels of the dilated skeleton was found empirically, based on
observation of the thickness of the character strokes in our manuscripts.
As presented in (Smith, 2010), three binarization evaluation metrics proposed for the DIBCO
2009 contest (Gatos et al., 2011) are used in this analysis to measure the difference
between the two ground truth binarized images from two different ground truthers. Those
three metrics are F-Measure (FM), Peak SNR (PSNR), and the Negative Rate Metric (NRM)
(Kesiman et al., 2015b). FM and PSNR are symmetric: their values are the same whether the
image drawn by the first ground truther or the image drawn by the second ground truther is
taken as the ground truth. NRM, however, is not symmetric, so we calculated two values,
NRM1 and NRM2, by swapping the roles of the two images. Higher F-Measure and PSNR values
indicate a better match, while a lower NRM indicates a better match.
For this experiment, 70 students were asked to manually trace the skeleton of the Balinese
characters found in palm leaf manuscript images with the PixLabeler tool (Saund et al.,
2009). Each student worked on two different images, and each image was ground truthed
by two different students. The two manually skeletonized images are re-skeletonized with
the Matlab function bwmorph3 to make sure that the skeleton is one pixel wide for the
next step, the automatic ground truth estimation with conditional dilation under the Canny
edge constraint. Figure 17 shows the scheme diagram of our experiment. Figure 18 shows
some sample images resulting from this experiment.
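A rough open-source equivalent of this re-skeletonization step, using scikit-image in place of Matlab's bwmorph (the placeholder image stands in for a user's traced strokes):

```python
import numpy as np
from skimage.morphology import skeletonize

# Placeholder for the binary image of a user's manually traced strokes.
manual_skeleton = np.zeros((50, 50), dtype=bool)
manual_skeleton[24:27, 10:40] = True   # a 3-pixel-thick horizontal stroke

one_px = skeletonize(manual_skeleton)  # thinned to a one-pixel-wide skeleton
```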
By visually comparing the two skeletonized images created by the two ground truthers, we
can see how differently they chose the trace of the character skeleton. The broken parts in
the intersection image of the two skeletonized images reveal where the skeleton traces
differ, and the double-lined parts in the union image show how far apart the skeleton
positions traced by the two ground truthers are.
3
http://fr.mathworks.com/help/images/ref/bwmorph.html.
Figure 16. Two palm leaf manuscript images with their ground truth binarized images
(Kesiman et al., 2015a).
First, we measured the variability between the two skeletonized ground truth images
manually drawn by the two ground truthers (Table 1) (Kesiman et al., 2015b). The wide
range between the maximum and minimum values, together with the mean and variance of
all three binarization evaluation metrics over the 47 images, shows that there is a large
variability between the ground truthers for each image.
We then measured the variability between the two ground truth binarized images
automatically estimated from the two different manually skeletonized images of each
manuscript image. Table 2 illustrates this variability (Kesiman et al., 2015b). The wide
range between the maximum and minimum values, together with the mean and variance of
all three binarization evaluation metrics, shows that there is still a large variability between
the estimated ground truth images for each image.
Table 2. Variability between two ground truth images automatically estimated from
two different manually skeletonized images (Kesiman et al., 2015b)
By comparing the values of the binarization evaluation metrics between the two manually
skeletonized ground truth images (Table 1) and between the two automatically estimated
ground truth images (Table 2), we can see that the variability between the two ground truth
images in F-Measure and NRM decreases for all images after the ground truth estimation
process. The PSNR value decreases because the number of differing foreground-background
pixels between the two estimated ground truth images also increases after the automatic
estimation process, not only the number of common foreground pixels of the two estimated
ground truth images. Figures 19 to 22 show that the ground truth estimation process tends
to decrease the variability between the two ground truthers and to produce a better match
between the two ground truth images.
We also estimated the ground truth binarized image from the union of the two skeleton
images manually drawn by the two ground truthers (see the example in Figure 18(e)). The
variability between this estimated union ground truth image and the two other ground truth
images estimated from each ground truther was then measured. Table 3 and Table 4 show
the comparison metrics for all images in the experiment (Kesiman et al., 2015b). The
ground truth image estimated from the union of the two skeleton images shows a better
match with the two other ground truth images from the two different ground truthers.
Table 3. Variability between the ground truth image estimated from the union of the
two skeleton images and the ground truth image estimated from the first ground
truther (Kesiman et al., 2015b)
Table 4. Variability between the ground truth image estimated from the union of the
two skeleton images and the ground truth image estimated from the second ground
truther (Kesiman et al., 2015b)
Based on a survey of all ground truthers after the experiment, we made the following
observations on the ground truth creation of palm leaf manuscripts. The Balinese alphabet
found in the manuscripts is not used daily by the ground truthers. Most of them learned
these characters from elementary school until junior or senior high school, but never reused
them after the classroom learning process. Some characters of the alphabet they had never
seen before; for those characters, the ground truthers could not make a smooth and natural
trace of the character skeleton. Given the variability of the ground truth images produced
in this experiment, we suggest that this condition should always be taken into account in
every ground truthing process of an ancient manuscript project. The time needed to
semi-manually correct the skeleton produced by an initial automatic method can be much
greater than drawing the skeleton entirely by hand from scratch. In our first trial experiment,
we needed 4 to 6 hours to correct a semi-automatically generated skeleton, because the
physical characteristics of the manuscripts prevent the binarization and skeletonization
methods from producing a good skeleton of the characters. We finally decided to make the
process fully manual, which takes between 2 and 3 hours to trace a skeleton from scratch.
Figure 18. Example of ground truth binarized image from the experiment: (a) original
image, (b) skeletonized image by 1st ground truther, (c) skeletonized image by 2nd ground
truther, (d) image intersection between (b) and (c), (e) image union between (b) and (c),
(f) estimated ground truth binarized image from (b), (g) estimated ground truth binarized
image from (c), (h) image intersection between (f) and (g), (i) image union between (f) and
(g) (Kesiman et al., 2015b).
Figure 19. Comparison of F-Measure between the two skeletonized ground truth images
and between the two estimated ground truth images (Kesiman et al., 2015b).
The results of this experiment show that human subjectivity has a great effect, producing
a large variability in ground truth binarized images. This phenomenon becomes much more
visible when working on the binarization of ancient documents or manuscripts whose
physical characteristics and condition are poor enough that they are hard to ground truth
even for humans. The method of evaluating binarization by pixel-by-pixel comparison with
a ground truth binarized image should be re-evaluated to avoid the large bias introduced by
human subjectivity, and other measures should be proposed to evaluate the binarization of
ancient manuscript document images.
Figure 20. Comparison of NRM1 between the two skeletonized ground truth images and
between the two estimated ground truth images (Kesiman et al., 2015b).
Figure 21. Comparison of NRM2 between the two skeletonized ground truth images and
between the two estimated ground truth images (Kesiman et al., 2015b).
Figure 22. Comparison of PSNR between the two skeletonized ground truth images and
between the two estimated ground truth images (Kesiman et al., 2015b).
An IHCR system is one of the most needed systems for the collection of palm leaf
manuscript images: it will help to transcribe these ancient documents and translate them
into the current language. An IHCR system usually consists of two main steps: feature
extraction and classification. The performance of an IHCR system greatly depends on the
feature extraction step, whose goal is to extract from raw data the information most suitable
for classification purposes (Aggarwal et al., 2015). Many feature extraction methods have
been proposed to perform the character recognition task (Arica and Yarman-Vural, 2002;
Blumenstein et al., 2003; Aggarwal et al., 2015; Kumar, 2010; Bokser, 1992; Hossain et al.,
2012; Fujisawa et al., 1999; Jin et al., 2009; Rani and Meena, 2011). These methods have
been successfully implemented and evaluated for the recognition of Latin, Chinese and
Japanese characters, as well as for digit recognition. However, only a few systems are
available in the literature for the recognition of other Asian scripts, for example for
Devanagari script (Kumar, 2010; Ramteke, 2010), Gurmukhi script (Aggarwal et al., 2015;
Lehal and Singh, 2000; Sharma and Jhajj, 2010; Siddharth et al., 2011), Bangla script
(Hossain et al., 2012), and Malayalam script (Ashlin Deepa and Rajeswara Rao, 2014).
Documents in different scripts and languages pose new research problems, not only because
of the different character shapes but also because the writing style differs for each script:
the shape of the characters, the character positions, and the separation or connection
between the characters in a text line.
Each feature extraction method has its own advantages and disadvantages over other
methods, and each method may be specifically designed for a specific problem. Most
feature extraction methods extract information from binary or grayscale images (Kumar,
2010). Several surveys and reviews on feature extraction methods for character recognition
have already been published (Trier et al., 1996; Kumar, 2011; Neha J. Pithadia, 2015; Pal
et al., 2012; Pal and Chaudhuri, 2004; Govindan and Shivaprasad, 1990). Choosing efficient
and robust feature extraction methods plays a very important role in achieving high
recognition performance in IHCR and OCR systems (Aggarwal et al., 2015). The
performance of the system depends on proper feature extraction and correct classifier
selection (Hossain et al., 2012). It has been reported experimentally that combining multiple
features is recommended to improve the performance of an IHCR system (Trier et al.,
1996). Our objective is to find a combination of feature extraction methods to recognize
the isolated characters of Balinese script on palm leaf manuscript images.
In this work, we first investigated and evaluated some of the most commonly used features
for character recognition: histogram projection (Kumar, 2010; Hossain et al., 2012; Ash-
lin Deepa and Rajeswara Rao, 2014), celled projection (Hossain et al., 2012), distance
profile (Bokser, 1992; Ashlin Deepa and Rajeswara Rao, 2014), crossing (Kumar, 2010;
Hossain et al., 2012), zoning (Blumenstein et al., 2003; Kumar, 2010; Bokser, 1992; Ash-
lin Deepa and Rajeswara Rao, 2014), moments (Ramteke, 2010; Ashlin Deepa and Ra-
jeswara Rao, 2014), some directional gradient based features (Aggarwal et al., 2015; Fu-
jisawa et al., 1999), Kirsch Directional Edges (Kumar, 2010), and Neighborhood Pixels
Weights (NPW) (Kumar, 2010). Secondly, based on our preliminary experimental results,
we proposed and evaluated a combination of NPW features applied on Kirsch directional
edge images with Histogram of Gradient (HoG) features and the zoning method. Two
classifiers, k-NN (k-Nearest Neighbor) and SVM (Support Vector Machine), are used in our
experiments. This section only briefly describes the feature extraction methods used in our
proposed combination of features. For a more detailed description of the other commonly
used feature extraction methods also evaluated in this experimental study, please refer to
the references mentioned above.
Each directional edge image is thresholded to produce a binary edge image. The binary
edge image is then partitioned into N smaller regions, and the edge pixel frequency in each
region is computed to produce the feature vector. In our experiments, we computed the
Kirsch feature from the grayscale image with 25 regions per directional edge image,
producing a 100-dimensional feature vector. Based on empirical tests on our dataset, the
Kirsch edge images can be optimally thresholded with a threshold value of 128. The feature
values are then normalized by the maximum edge pixel frequency over all regions.
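A minimal sketch of this feature, using four of the eight classical Kirsch compass masks; the exact directional definition in (Kumar, 2010) may differ slightly, and the function name is ours.

```python
import numpy as np
from scipy.ndimage import convolve

# Four of the eight classical Kirsch compass masks (sketch, not the
# authors' exact choice of directions).
MASKS = [
    np.array([[-3, -3, 5], [-3, 0, 5], [-3, -3, 5]]),   # East
    np.array([[-3, 5, 5], [-3, 0, 5], [-3, -3, -3]]),   # North-East
    np.array([[5, 5, 5], [-3, 0, -3], [-3, -3, -3]]),   # North
    np.array([[5, 5, -3], [5, 0, -3], [-3, -3, -3]]),   # North-West
]

def kirsch_features(gray, grid=5, thresh=128):
    feats = []
    for mask in MASKS:
        edge = np.abs(convolve(gray.astype(np.int32), mask)) > thresh
        h, w = edge.shape
        # Edge pixel frequency in each of the grid x grid regions.
        feats += [edge[i*h//grid:(i+1)*h//grid, j*w//grid:(j+1)*w//grid].sum()
                  for i in range(grid) for j in range(grid)]
    feats = np.asarray(feats, dtype=float)
    return feats / max(feats.max(), 1.0)   # normalise by maximum frequency
```

With 4 directions and 25 regions each, this yields the 100-dimensional vector described above.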
Figure 24. Neighborhood pixels for NPW features (Kesiman et al., 2016c).
The HoG feature builds a histogram of the gradient by assigning the gradient direction of
each pixel to an orientation bin, the bins being evenly spread over 0 to 180 degrees or 0 to
360 degrees (Figures 25 and 26). The histogram cells are then normalized over larger
overlapping blocks. The final HoG descriptor is generated by concatenating all histogram
vectors after the block normalization process. For our experiments, we used the HoG
implementation of VLFeat4. We computed the HoG feature from the grayscale image with
a cell size of 6 pixels and 9 orientations, producing a 1984-dimensional feature vector.
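A sketch of a comparable configuration using scikit-image instead of the VLFeat implementation the authors used; since the two implementations differ, the resulting dimensionality is not the 1984 reported above.

```python
import numpy as np
from skimage.feature import hog

img = np.random.rand(50, 50)   # stand-in for a 50x50 grayscale character patch
feat = hog(img, orientations=9, pixels_per_cell=(6, 6),
           cells_per_block=(2, 2), block_norm='L2-Hys')
```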
3.4. Zoning
Zoning is computed by dividing the image into N smaller zones: vertical, horizontal, square,
left and right diagonal, radial or circular zones (see Figure 27). Local properties of the
image are extracted in each zone. Zoning can be applied to binary and grayscale images
(Kumar, 2010). For example, in a binary image, the density of character pixels in each zone
is computed as a local feature (Bokser, 1992); in a grayscale image, the average gray value
in each zone is taken as a local feature (Ashlin Deepa and Rajeswara Rao, 2014). Zoning
can easily be combined with other feature extraction methods (Hossain et al., 2012), for
example in (Blumenstein et al., 2003). In our experiments, we computed zoning with 7 zone
types (zone width or zone size = 5 pixels) and combined them into a 205-dimensional
feature vector. We also tested the zoning feature on the skeleton image. A sketch of the
square-zone variant is given below.
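Only the square (block) zone type is shown; the other zone geometries (vertical, horizontal, diagonal, circular, radial) follow the same idea with different pixel groupings.

```python
import numpy as np

def block_zoning(binary, zone=5):
    """Pixel-density feature per square zone of a binary character image."""
    h, w = binary.shape
    return np.array([binary[i:i+zone, j:j+zone].mean()
                     for i in range(0, h, zone)
                     for j in range(0, w, zone)])
```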
Figure 25. An image with 4x4 oriented histogram cells and 2x2 descriptor blocks over-
lapped on 2x1 cells (Kesiman et al., 2016c).
Figure 26. The representation of the array of cells HoG (Kesiman et al., 2016c).
4
http://www.vlfeat.org/api/hog.html.
Figure 27. Type of Zoning (from left to right: vertical, horizontal, block, diagonal, circular,
and radial zoning) (Kesiman et al., 2016c).
4. Word Spotting
Many word spotting methods have been reported over the last decade (Lee et al., 2012a;
Dovgalecs et al., 2013; Rusinol et al., 2011; Rothacker et al., 2013a; Khayyat et al., 2013;
Fischer et al., 2012; Rothacker et al., 2013b). Segmentation-free word spotting methods
try to spot the query word patch image given by the user by applying a sliding window on
the document image (Dovgalecs et al., 2013; Rusinol et al., 2011; Rothacker et al., 2013a,b;
Lee et al., 2012a); for each window, the system measures the similarity with the query
image based on some image features or descriptors. Training-based word spotting methods
integrate a learning system to recognize the query word patch image on the document
image (Rothacker et al., 2013a; Khayyat et al., 2013; Rothacker et al., 2013b); the system
should be trained on a sufficiently large collection of training data to achieve good
performance.
As a benchmark, most of the proposed word spotting methods were tested and evaluated
on collections of document images printed or handwritten on paper in Latin script in
English (Lee et al., 2012a; Dovgalecs et al., 2013; Rusinol et al., 2011; Rothacker et al.,
2013b), for example the well-known and widely used George Washington document
dataset5. Some methods have also been proposed and evaluated to spot words in collections
of document images with non-Latin scripts, for example Korean, Persian and Arabic
documents, as well as Indic scripts and languages (Rusinol et al., 2011; Rothacker et al.,
2013a; Khayyat et al., 2013; Lee et al., 2012a). The writing style of each script differs in
how words are written and how they are joined or separated in a text line.
According to several surveys, an image feature which has been widely used to perform the
matching task in image retrieval and indexing systems is the Scale Invariant Feature
Transform (SIFT) (Dovgalecs et al., 2013; Rusinol et al., 2011; Lee et al., 2012a; Auclair
et al., 2007; Almeida et al., 2009; Ledwich and Williams, 2004; Lowe, 2004). Based on the
work of Rusiñol et al. (Rusinol et al., 2011) and Dovgalecs et al. (Dovgalecs et al., 2013),
we experimented with a segmentation-free and training-free word spotting method for our
multi-writer palm leaf manuscript images using Bag of Visual Words (BoVW). Our
preliminary hypothesis is that the powerful BoVW framework, combined with Latent
Semantic Indexing (LSI) (Rusinol et al., 2011; Deerwester et al., 1990), Longest Common
Subsequence (LCS) (cor, 2001), and Longest Weighted Profile (LWP) (Dovgalecs et al.,
2013), can handle this task. A segmentation-free and training-free word spotting method is
more suitable for palm leaf manuscript images because, as already stated, words in Balinese
script are not written separately, so word segmentation is not a trivial process for this
collection.
Each descriptor contains 128 feature values. 2) Descriptor removal based on gradient norm:
we kept only the 75% of descriptors with the highest gradient norm, removing most of the
descriptors located in the manuscript background (Figure 30). 3) Descriptor quantization
into a codebook with K-Means clustering7: we quantized all descriptors into 1500 clusters.
4) Visual word construction from the codebook clusters: we assigned a cluster label to each
keypoint (Figure 31). 5) Bag of Visual Words (BoVW) construction with Spatial Pyramid
Matching (SPM) (Lazebnik et al., 2006) of visual word patches: we generated the histogram
of visual words by sliding a patch of size 300 x 75 pixels, sampled every 25 pixels
(Figure 32 and Figure 33). This patch size sufficiently covers the average size of a word.
Based on SPM level 2, the histogram was constructed for each patch from 3 spatial
positions: the total patch area, the left half, and the right half (Figure 34). The three
histograms from the 3 spatial positions of a patch, each with 1500 bins for the 1500 cluster
labels, are then concatenated into one histogram feature with 4500 bins (Figure 35) (the jth
feature patch of the ith page of the manuscript, Pij). A sketch of this last step is given below.
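A minimal sketch of the SPM level-2 patch histogram of step 5, assuming `kps` is an array of (x, y) keypoint positions and `labels` their codebook cluster indices (0..K-1); names and layout are ours, not the authors'.

```python
import numpy as np

def patch_histograms(kps, labels, page_w, page_h, K=1500,
                     pw=300, ph=75, step=25):
    """One 3*K-bin histogram per sliding patch: whole, left and right areas."""
    feats = []
    for y in range(0, page_h - ph + 1, step):
        for x in range(0, page_w - pw + 1, step):
            inside = ((kps[:, 0] >= x) & (kps[:, 0] < x + pw) &
                      (kps[:, 1] >= y) & (kps[:, 1] < y + ph))
            left = inside & (kps[:, 0] < x + pw // 2)
            h_all = np.bincount(labels[inside], minlength=K)
            h_left = np.bincount(labels[left], minlength=K)
            h_right = np.bincount(labels[inside & ~left], minlength=K)
            feats.append(np.concatenate([h_all, h_left, h_right]))  # 4500 bins
    return np.array(feats)
```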
Figure 35. Histogram feature of a patch of visual words with SPM level 2.
Figure 37. SIFT descriptors with high gradient norm of a query image.
4.4. Online Matching Between Query and Patches of Word on the Page
of Manuscripts
For the matching process, the following method is applied. 1) Similarity measure with
Cosine Distance: for each query image and the ith page of the manuscript, we measure the
similarity with the Cosine Distance between the query feature Q̂i and each patch feature
(visual word feature) P̂ji in this ith page of the manuscript:

$$d = 1 - \frac{\hat{Q}^{i} \cdot \hat{P}^{i}_{j}}{\lVert \hat{Q}^{i} \rVert \, \lVert \hat{P}^{i}_{j} \rVert} \qquad (10)$$

Figure 39. Spatial Pyramid Matching level 2 of a patch of visual words of a query image.
Figure 40. Histogram feature of a patch of visual words with SPM level 2 of a query image.
2) Selection of the N smallest distances to build the map of spotting areas: for each query
image, we selected the N patches with the smallest distance between patch feature and
query feature. In our experiments, we tested N = 75, 100 and 125. All selected patches are
placed at their positions to build the map of spotting areas (Figure 41).
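A direct sketch of Equation (10) and the top-N selection with numpy; the function name is ours.

```python
import numpy as np

def spot_patches(Q, P, N=100):
    """Q: query histogram (4500,); P: page patch histograms (n_patches, 4500)."""
    d = 1.0 - (P @ Q) / (np.linalg.norm(P, axis=1) * np.linalg.norm(Q) + 1e-12)
    return np.argsort(d)[:N]   # indices of the N closest patches
```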
To filter the best-selected patches from the previous step, we can apply the Longest
Common Subsequence (LCS) (cor, 2001) or the Longest Weighted Profile (LWP) technique
(Dovgalecs et al., 2013). To perform the LCS and LWP algorithms, the visual words of the
query feature and of the patch feature are arranged into a one-dimensional row vector by
concatenating the rows of the visual word grid. In the final step, to discard redundant and
overlapping patches, we propose and apply a simple patch selection algorithm.
Figure 43. The selected patches after LCS technique of Figure 41.
$$M_{i,j} = \max\left(0, \frac{\langle \mu_i, \mu_j \rangle}{\lVert \mu_i \rVert \, \lVert \mu_j \rVert}\right)^{\tau}, \quad \forall i,j \in \{1, \ldots, K\} \qquad (11)$$
where the µi are the cluster centers in the concatenated SIFT feature space, K is the number
of clusters, and τ > 0. As in (Dovgalecs et al., 2013), we used τ = 50 in our experiments.
Algorithm 2 describes the pseudocode of the LWP algorithm. As τ → ∞, the matrix M
becomes the identity matrix and the LWP algorithm reduces to the LCS algorithm. As in
the experiment with LCS, to filter the spotting areas, we empirically set the threshold
values T = 0.35 and 0.40.
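Equation (11) as a direct sketch; `mu` is the (K, dim) array of K-means cluster centres and the function name is ours.

```python
import numpy as np

def lwp_similarity(mu, tau=50.0):
    """Soft word-similarity matrix of Equation (11)."""
    norm = np.linalg.norm(mu, axis=1, keepdims=True)
    cos = (mu @ mu.T) / (norm * norm.T)
    return np.maximum(0.0, cos) ** tau   # tau -> inf recovers the identity (LCS)
```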
When a patch does not overlap any other patch area, the single patch is directly selected as
a final spotting area. In a group of overlapping patches, we count the number of overlapping
patches covering each pixel of the area and choose all pixels covered by the maximum
number of overlapping patches as the center of a new spotting area. The new spotting area
is defined as the minimal rectangle covering all those pixels. If this new spotting area is
smaller than the query image, it is enlarged to the size of the query image.
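A small sketch of this merging step for one group of overlapping patches, with our own function name and box convention (x, y, width, height).

```python
import numpy as np

def merge_overlapping(patches, page_h, page_w):
    """patches: list of (x, y, w, h) boxes from one overlapping group."""
    votes = np.zeros((page_h, page_w), dtype=int)
    for x, y, w, h in patches:
        votes[y:y+h, x:x+w] += 1            # count overlaps per pixel
    ys, xs = np.where(votes == votes.max()) # pixels with maximal overlap
    return (xs.min(), ys.min(),
            xs.max() - xs.min() + 1, ys.max() - ys.min() + 1)
```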
Figure 44. Spotting Area after patch selection algorithm of Figure 43.
To capture the manuscript images, a Canon EOS 5D Mark III camera was used, with the
following settings (Kesiman et al., 2015b): F-stop: f/22 (diaphragm), exposure time:
1/50 sec, ISO speed: ISO-6400, focal length: 70 mm, flash: On - 1/64, distance to
object: 76 cm, focus: Quick mode - Auto selection ‘On’. A black-box camera support made
of wood was also used to avoid irregular lighting/luminance conditions and to fit the
semi-outdoor capturing locations (Figure 45). This camera support was designed to operate
under the restricted conditions imposed by the museums or the owners of the manuscripts.
Two additional lights were added inside the black-box support: white neon tubes of 50 cm
and 20 watts. Thumbnail samples of the captured images are shown in Figure 46.
Figure 45. Camera support for digitizing process of palm leaf manuscripts.
Table 8. Summary of word annotated images dataset for the AMADI LontarSet
(Kesiman et al., 2016b)
9
http://www.primaresearch.org/tools/Aletheia
Figure 46. Sample images of palm leaf manuscript from a) Museum Gedong Kertya, Sin-
garaja, b) Museum Bali, Denpasar, c) Village of Jagaraga, Buleleng, d) Village of Susut,
Bangli, e) Village of Rendang, Karangasem (Kesiman et al., 2016b).
In the near future, we plan to extend the dataset in terms of data quantity and variety, in
order to provide a sufficiently large training data set for document analysis methods.
6. Experiments
6.1. Experiment on Isolated Character Recognition
We present an experimental study on feature extraction methods for the recognition of
Balinese script characters on palm leaf manuscripts (Kesiman et al., 2016c). We investigated
and evaluated the performance of 10 feature extraction methods and the proposed
combination of features in 29 different schemes. For all experiments, a set of image patches
containing Balinese characters from the original manuscripts is used as input, and the
correct class of each character should be identified as the result. We used k=5 for the k-NN
classifier, and all images are resized to 50x50 pixels (the approximate average size of a
character in the collection), except for the Gradient features, where images are resized to
81x81 pixels to obtain exactly 81 blocks of 9x9 pixels, as described in (Fujisawa et al., 1999).
The results (Table 10) show that the recognition rate of NPW features can be significantly
increased (by up to 10%) by applying them on the four directional Kirsch edge images
(NPW-Kirsch method). Combining these NPW-Kirsch features with HoG features and the
zoning method then increases the recognition rate up to 85%. In these experiments, the
number of training samples for each class is not balanced. This condition was already
clearly stated and cannot be avoided in our case of IHCR development for Balinese script
on palm leaf manuscripts: some ancient characters are rarely found in our collection of
palm leaf manuscripts.
Figure 47. Samples of binarized images ground truth dataset (Kesiman et al., 2016b).
A smaller number of selected patches N yields a higher mean average precision value. This
is because, in the collection of palm leaf manuscripts, a specific word is normally found on
only a very limited number of pages. Most of the patches whose features are close to the
query image are normally found on one page. Limiting the number of selected patches
decreases the number of spotted areas on the other pages of the manuscript, which do not
contain the query word. The LCS and LWP techniques increase the mean average precision
value.
Figure 50. Screenshot of web based user interface for the character annotation process.
Figure 52. Samples of character-level annotated patch images of Balinese script on palm
leaf manuscripts (Kesiman et al., 2016b).
Table 10. Recognition rate from all schemes of experiment (Kesiman et al., 2016c)
The variability caused by human subjectivity should be taken into account in the
performance measure of any binarization method. In the case of a manuscript with specific
ancient characters, qualitative observation and validation should also be carried out by a
philologist to guarantee the correctness of the binarized characters on the manuscripts.
A proper and robust combination of feature extraction methods can increase the character
recognition rate. This study shows that the recognition rate of isolated character recognition
for Balinese script can be significantly increased by applying NPW features on the four
directional Kirsch edge images, and that using NPW-Kirsch features in combination with
HoG features and the zoning method can increase the recognition rate up to 85%. The
results of the word spotting study show the challenging characteristics of a manuscript
collection
with a single script and multiple writers. Even though the methods and frameworks of this
query-based word spotting technique are normally evaluated only on single-writer
manuscript collections, the results of this experiment show that the powerful combination
of the BoVW framework with LSI, LCS and LWP can still support an indexing and word
spotting system for multi-writer palm leaf manuscript images.
We have built the AMADI LontarSet, the first handwritten Balinese palm leaf manuscript
dataset. It includes three components: the binarized images ground truth dataset, the word
annotated images dataset, and the isolated character annotated images dataset. To improve
the accuracy of character and text recognition of Balinese script on palm leaf manuscripts,
a lexicon-based statistical approach is needed. The lexicon dataset will provide useful
information about the textual correlations of Balinese script (between characters, syllables,
and words). This information will be needed in the correction step of text recognition when
the physical feature description fails in the recognition process. Our future interests are:
- to build an optimal lexicon dataset for our system, in terms of quantity and completeness
of the dataset; - to define appropriate lexicon information (character, syllable and word
levels) for Balinese script. In Balinese script, there is no space between words in a text
line. Most text recognition methods, which naturally propose a sequential process to
recognize words as entities/units, will face this characteristic as a very challenging task.
Representing all words as specific compositions of part-of-words (POW) can feed the
recognizer with useful contextual information.
Table 11. Results of experiment.
Acknowledgment
The authors would like to thank Museum Gedong Kertya, Museum Bali, and all the families
in Bali, Indonesia, for providing us with samples of palm leaf manuscripts, and the students
of the Department of Informatics Education and the Department of Balinese Literature,
Ganesha University of Education, for helping us in the ground truthing process of this
research project. This work is also supported by the DIKTI BPPLN Indonesian Scholarship
Program and the STIC Asia Program implemented by the French Ministry of Foreign
Affairs and International Development (MAEDI).
References
Cormen, T. H., Leiserson, C. E., Rivest, R. L., and Stein, C. (2001). Introduction to Algorithms, 2nd ed. MIT Press.
Aggarwal, A., Singh, K., and Singh, K. (2015). Use of gradient technique for extract-
ing features from handwritten gurmukhi characters and numerals. Procedia Computer
Science, 46:1716–1723.
Almeida, J., Torres, R. d. S., and Goldenstein, S. (2009). SIFT applied to CBIR. Revista de
Sistemas de Informação da FSMA, 4:41–48.
Arica, N. and Yarman-Vural, F. T. (2002). Optical character recognition for cursive hand-
writing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(6):801–813.
Ashlin Deepa, R. and Rajeswara Rao, R. (2014). Feature extraction techniques for recog-
nition of malayalam handwritten characters: Review. International Journal of Advanced
Trends in Computer Science and Engineering, 3(1):481–485.
Auclair, A., Cohen, L. D., and Vincent, N. (2007). How to use sift vectors to analyze
an image with database templates. In International Workshop on Adaptive Multimedia
Retrieval, pages 224–236. Springer.
Bal, G., Agam, G., Frieder, O., and Frieder, G. (2008). Interactive degraded document
enhancement and ground truth generation. In Electronic Imaging 2008, pages 68150Z–
68150Z. International Society for Optics and Photonics.
Blumenstein, M., Verma, B., and Basli, H. (2003). A novel feature extraction technique for
the recognition of segmented handwritten characters. In Document Analysis and Recogni-
tion, 2003. Proceedings. Seventh International Conference on, pages 137–141. IEEE.
Burie, J.-C., Coustaty, M., Hadi, S., Kesiman, M. W. A., Ogier, J.-M., Paulus, E., Sok, K.,
Sunarya, I. M. G., and Valy, D. (2016). ICFHR 2016 competition on the analysis of hand-
written text in images of Balinese palm leaf manuscripts. In 15th International Conference
on Frontiers in Handwriting Recognition 2016, pages 596–601.
Hossain, M. Z., Amin, M. A., and Yan, H. (2012). Rapid feature extraction for optical
character recognition. arXiv preprint arXiv:1206.0238.
Jin, Z., Qi, K., Zhou, Y., Chen, K., Chen, J., and Guan, H. (2009). SSIFT: An improved
SIFT descriptor for Chinese character recognition in complex images. In Computer Network
and Multimedia Technology, 2009. CNMT 2009. International Symposium on, pages 1–5.
IEEE.
Kesiman, M. W. A., Burie, J.-C., and Ogier, J.-M. (2016a). A new scheme for text line
and character segmentation from gray scale images of palm leaf manuscript. In 15th
International Conference on Frontiers in Handwriting Recognition 2016, Shenzhen, China,
pages 325–330.
Kesiman, M. W. A., Burie, J.-C., Ogier, J.-M., Wibawantara, G. N. M. A., and Sunarya,
I. M. G. (2016b). AMADI LontarSet: The first handwritten Balinese palm leaf manuscripts
dataset. In 15th International Conference on Frontiers in Handwriting Recognition 2016,
pages 168–172.
Kesiman, M. W. A., Prum, S., Burie, J.-C., and Ogier, J.-M. (2015a). An initial study on
the construction of ground truth binarized images of ancient palm leaf manuscripts. In
Document Analysis and Recognition (ICDAR), 2015 13th International Conference on.
Kesiman, M. W. A., Prum, S., Burie, J.-C., and Ogier, J.-M. (2016c). Study on feature
extraction methods for character recognition of Balinese script on palm leaf manuscript
images. In 23rd International Conference on Pattern Recognition, pages 4006–4011.
Kesiman, M. W. A., Prum, S., Sunarya, I. M. G., Burie, J.-C., and Ogier, J.-M. (2015b). An
analysis of ground truth binarized image variability of palm leaf manuscripts. In Image
Processing Theory, Tools and Applications (IPTA), 2015 International Conference on,
pages 229–233. IEEE.
Kesiman, M. W. A., Valy, D., Burie, J.-C., Paulus, E., Sunarya, I. M. G., Hadi, S., Sok,
K. H., and Ogier, J.-M. (2017). Southeast Asian palm leaf manuscript images: a review
of handwritten text line segmentation methods and new challenges. Journal of Electronic
Imaging, 26(1):011011.
Khayyat, M., Lam, L., and Suen, C. Y. (2013). Verification of hierarchical classifier results
for handwritten Arabic word spotting. In Document Analysis and Recognition (ICDAR),
2013 12th International Conference on, pages 572–576. IEEE.
Khurshid, K., Siddiqi, I., Faure, C., and Vincent, N. (2009). Comparison of Niblack
inspired binarization methods for ancient documents. In IS&T/SPIE Electronic Imaging,
pages 72470U. International Society for Optics and Photonics.
Kumar, S. (2011). Study of features for hand-printed recognition. Int. J. Comput. Electr.
Autom. Control Inf. Eng. 5.
Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In Computer vision and pattern recog-
nition, 2006 IEEE computer society conference on, volume 2, pages 2169–2178. IEEE.
Ledwich, L. and Williams, S. (2004). Reduced sift features for image retrieval and indoor
localisation. In Australian conference on robotics and automation, volume 322, page 3.
Citeseer.
Lee, D.-R., Hong, W., and Oh, I.-S. (2012a). Segmentation-free word spotting using SIFT.
In Image Analysis and Interpretation (SSIAI), 2012 IEEE Southwest Symposium on, pages
65–68. IEEE.
Lehal, G. S. and Singh, C. (2000). A Gurmukhi script recognition system. In Pattern
Recognition, 2000. Proceedings. 15th International Conference on, volume 2, pages 557–
560. IEEE.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. Interna-
tional journal of computer vision, 60(2):91–110.
Messaoud, I. B., El Abed, H., Märgner, V., and Amiri, H. (2011). A design of a prepro-
cessing framework for large database of historical documents. In Proceedings of the 2011
Workshop on Historical Document Imaging and Processing, pages 177–183. ACM.
Nafchi, H. Z., Ayatollahi, S. M., Moghaddam, R. F., and Cheriet, M. (2013). An efficient
ground truthing tool for binarization of historical manuscripts. In Document Analysis and
Recognition (ICDAR), 2013 12th International Conference on, pages 807–811. IEEE.
Neha J. Pithadia, D. V. D. N. (2015). A review on feature extraction techniques for optical
character recognition. Int. J. Innov. Res. Comput. Commun. Eng. 3.
Ntirogiannis, K., Gatos, B., and Pratikakis, I. (2008). An objective evaluation methodol-
ogy for document image binarization techniques. In Document Analysis Systems, 2008.
DAS’08. The Eighth IAPR International Workshop on, pages 217–224. IEEE.
Ntirogiannis, K., Gatos, B., and Pratikakis, I. (2013). Performance evaluation methodol-
ogy for historical document image binarization. IEEE Transactions on Image Processing,
22(2):595–609.
Pal, U. and Chaudhuri, B. (2004). Indian script character recognition: a survey. Pattern
Recognit., 37(9):1887–1899.
Pal, U., Jayadevan, R., and Sharma, N. (2012). Handwriting recognition in indian regional
scripts: a survey of offline techniques. ACM Transactions on Asian Language Information
Processing (TALIP), 11(1):1.
Pratikakis, I., Gatos, B., and Ntirogiannis, K. (2013). ICDAR 2013 document image bina-
rization contest (DIBCO 2013). In Document Analysis and Recognition (ICDAR), 2013 12th
International Conference on, pages 1471–1476. IEEE.
Rais, N. B., Hanif, M. S., and Taj, I. A. (2004). Adaptive thresholding technique for
document image analysis. In Multitopic Conference, 2004. Proceedings of INMIC 2004.
8th International, pages 61–66. IEEE.
Ramteke, R. (2010). Invariant moments based feature extraction for handwritten Devana-
gari vowels recognition. International Journal of Computer Applications, 1(18):1–5.
Rani, M. and Meena, Y. K. (2011). An efficient feature extraction method for handwritten
character recognition. In International Conference on Swarm, Evolutionary, and Memetic
Computing, pages 302–309. Springer.
Rothacker, L., Fink, G. A., Banerjee, P., Bhattacharya, U., and Chaudhuri, B. B. (2013a).
Bag-of-features HMMs for segmentation-free Bangla word spotting. In Proceedings of the
4th International Workshop on Multilingual OCR, page 5. ACM.
Rothacker, L., Rusinol, M., and Fink, G. A. (2013b). Bag-of-features HMMs for
segmentation-free word spotting in handwritten documents. In Document Analysis and
Recognition (ICDAR), 2013 12th International Conference on, pages 1305–1309. IEEE.
Rusinol, M., Aldavert, D., Toledo, R., and Llados, J. (2011). Browsing heterogeneous
document collections by a segmentation-free word spotting method. In Document Analysis
and Recognition (ICDAR), 2011 International Conference on, pages 63–67. IEEE.
Saund, E., Lin, J., and Sarkar, P. (2009). PixLabeler: User interface for pixel-level la-
beling of elements in document images. In Document Analysis and Recognition, 2009.
ICDAR’09. 10th International Conference on, pages 646–650. IEEE.
Siddharth, K. S., Dhir, R., and Rani, R. (2011). Handwritten Gurmukhi numeral recog-
nition using different feature sets. International Journal of Computer Applications,
28(2):20–24.
Smith, E. H. B. and An, C. (2012). Effect of “ground truth” on image binarization. In
Document Analysis Systems (DAS), 2012 10th IAPR International Workshop on, pages
250–254. IEEE.
Trier, Ø. D., Jain, A. K., and Taxt, T. (1996). Feature extraction methods for character
recognition - a survey. Pattern Recognition, 29(4):641–662.
Wahl, F. M., Wong, K. Y., and Casey, R. G. (1982). Block segmentation and text extraction
in mixed text/image documents. Computer graphics and image processing, 20(4):375–
390.
In: Handwriting: Recognition, Development and Analysis ISBN: 978-1-53611-937-4
Editors: Byron L. D. Bezerra et al.
© 2017 Nova Science Publishers, Inc.
Chapter 10
1. Introduction
Transcription of handwritten documents has become an interesting research topic in recent
years. In particular, transcription of historical documents is interesting for preserving and
providing access to cultural heritage data (Fischer et al., 2009). Since accessibility to
the contents of the documents is very limited without a proper transcription, this activity
is needed to provide indexing, consulting and querying facilities on the contents of the
documents.
The difficulties that historical manuscripts present make the use of experts, called
paleographers, necessary; they employ their knowledge of ancient scripts and vocabulary
to obtain an accurate transcription. In any case, this manual transcription is both slow and
expensive. In order to make the process more efficient, an interesting option is automatic
transcription, which can employ Handwritten Text Recognition (HTR) technology to
obtain a transcription of the document. However, current state-of-the-art HTR technology
does not guarantee a transcription accurate enough for the subsequent processes on the
obtained data (Fischer et al., 2009; Serrano et al., 2010a), and paleographer intervention is
required.
In order to alleviate the paleographer's task of obtaining an accurate transcription from
an initial HTR transcription, interactive assistive approaches have been introduced re-
cently (Serrano et al., 2010b; Romero et al., 2012; Toselli et al., 2011; Llorens et al., 2009).
In these approaches, the user and the system work together to obtain the perfect transcrip-
tion; the system uses the text image, the automatic transcript provided by the HTR system
and some feedback given by the user to provide a new, hopefully better, hypothesis.
∗ E-mail address: [email protected] (Corresponding author).
Apart from that, additional data sources could be provided to improve the initial tran-
scription. For example, paleographers may dictate the contents of the image to be
transcribed. Dictation can be processed with Automatic Speech Recognition (ASR)
techniques (Jelinek, 1998), and the recognition output can be combined with the HTR
recognition results to obtain a more accurate transcription. This possibility was explored
in (Granell and Martínez-Hinarejos, 2015a) using Confusion Network (CN) combination;
CN combination had previously been studied for unimodal and multimodal signal integration
with good results (Ishimaru et al., 2011; Granell and Martínez-Hinarejos, 2015a; Granell
and Martínez-Hinarejos, 2015b). However, the effects of combination on interactive systems
were not tested.
Additionally, in interactive systems, the user must provide feedback to the system several
times, independently of the initial transcription given by the available data sources. Al-
though the number of interactions may change depending on these initial sources, making
the interaction process comfortable for the user is crucial to the success of an interactive
system. Since paleographers usually employ touchscreen tablets for their task, using touch-
screen pen strokes to provide feedback appears to be an appropriate interactive option. These
ideas have been previously explored in (Romero et al., 2012; Martín-Albo et al., 2013) in the
context of a computer assisted transcription system called CATTI. However, this previous
work employs a suboptimal two-phase process in each interaction step.
The work described in this chapter explores the combination of text images and the speech
signal as a new source for the interactive system, and the use of on-line text feedback that
is integrated into each interaction in a single step by using CN combination. This feedback
modality is applied to the result of unimodal (text image or speech dictation) or multimodal
recognition (combination of both text image and speech dictation recognition). The main
hypothesis is that more ergonomic multimodal interfaces should provide a more comfortable
human-machine interaction, at the expense of employing less deterministic feedback than
with less ergonomic peripherals (e.g., keyboard and mouse). Thus, additional interaction
steps may be necessary to correct possible errors produced when combining the current
hypothesis and the feedback, and their impact on productivity must be measured, especially
for the multimodal source.
In summary, this chapter presents the combination of text images and the speech signal as
a new source to improve the initial hypothesis offered to the user of the interactive system,
and the use of on-line text as correction feedback, integrating it into the current transcription
hypothesis. The results show how on-line HTR hypotheses can correct several errors in the
initial hypotheses of the multimodal recognition process, providing a global reduction of
the user effort and thus speeding up the transcriber's task.
Section “Multimodal Interactive Transcription of Handwritten Text Images” presents
the CATTI framework and the multimodal version of this approach. Section “Multimodal
Combination in CATTI” explains the Confusion Network combination. Section “Natural
Language Recognition Systems” gives an overview of the off-line HTR system, the ASR
system, and the on-line HTR feedback subsystem. Section “Experimental Framework”
details the experimental framework (data, conditions, and assessment measures); Section
“Experimental Results” shows the results; and Section “Conclusions and Future Work”
offers the final conclusions and future work lines.
$$\hat{w} = \operatorname*{arg\,max}_{w \in W} P(w \mid x) = \operatorname*{arg\,max}_{w \in W} \frac{P(x \mid w)\, P(w)}{P(x)} = \operatorname*{arg\,max}_{w \in W} P(x \mid w)\, P(w) \qquad (1)$$
where W denotes the set of all permissible sentences, P(x) is the a priori probability of
observing x, P(w) is the probability of w = (w1 , w2 , . . . , w|w| ) approximated by the language
Figure 1 shows, for the 5-best hypotheses “<s> E AGORA CUENTA LABRADORES </s>”,
“<s> AGORA CUẼTA LA HISTORIA </s>”, “<s> AGORA CUẼTA EL HISTO </s>”,
“<s> A AGORA CUẼTA LA HISTORIA </s>” and “<s> A AGORA CUẼTA EL HISTO </s>”,
the equivalent Word Graph and Confusion Network representations.
Figure 1. Recognition output formats: n-best list, Word Graph and Confusion Network.
model (usually an n-gram word language model), and P(x | w) is the probability of observing
x assuming that w is the underlying word sequence for x, evaluated by the optical or
acoustic models for HTR and ASR, respectively (typically approximated by concatenated
Hidden Markov Models, HMM).
The search for (or decoding of) ŵ is carried out by the Viterbi algorithm (Jelinek, 1998).
From this dynamic-programming decoding process we can obtain not only the single best
hypothesis, but also a huge set of good hypotheses. These solutions can be presented in
the form of an n-best list or compactly represented as Word Graphs (WG) or Confusion
Networks (CN) (Jurafsky and Martin, 2009). A WG is a directed, acyclic and weighted
graph that represents a huge set of hypotheses in a very efficient way. The nodes in a WG
correspond to discrete time points for ASR and to horizontal positions for HTR. The edges
are labelled with words and weighted with the likelihood that the word appears in the signal
delimited by the starting and ending nodes of the edge. The scores are derived from
the HMM and n-gram probabilities computed during the decoding process. A CN is also a
directed, acyclic and weighted graph, but it shows at each point which word hypotheses are
competing or confusable. Each hypothesis goes through all the nodes. The words and their
probabilities are stored in the edges, and the total probability of the words contained in a
subnetwork (SN, all edges between two consecutive nodes) sums up to 1. Figure 1 provides
an example of an n-best list, a WG representing these n-best hypotheses, and an equivalent CN.
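A minimal sketch of a CN as a data structure (a list of subnetworks, each mapping competing words to probabilities that sum to 1); the words follow the example of Figure 1 but the probabilities here are illustrative, not taken from the chapter.

```python
# "" stands for a skip/epsilon arc in a subnetwork.
cn = [
    {"<s>": 1.0},
    {"A": 0.3, "": 0.7},
    {"AGORA": 1.0},
    {"CUENTA": 0.4, "CUẼTA": 0.6},
    {"LABRADORES": 0.35, "LA": 0.65},
    {"HISTORIA": 0.8, "HISTO": 0.2},
    {"</s>": 1.0},
]
# Best hypothesis: the most probable word of each subnetwork.
best = [max(sn, key=sn.get) for sn in cn]
print(" ".join(w for w in best if w))  # <s> AGORA CUẼTA LA HISTORIA </s>
```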
In the CATTI framework, the user and the system cooperate interactively to obtain the
perfect transcription of the text that the images contain.
The process starts when the system proposes a full transcription ŝ of the handwritten
text line image, also taking the speech dictation into account. Then, the user reads this
transcription until finding a mistake and makes a mouse action (MA) m, or equivalent
pointer-positioning keystrokes, to position the cursor at this point. By doing so, the user is
already providing very useful information to the system: he is validating a prefix p of the
transcription (which is error-free) and, in addition, he is signalling that the word e located
after the cursor is incorrect. Hence, the system can take advantage of this fact and directly
propose a new suitable suffix (i.e., a new ŝ) in which the first word is different from the
first wrong word of the previous suffix. This way, many explicit user corrections are
avoided (Romero et al., 2008). If the new suffix ŝ corrects the erroneous word, a new cycle
starts. However, if the new suffix has an error in the same position as the previous one, the
user can enter a word v to correct the erroneous one. This action produces a new prefix p
(the previously validated prefix followed by the new word). The system then takes the new
prefix into account to suggest a new suffix, and a new cycle starts. This process is repeated
until a correct transcription is accepted by the user.
Figure 2 shows an example of the CATTI process. In this example, without interaction,
the user would have to correct three errors in the originally recognised hypothesis (“abadia”,
“segun” and “el”). Using CATTI, only one explicit user correction is necessary to get the
final error-free transcription: interaction 1 only needs a MA, but in interaction 2 a single
mouse action does not succeed and the correct word needs to be typed.
Image / Audio (input)
INTER-0  ŝ: la abadia de Toledo a mano de xpiānos segun el dicho es
INTER-1  m: ⇑ (after “la”)
         p: la
         ŝ: cibdad de Toledo a mano de xpiānos segun el dicho es
         m: ⇑ (after “xpiānos”)
         p: la cibdad de Toledo a mano de xpiānos
INTER-2  ŝ: sigue el dicho es
         v: segund
         p: la cibdad de Toledo a mano de xpiānos segund
         ŝ: dicho es
FINAL    v: #
         p≡t: la cibdad de Toledo a mano de xpiānos segund dicho es
Figure 2. Example of the CATTI interaction process.
Following Figure 2, the system starts with an initially recognised hypothesis ŝ from any
of the modalities or from a combination of both; the user validates its longest well-recognised
prefix p by making a MA m, and the system emits a new recognised hypothesis ŝ. As the
new hypothesis corrects the erroneous word, a new cycle starts. Now, the user validates the
new longest error-free prefix p by making another MA m. The system provides a new
suffix ŝ taking this information into account. As the new suffix does not correct the mistake,
the user types the correct word v, generating a new validated prefix p. Taking the new
prefix into account, the system suggests a new hypothesis ŝ. As the new hypothesis corrects
the erroneous word, a new cycle starts. This process is repeated until the final error-free
transcription t is obtained. The underlined boldface word in the final transcription is the
only one that was corrected by the user. Note that in the second interaction (INTER-2) the
correct word had to be typed, given that the erroneous word was not corrected by the
performed MA, whereas in the first interaction (INTER-1) only a MA was needed to
correct the erroneous word.
Formally, in the traditional CATTI framework (Romero et al., 2012), the system uses a
given feature sequence xhtr, representing a handwritten text line image, and a user-validated
prefix p of the transcription. In this work, in addition to xhtr, we study how a sequence
of feature vectors xasr, representing the speech dictation of the handwritten text line image,
affects the system performance. Therefore, the CATTI system should try to complete the
validated prefix by searching for the most likely suffix ŝ, taking both sequences of feature
vectors into account:
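A minimal sketch of this search, assuming the standard CATTI formulation (notation as in the text):

$$\hat{s} = \operatorname*{arg\,max}_{s}\; P(s \mid x_{htr}, x_{asr}, p) \qquad (2)$$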
Making the naive assumption that xhtr does not depend on xasr, and applying Bayes' rule,
we can rewrite the previous equation as:
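A minimal sketch of the rewritten search, assuming the three terms later compared with Equation (5):

$$\hat{s} = \operatorname*{arg\,max}_{s}\; P(x_{htr} \mid p, s)\, P(x_{asr} \mid p, s)\, P(s \mid p) \qquad (3)$$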
This interactive scenario with on-line handwritten feedback is known as multimodal CATTI
(MM-CATTI). In this way, the user corrective feedback can be quite naturally provided by
means of on-line text or pen strokes registered exactly over the text produced by the system.
In (Romero et al., 2012), the multimodal interaction process is formulated in two steps. In
the first step, a CATTI system solves the problem presented in Equation (3). In the second
step, the user enters some pen strokes t, typically aimed at accepting or correcting parts of
the suffix ŝ suggested by the system in the previous interaction step, thereby validating an
error-free prefix p′. Then, an on-line HTR feedback subsystem is used to decode t into a
word d̂, taking ŝ and p′ into account. This scenario is considered here as the baseline.
In (Granell et al., 2016), a multimodal interaction process in a single step was presented. In
this way, both the CATTI source (which can be off-line text only, speech only, or both) and
the on-line handwritten text help each other to optimise the system accuracy.
Formally speaking, let xhtr be the input image, xasr the speech dictation of the input
image, and t the on-line touchscreen pen strokes that the user introduces to insert or
substitute a word. Let p′ be the error-free user-validated prefix of the previously suggested
transcription, and e the wrong word that the user tries to correct. Using this information,
the system has to suggest a new suffix s as a continuation of the validated prefix p′,
conditioned by the on-line touchscreen strokes t and the erroneous word e. Therefore, the
problem is to find ŝ given xhtr, xasr and feedback information composed of p′, e and t. By
further considering the decoding d of t as a hidden variable, we can write:
We can now make the reasonable assumption that t depends only on d and that xhtr,
xasr and s do not depend on e; approximating the sum over all possible decodings d of t
by the dominating term, Equation (4) can be rewritten as:
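A minimal sketch, assembled from the five probability terms enumerated just below:

$$\hat{s} \approx \operatorname*{arg\,max}_{s}\, \max_{d}\; P(x_{htr} \mid p', d, s)\, P(x_{asr} \mid p', d, s)\, P(s \mid p', d)\, P(t \mid d)\, P(d \mid p', e) \qquad (5)$$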
The first three terms of Equation (5) are very similar to Equation (3), with p being the
concatenation of p′ and d; the main difference is that now d is unknown. The last two terms
correspond to the HTR decoding of the on-line feedback, conditioned by the previously
validated prefix p′ and the erroneous word e. As in conventional CATTI, the probabilities
P(xhtr | p′, d, s), P(xasr | p′, d, s) and P(t | d) are modelled by optical, acoustical and
kinematical HMMs, whereas P(s | p′, d) and P(d | p′, e) are modelled by conditioned
n-grams.
In order to cope with the erroneous word e that follows the validated prefix, and given
that this word only affects the decoding of t, P(d | p′, e) can be formulated as follows:
$$P(d \mid p', e) = \frac{\bar{\delta}(d, e) \cdot P(d \mid p')}{1 - P(e \mid p')} \qquad (6)$$
As in conventional CATTI, this decoding can be implemented using CN. In each interaction
step, the validated prefix p′ is parsed over the CN obtained from the combination of the
off-line original image xhtr and/or the speech dictation xasr. This parsing procedure ends
by defining a node q of the CN whose associated word sequence is p′. Then, a CN is
obtained from the on-line feedback recogniser. Assuming that the user corrects only one
word in each interaction, this on-line CN is composed of a list of words that correspond to
the different decodings of t. This on-line CN is combined with the off-line CN after the
node q. Then, the system continues searching for the most probable suffix, according to
Equation (5), using this new combined CN.
2. The new CN is composed by using three editing actions: combination, insertion and
deletion of subnetworks:
Combination: Given two subnetworks, SNHTR and SNASR, the word posterior prob-
abilities of the combined CATTI subnetwork SNCATTI are obtained by applying
a normalisation on the logarithmic interpolation (Equation (8)) of the smoothed
word posterior probabilities of both subnetworks, where the smoothing is:

$$\Pr{}_{s}(w \mid SN) = \frac{\Pr(w \mid SN) + \Theta}{1 + n\Theta} \qquad (9)$$

where Θ is a defined granularity representing the minimum probability for a
word and n is the number of different words in the final CATTI SN.
Insertion and deletion: The same process is performed for both actions: the sub-
network to insert or to delete is combined with a subnetwork containing a single
*DELETE* arc with probability 1.0.
Combination: The same combination process explained for the multimodal hypotheses
combination is performed in this case.
In the example (Figure 10.3a), the marked subnetwork of the CATTI CN (the subnetwork
between nodes 3 and 4) is selected for combination. In this case, the correct word (LA) is
the most probable word neither in the CATTI subnetwork ({LABRADORES 0.918,
LA 0.082}) nor in the on-line feedback subnetwork ({DUC 0.657, LA 0.343}). However, it
becomes the most probable word when both subnetworks are combined with α = 0.5 and
Θ = 10⁻⁴, as can be seen in the appointed subnetwork of the new CN. Figure 4 shows this
combination process in detail: smoothing assigns the missing words a probability of
9.99 × 10⁻⁵ in each subnetwork, the combination yields {LA 0.168, LABRADORES 0.010,
DUC 0.008} before normalisation, and the final normalised subnetwork is {LA 0.905,
LABRADORES 0.052, DUC 0.043}.
Figure 4. Detailed subnetwork combination example using α = 0.5 and Θ = 10⁻⁴. The
Smoothing block represents the use of Equation (9), and the Combination block the use of
Equation (8).
Insertion: The subnetwork insertion allows adding a word into the CATTI CN at a
particular position. This position is determined by parsing the validated prefix p′
that precedes the on-line word inserted by the user in the CATTI interaction.
As an example, the on-line SN (see Figure 3) is inserted just after the 4th node of the
CATTI CN.
smoothing (Kneser and Ney, 1995), estimated directly from the training transcriptions of the text line images. The decoding is optimally carried out by the Viterbi algorithm. This dynamic-programming decoding process can return not only a single best solution, but also a huge set of best solutions compactly represented in a WG, which is then transformed into a CN. For a more detailed description of these recognition systems, see (Toselli et al., 2004) for off-line HTR, (Rabiner and Juang, 1993) for ASR, and (Toselli et al., 2007) for on-line HTR.
5. Experimental Framework
5.1. Datasets
5.1.1. Off-line Handwritten Text: Cristo Salvador
The Cristo Salvador corpus was employed previously in different works (Alabau et al.,
2011, 2014; Granell and Martı́nez-Hinarejos, 2015a) related to multimodal combination.
This corpus is a handwritten book of the 19th century provided by Biblioteca Valenciana Digital (BiValDi). It is a single-writer book with different image features that cause some problems, such as smearing, background variations, differences in brightness, and bleed-through (ink that seeps through to the other side of the sheet). It is composed of 53 pages that were automatically divided into lines (as shown in Figure 5).
This corpus contains a total of 1,172 lines, with a vocabulary of 3,287 different words. For training the optical models for off-line HTR, a partition with the first 32 pages (675 lines) was used. Test data for off-line HTR is composed of the lines of page 41 (24 lines, 222 words), which was selected for being, according to preliminary recognition error results, a representative page of the whole test set (the remaining 21 pages, 497 lines).
Figure 6. Examples of the word “HISTORIA” generated by using characters from the three
selected UNIPEN test writers (BH, BR, BS).
of 120 utterances, with an average length of about four and a half seconds), using a sample rate of 16 kHz and an encoding of 16 bits (to match the conditions of the Albayzin data).
makes WER and WSR comparable. The relative difference between them gives a good estimation of the reduction in human effort (EFR) that can be achieved by using CATTI with respect to using a conventional HTR system followed by human post-editing.
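As a one-line illustration of this measure (the function name is ours; the arguments are the off-line HTR baseline WER and the multimodal WSR reported later):

```python
def effort_reduction(wer, wsr):
    """EFR: relative reduction of human effort, comparing post-editing
    (WER) against interactive correction (WSR), in percent."""
    return 100.0 * (wer - wsr) / wer

print(effort_reduction(32.9, 12.9))  # ~60.8, as in Table 3 below
```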
Moreover, since on-line feedback is used only for single-word corrections, the conven-
tional classification error rate (ER) was used to assess its recognition quality.
For each measure, confidence intervals of 95% were calculated by using the bootstrap-
ping method with 10,000 repetitions (Bisani and Ney, 2004).
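The bootstrapped intervals can be obtained along these lines (a minimal sketch of the resampling scheme, not the exact implementation of Bisani and Ney, 2004):

```python
import random

def bootstrap_ci(per_line_errors, per_line_words, reps=10_000, conf=0.95):
    """Confidence interval for WER: resample text lines with
    replacement and recompute the corpus-level error rate."""
    lines = list(zip(per_line_errors, per_line_words))
    stats = []
    for _ in range(reps):
        sample = [random.choice(lines) for _ in lines]
        errs = sum(e for e, _ in sample)
        words = sum(w for _, w in sample)
        stats.append(100.0 * errs / words)
    stats.sort()
    lo = stats[int((1 - conf) / 2 * reps)]
    hi = stats[int((1 + conf) / 2 * reps)]
    return lo, hi
```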
6. Experimental Results
Several experiments were performed to assess our multimodal proposal for improving the
assistive transcription system presented in Section “Multimodal Interactive Transcription of Handwritten Text Images”. Multimodal combination allows enriching the CATTI hypotheses with different sources of information (off-line HTR and ASR). Moreover, the multimodal operation with the on-line HTR feedback offers ergonomics and increased usability at the expense of the system having to deal with non-deterministic feedback signals. Therefore, the main concern here is how the performance of CATTI and MM-CATTI can be boosted by the multimodal combination of different decoding systems. Finally, we assess what degree of synergy can be expected by taking into account both interaction and multimodality.
We started by obtaining the non-interactive post-editing baselines for the off-line HTR, for the ASR, and for the multimodal combination of both unimodal recognition systems. Then, the CATTI and MM-CATTI approaches were applied to the three input possibilities, two unimodal (off-line HTR and ASR) and one multimodal (off-line HTR combined with ASR), formatted as CN.
However, the multimodal combination of both sources reduces the WER to 29.3% ± 2.5, which represents a relative improvement of 10.9% over the off-line HTR baseline and 33.0% over the ASR baseline. One of the main advantages of the multimodal combination is that not only the 1-best hypothesis is improved, but also the rest of the hypotheses. This fact can be observed through the oracle WER of this multimodal combination (13.4% ± 2.1), which is significantly reduced with respect to the two unimodal sources. Given that the oracle WER represents the WER of the best hypothesis that can be achieved, a significant beneficial effect on interactive systems can be expected.
Concerning processing time, as can be observed in Table 2, the off-line HTR system needed on average 204,642.2 ms (approximately 3.4 minutes) to decode a text line image, while the ASR system needed only 30.1 seconds to decode each speech utterance. On the other hand, the combination of the decoding results of both systems took on average 315.5 ms per sample. Taking into account that both decoding processes can be performed in parallel, the average time required for obtaining the multimodal output corresponds to the processing time of the slowest modality (in this case the off-line HTR) plus the combination time, i.e., 204,957.7 ms on average. The multimodal combination thus represents a negligible increase of 0.2% over the time required for off-line HTR decoding.
Table 3. CATTI results.

                 CATTI Input
Measure    Off-line       ASR            Multimodal
WSR        31.1% ± 6.0    31.6% ± 3.1    12.9% ± 2.1
EFR        5.5%           4.0%           60.8%
Table 3 presents the estimated interactive human effort (WSR) required for obtaining
the perfect transcription using the interactive CATTI approach for the three different input
possibilities.
As expected, the WSR obtained for the unimodal inputs represents a slight effort reduction (EFR) of around 5% with respect to the off-line HTR baseline. However, in the case of the multimodal input the WSR reaches 12.9% ± 2.1, which represents a significant effort reduction of 60.8% over the off-line HTR baseline (32.9% ± 6.4). Notice that, in the multimodal case, the obtained WSR value is slightly lower than the oracle WER value (13.4% ± 2.1); this is possible because the presented CATTI approach, by means of mouse actions, allows reducing the number of words explicitly corrected by the user. Therefore, in this case, the CATTI approach not only offers the best hypothesis contained in the multimodal lattices, but also improves on the oracle WER value by deleting several erroneous words from this hypothesis.
Table 4. On-line HTR feedback results.

                   MM-CATTI Input
Measure            Off-line HTR    ASR            Multimodal
On-line Words      57.7            70.4           23.9
On-line HTR ER     6.9% ± 3.2      3.6% ± 0.9     8.7% ± 2.4

Table 5. MM-CATTI results.

                   MM-CATTI Input
Measure            Off-line HTR    ASR            Multimodal
WSR   TS           26.0% ± 5.5     31.6% ± 3.4    10.7% ± 1.8
      KBD          12.3% ± 5.0     8.6% ± 1.5     3.2% ± 1.0
      Global       38.3% ± 8.6     40.2% ± 4.4    13.9% ± 2.4
EFR                −16.4%          −22.2%         57.8%
Table 5 presents the MM-CATTI results. In this case, the estimated interactive human effort (WSR) is decomposed into the percentage of words written with the on-line HTR feedback, i.e., the touchscreen (TS), and the percentage of words for which the correction with the on-line HTR feedback failed and had to be entered by means of the keyboard (KBD). That is, in MM-CATTI the WSR is calculated under the assumption that the cost of keyboard-correcting an erroneous on-line feedback word is similar to that of another on-line HTR interaction. This is a pessimistic assumption, since interaction through the touchscreen is more ergonomic than through the keyboard.
The multimodal combination of the on-line feedback with the MM-CATTI hypotheses significantly reduced the number of words that had to be corrected using the keyboard. Although in the unimodal input experiments only 12.3% of words for off-line HTR and 8.6% for ASR were corrected using the keyboard, no effort reduction is obtained given the previous definition of WSR for MM-CATTI. However, with the multimodal input a WSR of 13.9% was obtained, which represents an EFR of 57.8% with respect to the off-line HTR baseline.
According to these results, in MM-CATTI most of the user effort is concentrated on the more ergonomic and user-preferred touchscreen feedback. However, the overall user effort in MM-CATTI is only moderately higher than that of CATTI when the input presents a low oracle WER value.
Acknowledgments
Work partially supported by projects READ - 674943 (European Union’s H2020),
SmartWays - RTC-2014-1466-4 (MINECO), and CoMUN-HaT - TIN2015-70924-C2-1-R
(MINECO/FEDER).
References
Alabau, V., Martı́nez-Hinarejos, C.-D., Romero, V., and Lagarda, A. L. (2014). An iterative
multimodal framework for the transcription of handwritten historical documents. Pattern
Recognition Letters, 35:195–203.
Alabau, V., Romero, V., Lagarda, A. L., and Martı́nez-Hinarejos, C.-D. (2011). A Multi-
modal Approach to Dictation of Handwritten Historical Documents. In Proc. 12th Inter-
speech, pages 2245–2248.
Bisani, M. and Ney, H. (2004). Bootstrap estimates for confidence intervals in ASR perfor-
mance evaluation. In Proc. ICASSP, volume 1, pages 409–412.
Fischer, A., Wuthrich, M., Liwicki, M., Frinken, V., Bunke, H., Viehhauser, G., and Stolz, M. (2009). Automatic transcription of handwritten medieval documents. In Proc. 15th International Conference on Virtual Systems and Multimedia, pages 137–142. IEEE.
Granell, E., Romero, V., and Martinez-Hinarejos, C.-D. (2016). An Interactive Approach
with Off-line and On-line Handwritten Text Recognition Combination for Transcribing
Historical Documents. In Proc. DAS, pages 269–274.
Ishimaru, S., Nishizaki, H., and Sekiguchi, Y. (2011). Effect of Confusion Network Combi-
nation on Speech Recognition System for Editing. In Proc. 3rd APSIPA ASC, volume 4,
pages 1–4.
Kneser, R. and Ney, H. (1995). Improved backing-off for m-gram language modeling. In
Proc. ICASSP, volume 1, pages 181–184.
Llorens, D., Marzal, A., Prat, F., and Vilar, J. M. (2009). State: an assisted document
transcription system. In Proc. ICMI-MLMI, pages 233–234.
Luján-Mares, M., Tamarit, V., Alabau, V., Martı́nez-Hinarejos, C.-D., Pastor, M., Sanchis,
A., and Toselli, A. H. (2008). iATROS: A speech and handwritting recognition system.
In Proc. V JTH, pages 75–78.
Martı́n-Albo, D., Romero, V., and Vidal, E. (2013). Interactive off-line handwritten text
transcription using on-line handwritten text as feedback. In Proc. ICDAR, pages 1280–
1284.
Moreno, A., Poch, D., Bonafonte, A., Lleida, E., Llisterri, J., Mariño, J. B., and Nadeu, C.
(1993). Albayzin speech database: design of the phonetic corpus. In Proc. EuroSpeech,
pages 175–178.
Romero, V., Toselli, A. H., Civera, J., and Vidal, E. (2008). Improvements in the Computer
Assisted Transcription System of Handwritten Text Images. In Proc. 8th PRIS, pages
103–112.
Romero, V., Toselli, A. H., and Vidal, E. (2012). Multimodal Interactive Handwritten
Text Transcription, volume 80 of Machine Perception and Artificial Intelligence. World
Scientific.
Serrano, N., Castro, F., and Juan, A. (2010a). The RODRIGO Database. In Proc. 7th LREC,
pages 2709–2712.
Serrano, N., Sanchis, A., and Juan, A. (2010b). Balancing Error and Supervision Effort in
Interactive-Predictive Handwritten Text Recognition. In Proc. 15th IUI, pages 373–376.
Stolcke, A. (2002). SRILM-an extensible language modeling toolkit. In Proc. 3rd Inter-
speech, pages 901–904.
Toselli, A. H., Juan, A., Keysers, D., González, J., Salvador, I., Ney, H., Vidal, E., and Casacuberta, F. (2004). Integrated Handwriting Recognition and Interpretation using Finite-State Models. IJPRAI, 18(4):519–539.
Toselli, A. H., Vidal, E., and Casacuberta, F., editors (2011). Multimodal Interactive Pattern
Recognition and Applications. Springer, 1st edition.
Toselli, A. H., Pastor, M., and Vidal, E. (2007). On-Line Handwriting Recognition System
for Tamil Handwritten Characters. In Proc. 3rd IbPRIA, volume 4477, pages 370–377.
Xue, J. and Zhao, Y. (2005). Improved Confusion Network Algorithm and Shortest Path
Search from Word Lattice. In Proc. 30th ICASSP, volume 1, pages 853–856.
Young, S., Evermann, G., Gales, M., Hain, T., Kershaw, D., Liu, X., Moore, G., Odell, J.,
Ollason, D., Povey, D., et al. (2006). The HTK book. Cambridge University Engineering Department.
In: Handwriting: Recognition, Development and Analysis ISBN: 978-1-53611-937-4
Editors: Byron L. D. Bezerra et al. © 2017 Nova Science Publishers, Inc.

Chapter 11

Handwritten Keyword Spotting: The Query by Example (QbE) Case

Georgios Barlas, Konstantinos Zagoris and Ioannis Pratikakis∗
1. Introduction
The traditional approach to document indexing usually involves an Optical Character Recognition (OCR) step. Although OCR performs well on modern printed documents and high-quality printing, in the case of handwritten documents several factors affect the final performance, such as intense degradation, paper-positioning variations (skew, translation, etc.) and the variety of writing styles.
Handwritten keyword spotting has attracted the attention of the research community
in the field of document image analysis and recognition since it appears to be a feasible
solution for indexing and retrieval of handwritten documents in the case that OCR-based
methods fail to deliver satisfactory results.
Handwritten keyword spotting (KWS) is the task of retrieving all instances of a given query word in handwritten document image collections without involving a traditional OCR step. There are two basic variations of KWS approaches: (a) the Query by Example (QbE) case, where the query is a word image, and (b) the Query by String (QbS) case where, as the name implies, the query is a string. The study presented in this chapter focuses on the QbE approach.
For a better understanding, QbE methods will be presented from two different perspectives, which relate to the use of segmentation and of learning. Segmentation-based methods are divided into two subcategories based upon the segmented entity, which can be either the word image or the text line. They are strongly dependent on the segmentation step; hence, to compare different methods regardless of segmentation errors, many researchers do not implement a segmentation method but instead use datasets where the segments are given.
∗ E-mail address: [email protected] (Corresponding author).
In the case of segmentation-free methods, the query image is tested for similarity against patches of the whole document image, without segmenting it at any level. The methods of this class, on the one hand, bypass the segmentation step but, on the other hand, cannot avoid searching for words in parts of the image that may not contain text. Therefore, segmentation-free methods avoid failures due to bad segmentation, but the running time increases considerably. It is worth mentioning that the methods of this class are not the prevailing trend.
Training-based methods are those that require training data at a particular stage of the process. A common problem with these methods is the availability of training data. Furthermore, an extra weakness is that, to apply such a method to a new word, ground-truthing work is usually required to obtain training data, which is quite time-consuming and often has to be done entirely manually.
Training-free methods, as the name implies, do not include any training stage in the operational KWS pipeline. They can be applied directly to new words, although they usually require a particular configuration to be effective on the corresponding text.
This chapter is structured as follows: Section “Segmentation-Based Context” presents the KWS methodologies that operate in a segmentation-based context, detailing both methods based on training and methods that are independent of any training; both variations are reviewed separately depending on the type of segmentation used. In Section “Segmentation-Free Context”, methodologies that do not rely on any segmentation are discussed, with a particular focus on whether or not training is used. Section “Experimental Datasets and Evaluation Metrics” gives an overview of the current efforts in performance evaluation and a brief description of the datasets used in QbE KWS, while Section “Concluding Remarks” is dedicated to a discussion that aims to identify the current trends in QbE KWS.
2. Segmentation-Based Context
In this section, segmentation-based methods are presented. They have been categorized into training-based and training-free approaches, and each category is then subdivided according to the word image or text line segmentation context.
word image, LGH features are computed by moving a window from left to right over the image, and feature vectors are extracted at each position to build the feature vector sequence. Finally, using an SC-HMM with GMMs, a score is assigned to each feature vector sequence, which is used to determine the similarity with the query using a threshold. An overview of the methodology is shown in Figure 1.
The same framework was used in Rodríguez-Serrano et al. (2010), modified so that writer adaptation is achieved. For this purpose, a statistical adaptation technique was applied to change some of the GMM parameters for each document. Furthermore, the SC-HMM was used in Rodríguez-Serrano and Perronnin (2012) to enrich feature extraction, since in a left-to-right HMM the states are ordered and the weights of the SC-HMM can be viewed as a sequence of vectors. The distance between these vectors is computed using Dynamic Time Warping (DTW), wherein the Bhattacharyya measure is used as the local similarity. The use of the SC-HMM in an unsupervised context was presented in Rodriguez-Serrano and Perronnin (2012), where character examples of existing fonts were used to create the training set.
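Since DTW recurs throughout this chapter as a matching technique, a generic sketch may help; the Bhattacharyya local distance and the sequence representation are illustrative, not the exact formulation of the papers above:

```python
import numpy as np

def dtw(seq_a, seq_b, local_dist):
    """Classic dynamic time warping: accumulated cost of the best
    monotonic alignment between two feature-vector sequences."""
    n, m = len(seq_a), len(seq_b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = local_dist(seq_a[i - 1], seq_b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def bhattacharyya(p, q):
    """Bhattacharyya distance between two discrete distributions
    (e.g., normalised weight vectors of SC-HMM states)."""
    return -np.log(np.sum(np.sqrt(p * q)) + 1e-12)
```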
In the work of Almazán et al. (2012a), a preprocessing stage is first applied, consisting of margin removal and anisotropic Gaussian filtering. Then, binarization and word segmentation are applied. In the core methodology, a hashing strategy based on Loci features is used to prune word images and limit the candidate locations. A discriminative model is then applied to those locations. The discriminative learning relies upon a Support Vector Machine (SVM), which sets the weights on the appearance features, i.e., Histograms of Oriented Gradients (HOG), to compute the final similarity.
Almazan et al. (2013) created a method that supports both QbE and QbS and addresses multi-writer WS. The QbE pipeline is based on Fisher Vectors (FV) computed over Scale-Invariant Feature Transform (SIFT) descriptors. The FV are then used to feed an SVM to obtain the attribute scores. The authors report that any encoding method and classification algorithm that transforms the input image into attribute scores could replace them, but they chose SVMs and FVs for simplicity and effectiveness.
The study in Fernández et al. (2013) extends a previous method (Fernández et al., 2011) so that the syntactic context in the document is used for inference. To achieve this, Markov Logic Networks (MLN) trained with specific rules are used. An MLN can be considered as a collection of first-order rules, to each of which a real number, the weight, is assigned. Each formula represents a rule in the domain, while the weight indicates the strength of the rule.
The work in Aldavert et al. (2015) applies the Bag-of-Visual-Words (BoVW) model to WS. The authors divide the procedure of creating the BoVW into four basic steps: sampling, description, encoding and pooling. In particular, they sampled the word images densely, using a fixed step at different scales. The description is derived from the HOG descriptor. To encode the descriptors and create the codebook, the Locality-constrained Linear Coding (LLC) algorithm was used. Finally, at the pooling step, the Spatial Pyramid Matching (SPM) technique is applied so that spatial information can be exploited.
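A minimal sketch of these four steps follows; for brevity, hard k-means assignment stands in for LLC encoding and plain sum pooling for SPM, and the local descriptors are assumed to come from a dense HOG sampler:

```python
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(all_local_descriptors, k=512):
    """Cluster local descriptors from the training images into
    k visual words (the codebook)."""
    return KMeans(n_clusters=k, n_init=4).fit(all_local_descriptors)

def bovw_histogram(local_descriptors, codebook):
    """Encode each local descriptor as its nearest visual word and
    pool the assignments into a normalised histogram."""
    words = codebook.predict(local_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)
```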
In the work of Sharma et al. (2015), experiments were conducted with Convolutional Neural Networks (CNNs). Only the classification layers were retrained to address the problem. The CNN extracted a 4096-d feature for each word image, obtained by discarding the last fully connected layer and taking the activations of the units in the second fully connected layer. For matching, standard Lp norms were used.
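A minimal sketch of this kind of descriptor extraction, assuming a PyTorch/torchvision setup with a VGG16 backbone standing in for the off-the-shelf CNN (the exact network, weights and preprocessing of the paper are not specified here):

```python
import torch
import torchvision.models as models
import torchvision.transforms as T

# Drop the final classifier layer so the network outputs the 4096-d
# activation of the second fully connected layer.
vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
vgg.classifier = torch.nn.Sequential(*list(vgg.classifier.children())[:-1])
vgg.eval()

preprocess = T.Compose([T.Resize((224, 224)), T.ToTensor()])

def descriptor(word_image):
    """4096-d feature vector for a PIL word image."""
    x = preprocess(word_image.convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return vgg(x).squeeze(0)

def match(query_img, candidate_img, p=2):
    """Lp-norm matching between two word-image descriptors."""
    return torch.dist(descriptor(query_img), descriptor(candidate_img), p=p)
```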
Figure 2. An illustration of the word spotting pipeline presented in (Rath and Manmatha,
2007).
(iii) horizontal projection, (iv) top shape projection and (v) bottom shape projection. For each feature, the Gustafson-Kessel fuzzy algorithm is applied for quantization, to reduce the stored size of each descriptor. To refine the results, relevance feedback is applied: the user selects the best result and a training set is created, which is used to train an SVM to correct the results. The similarity measure used is a modified (weighted) Minkowski L1 distance.
The study presented in Can and Duygulu (2011) was motivated by a shape representation method. The authors used already segmented word images provided with the datasets. The proposed methodology comprises three main steps. First, binarization is applied, using a threshold computed as the mean intensity value of the gray-scale image. Next, contour extraction is performed for each connected component in the binarized word image. Finally, a sequence of lines is created, which is used as the descriptor for matching. The matching score is computed by first finding the distances between the line descriptors and then summing all distances over the complete word image.
The basic premise in the work of Fornés et al. (2011) is that each word image is treated as a shape which can be represented using two models, namely the Blurred Shape Model (BSM) and the Deformable BSM. First, at the preprocessing stage, segmented text lines are normalized by applying skew angle removal and slant correction. In the case of the BSM, the descriptor represents a probability distribution of the object shape. In the case of the Deformable BSM, every image is represented with two output descriptors: a vector which contains the BSM value of each focus (equidistantly distributed points) and the position of each focus. The proposed matching technique relies upon the movement of the focuses so that their own BSM is maximized. It is shown that using the Euclidean distance in both the BSM and Deformable BSM methods outperforms the use of DTW.
Fernández et al. (2011) used Loci features (Glucksman, 1969) along eight directions. They are computed on the skeleton of each word image, which is obtained after a document image binarization and word image segmentation step. The similarity is computed using the Euclidean and Cosine distances.
The studies in Diesendruck et al. (2012b) and Diesendruck et al. (2012a) focused on building a search system for 1930-40 US Census data. The process starts with binarization, morphological thinning and a Hough transform to locate table lines, since the documents are in table format; thus, the segmentation is based on table lines. Since each cell contains one word, the method is word-segmentation based. Then, a signature vector is composed of the first 10 coefficients of the cosine transform of the upper, lower and transition profiles. Since the dataset is quite large and the response time should remain reasonable, hierarchical agglomerative clustering with complete linkage is used to cluster the signature vectors.
The problem of sequential KWS was addressed in Fernández-Mota et al. (2014). In sequential KWS, the ordered sequence of indices is taken into account for finding similar instances of words in a book. The authors experimented with descriptors that relate to a single-writer scenario (BSM, HOG, nrHOG) as well as descriptors that relate to a multiple-writer scenario (attribute-based approach).
The work in Howe (2013) is based on part-structured modeling, which aims to minimize the deformation energy required to fit the query to the word image instance. The process starts with binarization. Then, skeletonization is applied to produce connected components of single-pixel width. The endpoints and junctions of the skeleton are used to build a tree. Finally, an energy minimization is performed over a function that comprises a deformation energy term and an observation energy term.
The method in Kovalchuk et al. (2014) is the winner of the H-KWS 2014 competition. First, the image is binarized by global thresholding and connected components are computed. Then, pruning of connected components follows, based on heuristics that rely upon properties of the connected components. Using a regular grid of fixed size, HOG and LBP descriptors are computed, resulting in a 250-d vector. A max-pooling process is then applied to the descriptor. Matching is performed with the L2 distance.
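A sketch of such a grid-based HOG+LBP descriptor using scikit-image; the grid size, normalisation and parameters are illustrative, not the competition-winning configuration:

```python
import numpy as np
from skimage.feature import hog, local_binary_pattern
from skimage.transform import resize

def hog_lbp_descriptor(word_img):
    """Fixed-size grid descriptor combining HOG with an LBP
    histogram, in the spirit of Kovalchuk et al. (2014)."""
    img = resize(word_img, (48, 112))  # normalise onto a regular grid
    h = hog(img, orientations=8, pixels_per_cell=(8, 8),
            cells_per_block=(1, 1))
    lbp = local_binary_pattern(img, P=8, R=1, method="uniform")
    lbp_hist, _ = np.histogram(lbp, bins=10, range=(0, 10), density=True)
    return np.concatenate([h, lbp_hist])

def l2_match(query_desc, cand_desc):
    """Final matching with the L2 distance."""
    return float(np.linalg.norm(query_desc - cand_desc))
```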
In Wang et al. (2014), the authors initiate the process by applying the preprocessing step presented in Wang et al. (2013). They use a graph representation model based on the skeleton of each word image. In this graph, the vertices are the structural interest points and the strokes connecting them are the edges. The value of a vertex corresponds to the Shape Context descriptor, while the value of an edge corresponds to the length of the stroke. The similarity between two word images is computed from the similarity of the graphs of each connected component in the query and test word images, which is used to guide the DTW computation.
The work in Zagoris et al. (2014) is based on spatial information from word images. First, gradient vectors are calculated. Because of the sensitivity of the gradient to noise, an Otsu-like threshold is applied to the gradient vectors. At the remaining points, the gradient orientation is calculated. Next, a linear quantization of the gradient orientation to a desired number of levels follows; the quantization step also controls the number of final local points. After quantization, the corner points are characterized as initial keypoints (kP). The final points are the dominant kPs according to the Shannon entropy in an area centred on the kP. After the final points have been selected, the area around each kP is divided into 9 subareas. For each of these 9 areas, using a voting system based on the weighted distance of each point to the kP, a 3-bin histogram is created. The combination of these 9 histograms into a 27-bin histogram yields the descriptor, the Document-Specific Local Features (DSLF). At the matching stage, a normalization is first applied to each word image. Then, instead of a brute-force search, a Local Proximity Nearest Neighbour (LPNN) search is used, taking into account the mean distance of each pixel from the mean center. Finally, the Euclidean distance is applied between the kPs and the results are presented in ascending order. In Zagoris et al. (2015), an extension of this work is presented using Relevance Feedback strategies (CombSum, CombMin, Probabilistic model). It is reported that the optimal results are achieved with the CombMin model.
The goal in Papandreou et al. (2016) is to study zoning features. Binarization and deslanting are first applied to the query and candidate word images as preprocessing steps. The zoning features are extracted after cutting the query word into vertical zones, based on its length and pixel distribution along the horizontal axis, and adjusting these boundaries to the corresponding zones in the candidate word image using DTW. Subsequently, the word images are normalized and their features, which are based on pixel density, are extracted. Finally, the final distance is the product of the Euclidean distance between the two word images and the distance provided by DTW.
In Mondal et al. (2016), the authors introduce a new matching technique, the Flexible Sequence Matching (FSM) algorithm, for the KWS task. At the preprocessing stage, the document images are first binarized by an adaptive technique (Gatos et al., 2006); after binarization, border removal is applied to obtain proper text boundaries (Stamatopoulos et al., 2010), and then a segmentation stage follows that partitions the documents into lines or pieces of lines, down to words or parts of words, depending on the experiment to be conducted. At the feature extraction stage, the grayscale and binary images are used to extract two types of features, namely column-based features and the Slit-style HOG (SSHOG) (Terasawa and Tanaka, 2009); eight column-based features are extracted from the binary image. Finally, at the matching stage, FSM is applied. FSM is similar in spirit to DTW, but it is less sensitive to local variations in the spelling of words and to local degradation effects within the word image.
In a recent work, Retsinas et al. (2016) studied three variants of the Projections of Oriented Gradients (POG) descriptor (Retsinas et al., 2015) in the framework of the KWS problem. The first variant, the global POG (gPOG), differs slightly from POG in two main respects: (i) it keeps a different number of DCT coefficients and (ii) it has 6 projections. The second variant, the k-segmented POG (lPOG), first segments the word image into k overlapping images and then calculates the POG descriptor for each of them. The third variant, the fusion POG (fPOG), as the name implies, is a fusion of the gPOG and lPOG descriptors. Finally, the Euclidean distance is used to obtain the matching score. It should be noted that, at the preprocessing step, binarization, skew correction and height normalization are applied.
3. Segmentation-Free Context
In this section, segmentation-free methods are presented. The methods of this section are
divided into training-based and training-free.
(Hast and Marchetti, 2012), a technique that was first introduced for matching aerial images. First, the input images are binarized using Otsu's method and then smoothed with a Gaussian in order to find more keypoints. Then, four different kinds of keypoints, which basically detect lines, corners and blobs, are detected in the word images. Taking into account the detected keypoints, Fourier-based feature descriptors are computed. In the end, the matching is performed by an improved version of Random Sample Consensus (RANSAC), called PUMA, which allows a more relaxed matching among the word images.
Rabaev et al. (2016) focus on KWS by locating the query word in a document recur-
sively in a scale-space pyramid. The proposed scheme does not depend on a specific choice
of features. They experimented with the HOG descriptors, which have been shown to pro-
vide good results. Chi-square distance is applied to compare HOG descriptors at all levels
of the pyramid.
Precision is the fraction of retrieved words that are relevant to the query, while Recall is the fraction of the relevant words that are successfully retrieved. It is apparent that the above metrics are inversely related. To achieve a combined evaluation, the precision-recall curve is computed.
The Precision-Recall curve is computed by the traditional 11-point interpolated average precision approach (Manning et al., 2008; Van Rijsbergen, 1979). For each query, the interpolated precision is measured at the 11 recall levels 0.0, 0.1, 0.2, ..., 1.0.
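A small sketch of the 11-point interpolation (the list-based representation of the raw precision/recall pairs is illustrative):

```python
def eleven_point_interpolated(precisions, recalls):
    """For each recall level r in {0.0, 0.1, ..., 1.0}, keep the
    maximum precision observed at any recall >= r."""
    levels = [k / 10.0 for k in range(11)]
    return [max((p for p, r in zip(precisions, recalls) if r >= level),
                default=0.0)
            for level in levels]
```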
Sometimes, the differences between the evaluated algorithms are very hard to observe, especially between very similar performance results. Moreover, these graphs may not contain all the desired information (Salton, 1992). Therefore, there is a need to evaluate the retrieval results with a single value. The most common evaluation metric that meets this requirement is the Mean Average Precision (MAP) (NIST, 2013; Chatzichristofis et al., 2011), which is defined as the average of the precision values obtained after each relevant word is retrieved (see Equation (3) below).
Table 1. Descriptors, learning methods and similarity measures used by each method

Training-based:

Method                                        Descriptors                    Learning            Similarity
(Rodríguez-Serrano and Perronnin, 2009)       LGH                            SC-HMM              Euclidean
(Rodríguez-Serrano and Perronnin, 2012)       LGH                            SC-HMM              DTW
(Almazán et al., 2012a)                       Loci, HOG                      SVM                 Dot product
(Fernández et al., 2013)                      Loci                           MLN                 Euclidean
(Aldavert et al., 2015)                       HOG                            BoVW                Histogram matching
(Sharma et al., 2015)                         Deep features                  CNN                 Lp-norm
(Keaton et al., 1997)                         DCT on profiles                Bayesian network    Graph matching
(Choisy, 2007)                                Column-wise binary patterns    NSHP-HMM            Posteriori probability
(Rusinol et al., 2011)                        SIFT                           BoVW                Histogram matching
(Almazán et al., 2012b)                       HOG                            Exemplar SVM        Euclidean
(Dovgalecs et al., 2013)                      SIFT                           BoVW                Chi-square
(Rothacker et al., 2013)                      SIFT                           BoVW, HMM           Histogram matching

Training-free:

Method                                        Descriptors                            Similarity
(Manmatha et al., 1996)                       Profiles                               Euclidean, DTW
(Adamek et al., 2007)                         MCC, DCT                               DTW
(Bhardwaj et al., 2008)                       Moments                                Cosine
(Zagoris et al., 2011)                        CSPD                                   Minkowski L1
(Can and Duygulu, 2011)                       Sequence of lines                      Line matching
(Fornés et al., 2011)                         Deformable BSM                         Euclidean, DTW
(Fernández et al., 2011)                      Loci                                   Euclidean, Cosine
(Diesendruck et al., 2012b,a)                 DCT on profiles                        Euclidean
(Howe, 2013)                                  Endpoints and junctions of skeleton    Energy minimization
(Kovalchuk et al., 2014)                      HOG, LBP                               Euclidean
(Wang et al., 2014)                           SC                                     DTW
(Zagoris et al., 2014)                        DSLF                                   Euclidean
(Papandreou et al., 2016)                     Zoning features                        Euclidean and DTW
(Mondal et al., 2016)                         Column-based, SSHOG                    FSM
(Retsinas et al., 2016)                       POG, gPOG, lPOG, fPOG                  Euclidean
(Kolcz et al., 2000)                          Profiles                               DTW
(Terasawa and Tanaka, 2009)                   SSHOG                                  DTW
(Wang et al., 2013)                           Profiles, SC                           DTW
(Leydier et al., 2009)                        ZOI                                    Cohesive matching
(Zhang and Tan, 2013)                         DaLI                                   Minimum cost path between connected keypoints in a mesh grid
(Hast and Fornés, 2016)                       Corners, blobs                         PUMA
(Rabaev et al., 2016)                         HOG                                    Chi-square
Figure 4. Samples from the most used datasets: (a) GW, (b) IAM, (c) Bentham, (d) Modern.
$$AP = \frac{\sum_{k=1}^{n} \left( P@k \times rel(k) \right)}{\left|\{\text{relevant words}\}\right|} \tag{3}$$

where Precision at k items (P@k) is denoted as

$$P@k = \frac{\left|\{\text{relevant words}\} \cap \{k \text{ retrieved words}\}\right|}{\left|\{k \text{ retrieved words}\}\right|} \tag{4}$$

with the relevance function denoted as follows:

$$rel(k) = \begin{cases} 1 & \text{if the word at rank } k \text{ is relevant} \\ 0 & \text{if the word at rank } k \text{ is not relevant} \end{cases}$$

Finally, the Mean Average Precision (MAP) is calculated by averaging the AP over all Q queries, denoted as:

$$MAP = \frac{1}{Q} \sum_{q=1}^{Q} AP_q \tag{5}$$

$$OA = \frac{A \cap B}{A \cup B} \tag{6}$$
where OA is the overlapping area, A is the resulting bounding box and B is the ground-truth bounding box.
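A compact sketch of Equations (3)-(6); the list-based inputs and box representation are illustrative:

```python
def average_precision(ranked_rel, n_relevant):
    """Equations (3)-(4): ranked_rel[k-1] is 1 if the word at rank k
    is relevant; n_relevant is the total number of relevant words."""
    hits, ap = 0, 0.0
    for k, rel in enumerate(ranked_rel, start=1):
        if rel:
            hits += 1
            ap += hits / k  # P@k evaluated at each relevant rank
    return ap / n_relevant if n_relevant else 0.0

def mean_average_precision(per_query):
    """Equation (5): per_query is a list of (ranked_rel, n_relevant)."""
    return sum(average_precision(r, n) for r, n in per_query) / len(per_query)

def overlap_area(box_a, box_b):
    """Equation (6) for axis-aligned boxes (x0, y0, x1, y1):
    intersection area over union area."""
    ix = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)
```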
The challenging nature of KWS in handwritten documents has motivated the organization of three dedicated international competitions in conjunction with the International Conference on Frontiers in Handwriting Recognition (ICFHR) and the International Conference on Document Analysis and Recognition (ICDAR). In particular, the ICFHR 2014 Handwritten Keyword Spotting Competition (ICFHR-2014) (Pratikakis et al., 2014), the ICDAR 2015 Competition on Keyword Spotting for Handwritten Documents (ICDAR-2015) (Puigcerver et al., 2015) and the ICFHR 2016 Handwritten Keyword Spotting Competition (ICFHR-2016) (Pratikakis et al., 2016) have been the venues where research groups have competed in two different KWS scenarios, namely segmentation-free and segmentation-based. Tables 2, 3 and 4 show the results for the ICFHR-2014, ICDAR-2015 and ICFHR-2016 competitions, respectively.
Table 2. ICFHR-2014 competition results.

                            Segmentation-based                    Segmentation-free
                            BENTHAM          MODERN               BENTHAM          MODERN
Method                      mAP      P@5     mAP      P@5         mAP      P@5     mAP      P@5
(Kovalchuk et al., 2014)    0.524    0.738   0.338    0.588       0.416    0.609   0.263    0.539
(Almazan et al., 2013)      0.513    0.724   0.523    0.706       -        -       -        -
(Howe, 2013)                0.462    0.718   0.278    0.569       0.363    0.556   0.163    0.417
(Leydier et al., 2009)      -        -       -        -           0.205    0.335   0.087    0.234
(Pantke et al., 2013)       -        -       -        -           0.337    0.543   0.091    0.245

Table 3. ICDAR-2015 competition results.

                                               Segmentation-based      Segmentation-free
Method                                         mAP       P@5           mAP       P@5
Pattern Recognition Group,
TU Dortmund University                         0.424     0.406         0.276     0.343
(Aldavert et al., 2013)                        0.300     0.342         0.082     0.109

Table 4. ICFHR-2016 competition results.

                                               Segmentation-based      Segmentation-free
                                               Botany    Konzil.       Botany    Konzil.
Method                                         mAP       mAP           mAP       mAP
Computer Vision Center (CVCDAG),
Universitat Autònoma de Barcelona, Spain       75.77     77.91         0         0
Pattern Recognition (PRG),
TU Dortmund University, Germany                89.69     96.05         15.89     52.20
(Kovalchuk et al., 2014)                       50.64     71.11         37.48     61.78
Visual Information and Interaction (QTOB),
Uppsala University, Sweden                     54.95     82.15         -         -
“CB” stands for a collection of 50 pages from handwritten marriage licenses from the
Barcelona Cathedral written in 1617. URL: http://dag.cvc.uab.es/the-esposalles-database/
The “Bentham” dataset is part of the H-KWS 2014 contest’s dataset. It consists of high
Table 5. Segmentation type and datasets used by each method.

Training-based:

Method                                                       Segmentation    Dataset
(Rodríguez-Serrano and Perronnin, 2009),
(Rodriguez-Serrano and Perronnin, 2012),
(Rodríguez-Serrano et al., 2010)                             Word            French
(Rodríguez-Serrano and Perronnin, 2012)                      Word            GW, French, IFN/ENIT
(Almazán et al., 2012a), (Almazan et al., 2013)              Word            CB
(Fernández et al., 2013)                                     Word            CB
(Aldavert et al., 2015)                                      Word            GW, Bentham, Modern
(Sharma et al., 2015)                                        Word            IAM
(Keaton et al., 1997)                                        Line            AIS
(Choisy, 2007)                                               -               HMS
(Rusinol et al., 2011, 2015)                                 -               GW
(Almazán et al., 2012b)                                      -               GW
(Dovgalecs et al., 2013)                                     -               GW
(Rothacker et al., 2013)                                     -               GW

Training-free:

Method                                                       Segmentation    Dataset
(Manmatha et al., 1996)                                      Word            DIMUND, Hudson
(Rath and Manmatha, 2003a,b, 2007)                           Word            GW
(Adamek et al., 2007)                                        Word            GW
(Bhardwaj et al., 2008)                                      Word            GW, IAM
(Zagoris et al., 2011)                                       Word            GW, Greek
(Can and Duygulu, 2011)                                      Word            GW, OTM
(Fornés et al., 2011)                                        Word            GW
(Fernández et al., 2011), (Fernández-Mota et al., 2014)      Word            CB
(Diesendruck et al., 2012b,a)                                Word            USC
(Howe, 2013)                                                 Word            GW
(Kovalchuk et al., 2014)                                     Word            GW
(Wang et al., 2014)                                          Word            GW, CB
(Zagoris et al., 2014)                                       Word            GW, Bentham, Modern
(Papandreou et al., 2016)                                    Word            GW, Bentham
(Mondal et al., 2016)                                        Word            GW, Japanese
(Retsinas et al., 2016)                                      Word            Bentham, Modern
(Kolcz et al., 2000)                                         Line            AIS
(Terasawa and Tanaka, 2009)                                  Line            GW, Japanese
(Wang et al., 2013)                                          Line            CITERE
(Leydier et al., 2007, 2009)                                 -               GW
(Zhang and Tan, 2013)                                        -               GW
(Hast and Fornés, 2016)                                      -               CB
(Rabaev et al., 2016)                                        -               GW, CG, Arabic
quality (approximately 3000 pixels width and 4000 pixels height) handwritten manuscripts.
The documents were written by Jeremy Bentham (1748-1832) himself, as well as by Bentham's secretarial staff, over a period of sixty years.
The “Modern” dataset is also part of the H-KWS 2014 contest’s dataset. It consists of
modern handwritten documents from the ICDAR 2009 Handwritten Segmentation Contest.
These documents originate from several writers that were asked to copy a given text. They
do not include any non-text elements (lines, drawings, etc.) and are written in four (4)
languages (English, French, German and Greek).
A dataset that comprises 1539 pages of modern off-line handwritten English text written
by 657 different writers is denoted as “IAM”. URL: http://www.fki.inf.unibe.ch/databases
/iam-handwriting-database
The Archives of the Indies in Seville (AIS), is a repository that represents the official
communication between the Spanish Crown and its New World colonies and spans approx-
imately four centuries (i.e. 15th-19th).
In Choisy (2007), a dataset consisting of a collection of French handwritten mails was used for the Handwritten Mail Sorting (HMS) task, wherein 1522 mail pages were manually labelled.
In Manmatha et al. (1996), two single pages were used as datasets. One was obtained from the DIMUND document server, and is thus denoted as "DIMUND"; the other single page was taken from a collection in the library of the University of Massachusetts. This page is a letter written by James S. Gibbons to Erasmus Darwin Hudson and is denoted as "Hudson".
An Ottoman dataset denoted as "OTM" comprises documents written in a commonly encountered calligraphic style called Riqqa, which was used in official documents. It consists of 257 words in three pages of text. URL: http://courses.washington.edu/otap/
US Census forms from 1930 and 1940 comprise a dataset denoted as “USC”.
Scanned images of the Japanese manuscript “The diary of Matsumae Kageyu” by
Akoku Raishiki comprise a dataset denoted as “Japanese”.
Letters written by different French philosophers constituting 4 collections are denoted
as “CITERE”. There are 11 pages containing approximately 2000 words, where 51 words
were used as queries. URL: http://citere.hypotheses.org/
The Cairo Geniza (CG) collection consists of 12 document images dated to the 10th century. This collection exhibits smeared characters, bleed-through, and stains. The page size is about 1650 × 2330 pixels, and the collection contains 1371 words with 921 different transcriptions. URL: http://www.genizah.org/
A collection of 10 pages of Islamic manuscripts from Harvard University, denoted as "Arabic", consists of documents dated from the 12th to the 15th centuries. The ground truth for this collection is given in terms of word-parts. Since word-parts are relatively small, the 5117 largest word-parts, with 929 different transcriptions, were chosen for the experiments. The page size is 1600 pixels. URL: http://ocp.hul.harvard.edu/ihp/.
Conclusion
The major difference between segmentation-based and segmentation-free methods is the different search space (distinct word images versus the whole document image) in which they operate. This is the basis of every advantage or disadvantage that each approach entails.
The main advantage of segmentation-based methods is retrieval speed. Knowing the word boundaries inside the document provides profound advantages with respect to efficient retrieval performance. Segmentation-based methods rely mainly on word segmentation; only one method, i.e., Keaton et al. (1997), alternatively uses text line segmentation.
On the other hand, if the document is too noisy or too complex for a word segmentation method to be applied, then a segmentation-free methodology is the most appropriate approach. Unfortunately, current segmentation-free KWS methods seem not to be appealing, since the indexing storage size and the retrieval computation are very costly even when dealing with medium-sized (100 pages) datasets. Moreover, the complexity of managing the whole document image directly is the main reason that few works deal with the segmentation-free approach.
Concerning feature extraction, it is worth noting that the majority of the works rely on some form of gradients, like SIFT, HOG, LGH, SSHOG and POG. Although the initial approaches used profile features, recent works have recognized that spatial texture information is more robust than shape, especially for handwritten documents. Lastly, some very recent works use features obtained with Deep Neural Networks, the so-called deep features.
For the training-based methods, the BoVW model has been extensively used, either as a standalone learning component or combined with other models like HMMs. The connection with HMMs was motivated by their use in handwriting transcription, with a modelling inherent to the way a human makes a transcription. CNNs have also recently started to appear in the KWS context (Sharma et al., 2015).
Until recently, the most commonly used dataset was the George Washington dataset, for which there was no common evaluation protocol, with each researcher employing their own subset and query set. Fortunately, the recent KWS competitions have set the ground for a concise performance evaluation framework.
Finally, in some works (Zagoris et al., 2011, 2014; Rusinol et al., 2011; Bhardwaj et al.,
2008) the KWS pipeline is coupled with a relevance feedback mechanism which introduces
the user in the retrieval loop, thus, improving the final retrieval performance.
References
Adamek, T., O’Connor, N. E., and Smeaton, A. F. (2007). Word matching using single
closed contours for indexing handwritten historical documents. International Journal of
Document Analysis and Recognition (IJDAR), 9(2-4):153–165.
Aldavert, D., Rusinol, M., Toledo, R., and Lladós, J. (2013). Integrating visual and tex-
tual cues for query-by-string word spotting. In Document Analysis and Recognition
(ICDAR), 2013 12th International Conference on, pages 511–515. IEEE.
Aldavert, D., Rusiñol, M., Toledo, R., and Lladós, J. (2015). A study of bag-of-visual-words
representations for handwritten keyword spotting. International Journal on Document
Analysis and Recognition (IJDAR), 18(3):223–234.
Almazán, J., Fernández, D., Fornés, A., Lladós, J., and Valveny, E. (2012a). A coarse-to-
fine approach for handwritten word spotting in large scale historical documents collec-
tion. In Frontiers in Handwriting Recognition (ICFHR), 2012 International Conference
on, pages 455–460. IEEE.
Almazán, J., Gordo, A., Fornés, A., and Valveny, E. (2012b). Efficient exemplar word
spotting. In BMVC, volume 1, page 3.
Almazan, J., Gordo, A., Fornés, A., and Valveny, E. (2013). Handwritten word spotting with
corrected attributes. In Proceedings of the IEEE International Conference on Computer
Vision, pages 1017–1024.
Bhardwaj, A., Jose, D., and Govindaraju, V. (2008). Script independent word spotting in
multilingual documents. In IJCNLP, pages 48–54.
Cao, H., Govindaraju, V., and Bhardwaj, A. (2011). Unconstrained handwritten docu-
ment retrieval. International Journal on Document Analysis and Recognition (IJDAR),
14(2):145–157.
Chatzichristofis, S. A., Zagoris, K., and Arampatzis, A. (2011). The TREC files: the
(ground) truth is out there. In Proceedings of the 34th international ACM SIGIR confer-
ence on Research and development in Information Retrieval, pages 1289–1290. ACM.
Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990).
Indexing by latent semantic analysis. Journal of the American society for information
science, 41(6):391.
Diesendruck, L., Marini, L., Kooper, R., Kejriwal, M., and McHenry, K. (2012a). Digitiza-
tion and search: A non-traditional use of hpc. In E-Science (e-Science), 2012 IEEE 8th
International Conference on, pages 1–6. IEEE.
Diesendruck, L., Marini, L., Kooper, R., Kejriwal, M., and McHenry, K. (2012b). A frame-
work to access handwritten information within large digitized paper collections. In E-
Science (e-Science), 2012 IEEE 8th International Conference on, pages 1–10. IEEE.
Dovgalecs, V., Burnett, A., Tranouez, P., Nicolas, S., and Heutte, L. (2013). Spot it! finding
words and patterns in historical documents. In Document Analysis and Recognition
(ICDAR), 2013 12th International Conference on, pages 1039–1043. IEEE.
Fernández, D., Lladós, J., and Fornés, A. (2011). Handwritten word spotting in old
manuscript images using a pseudo-structural descriptor organized in a hash structure.
In Iberian Conference on Pattern Recognition and Image Analysis, pages 628–635.
Springer.
Fernández, D., Marinai, S., Lladós, J., and Fornés, A. (2013). Contextual word spotting in
historical manuscripts using markov logic networks. In Proceedings of the 2nd Interna-
tional Workshop on Historical Document Imaging and Processing, pages 36–43. ACM.
Fernández-Mota, D., Manmatha, R., Fornes, A., and Llados, J. (2014). Sequential word
spotting in historical handwritten documents. In Document Analysis Systems (DAS),
2014 11th IAPR International Workshop on, pages 101–105. IEEE.
Fornés, A., Frinken, V., Fischer, A., Almazán, J., Jackson, G., and Bunke, H. (2011). A
keyword spotting approach using blurred shape model-based descriptors. In Proceedings
of the 2011 workshop on historical document imaging and processing, pages 83–90.
ACM.
Gatos, B., Pratikakis, I., and Perantonis, S. J. (2006). Adaptive degraded document image
binarization. Pattern recognition, 39(3):317–327.
Giacinto, G. and Roli, F. (2004). Instance-based relevance feedback for image retrieval. In
NIPS, pages 489–496.
Glucksman, H. A. (1969). Classification of mixed-font alphabetics by characteristic loci.
Technical report, DTIC Document.
Hast, A. and Fornés, A. (2016). A segmentation-free handwritten word spotting approach
by relaxed feature matching. In Document Analysis Systems (DAS), 2016 12th IAPR
Workshop on, pages 150–155. IEEE.
Hast, A. and Marchetti, A. (2012). Putative match analysis: a repeatable alternative to
RANSAC for matching of aerial images. In International Conference on Computer
Vision Theory and Applications, VISAPP2012, Rome, Italy, 24-26 February, 2012, pages
341–344. SciTePress.
Howe, N. R. (2013). Part-structured inkball models for one-shot handwritten word spotting.
In Document Analysis and Recognition (ICDAR), 2013 12th International Conference
on, pages 582–586. IEEE.
Ide, E. (1971). New experiments in relevance feedback. The SMART retrieval system, pages
337–354.
Keaton, P., Greenspan, H., and Goodman, R. (1997). Keyword spotting for cursive docu-
ment retrieval. In Document Image Analysis, 1997.(DIA’97) Proceedings., Workshop
on, pages 74–81. IEEE.
Kolcz, A., Alspector, J., Augusteijn, M., Carlson, R., and Popescu, G. V. (2000). A line-
oriented approach to word spotting in handwritten documents. Pattern Analysis & Appli-
cations, 3(2):153–168.
Kovalchuk, A., Wolf, L., and Dershowitz, N. (2014). A simple and fast word spotting
method. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International
Conference on, pages 3–8. IEEE.
Leydier, Y., Lebourgeois, F., and Emptoz, H. (2007). Text search for medieval manuscript
images. Pattern Recognition, 40(12):3552–3567.
Leydier, Y., Ouji, A., LeBourgeois, F., and Emptoz, H. (2009). Towards an omnilingual
word retrieval system for ancient manuscripts. Pattern Recognition, 42(9):2089–2105.
Manmatha, R., Han, C., and Riseman, E. M. (1996). Word spotting: A new approach to
indexing handwriting. In Computer Vision and Pattern Recognition, 1996. Proceedings
CVPR’96, 1996 IEEE Computer Society Conference on, pages 631–637. IEEE.
Manning, C. D., Raghavan, P., Schütze, H., et al. (2008). Introduction to information
retrieval, volume 1. Cambridge university press Cambridge.
Mondal, T., Ragot, N., Ramel, J.-Y., and Pal, U. (2016). Flexible sequence matching
technique: An effective learning-free approach for word spotting. Pattern Recognition,
60:596–612.
Pantke, W., Märgner, V., Fecker, D., Fingscheidt, T., Asi, A., Biller, O., El-Sana, J., Saabni,
R., and Yehia, M. (2013). Hadara–a software system for semi-automatic processing of
historical handwritten arabic documents. In Archiving Conference, volume 2013, pages
161–166. Society for Imaging Science and Technology.
Papandreou, A., Gatos, B., and Zagoris, K. (2016). An adaptive zoning technique for word
spotting using dynamic time warping. In Document Analysis Systems (DAS), 2016 12th
IAPR Workshop on, pages 387–392. IEEE.
Pratikakis, I., Zagoris, K., Gatos, B., Louloudis, G., and Stamatopoulos, N. (2014). ICFHR
2014 competition on handwritten keyword spotting (H-KWS 2014). In Frontiers in
Handwriting Recognition (ICFHR), 2014 14th International Conference on, pages 814–
819. IEEE.
Pratikakis, I., Zagoris, K., Gatos, B., Puigcerver, J., Toselli, A. H., and Vidal, E. (2016).
ICFHR 2016 handwritten keyword spotting competition (h-kws 2016). In Frontiers in
Handwriting Recognition (ICFHR), 2016 15th International Conference on, pages 613–
618. IEEE.
Puigcerver, J., Toselli, A. H., and Vidal, E. (2015). ICDAR 2015 competition on keyword
spotting for handwritten documents. In Document Analysis and Recognition (ICDAR),
2015 13th International Conference on, pages 1176–1180. IEEE.
Rabaev, I., Kedem, K., and El-Sana, J. (2016). Keyword retrieval using scale-space pyra-
mid. In Document Analysis Systems (DAS), 2016 12th IAPR Workshop on, pages
144–149. IEEE.
Rath, T. M. and Manmatha, R. (2003a). Features for word spotting in historical manuscripts.
In Document Analysis and Recognition, 2003. Proceedings. Seventh International Con-
ference on, pages 218–222. IEEE.
Rath, T. M. and Manmatha, R. (2003b). Word image matching using dynamic time warping.
In Computer Vision and Pattern Recognition, 2003. Proceedings. 2003 IEEE Computer
Society Conference on, volume 2, pages II–II. IEEE.
Rath, T. M. and Manmatha, R. (2007). Word spotting for historical documents. Interna-
tional Journal of Document Analysis and Recognition (IJDAR), 9(2-4):139–152.
Retsinas, G., Gatos, B., Stamatopoulos, N., and Louloudis, G. (2015). Isolated character
recognition using projections of oriented gradients. In Document Analysis and Recog-
nition (ICDAR), 2015 13th International Conference on, pages 336–340. IEEE.
Retsinas, G., Louloudis, G., Stamatopoulos, N., and Gatos, B. (2016). Keyword spotting in
handwritten documents using projections of oriented gradients. In Document Analysis
Systems (DAS), 2016 12th IAPR Workshop on, pages 411–416. IEEE.
Rodrı́guez-Serrano, J. A., Perronnin, F., Sánchez, G., and Lladós, J. (2010). Unsuper-
vised writer adaptation of whole-word HMMs with application to word-spotting. Pattern
Recognition Letters, 31(8):742–749.
Rothacker, L., Rusinol, M., and Fink, G. A. (2013). Bag-of-features HMMs for
segmentation-free word spotting in handwritten documents. In Document Analysis and
Recognition (ICDAR), 2013 12th International Conference on, pages 1305–1309. IEEE.
Rusinol, M., Aldavert, D., Toledo, R., and Llados, J. (2011). Browsing heterogeneous doc-
ument collections by a segmentation-free word spotting method. In Document Analysis
and Recognition (ICDAR), 2011 International Conference on, pages 63–67. IEEE.
Rusinol, M., Aldavert, D., Toledo, R., and Lladós, J. (2015). Efficient segmentation-free
keyword spotting in historical document collections. Pattern Recognition, 48(2):545–
555.
Rusinol, M. and Lladós, J. (2012). The role of the users in handwritten word spotting appli-
cations: query fusion and relevance feedback. In Frontiers in Handwriting Recognition
(ICFHR), 2012 International Conference on, pages 55–60. IEEE.
Salton, G. (1992). The state of retrieval system evaluation. Information processing &
management, 28(4):441–449.
Saon, G. and Belaı̈d, A. (1997). High performance unconstrained word recognition system
combining HMMs and markov random fields. International Journal of Pattern Recogni-
tion and Artificial Intelligence, 11(05):771–788.
Sharma, A. et al. (2015). Adapting off-the-shelf CNNs for word spotting & recognition. In
Document Analysis and Recognition (ICDAR), 2015 13th International Conference on,
pages 986–990. IEEE.
Stamatopoulos, N., Gatos, B., and Georgiou, T. (2010). Page frame detection for double
page document images. In Proceedings of the 9th IAPR International Workshop on
Document Analysis Systems, pages 401–408. ACM.
Terasawa, K. and Tanaka, Y. (2009). Slit style hog feature for document image word spot-
ting. In Document Analysis and Recognition, 2009. ICDAR’09. 10th International Con-
ference on, pages 116–120. IEEE.
Wang, P., Eglin, V., Garcia, C., Largeron, C., Lladós, J., and Fornés, A. (2014). A novel
learning-free word spotting approach based on graph representation. In Document Anal-
ysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 207–211. IEEE.
Wang, P., Eglin, V., Garcia, C., Largeron, C., and McKenna, A. (2013). A comprehensive
representation model for handwriting dedicated to word spotting. In Document Analy-
sis and Recognition (ICDAR), 2013 12th International Conference on, pages 450–454.
IEEE.
Zagoris, K., Ergina, K., and Papamarkos, N. (2011). Image retrieval systems based on
compact shape descriptor and relevance feedback information. Journal of Visual Com-
munication and Image Representation, 22(5):378–390.
Zagoris, K., Pratikakis, I., and Gatos, B. (2014). Segmentation-based historical handwrit-
ten word spotting using document-specific local features. In Frontiers in Handwriting
Recognition (ICFHR), 2014 14th International Conference on, pages 9–14. IEEE.
Zagoris, K., Pratikakis, I., and Gatos, B. (2015). A framework for efficient transcription
of historical documents using keyword spotting. In Proceedings of the 3rd International
Workshop on Historical Document Imaging and Processing, pages 9–14. ACM.
Chapter 12

Handwriting-Enabled E-Paper Based on Twisting-Ball Display

Yusuke Komazaki and Toru Torii

1. Introduction
Recently, electronic paper (e-paper) has attracted increasing attention as a reflective, low-power display technology, and it has come into use in many fields such as e-readers, price tags, digital signage, light modulation and even fashion. Although the electrophoretic display (Ota et al., 1973; Jacobson et al., 1998) is the best-known type of e-paper, "e-paper" is a generic term for thin, lightweight reflective displays, and several display types fall under it, such as the electrochromic display (Schoot et al., 1973; Kobayashi et al., 2008), the electrowetting display (Beni and Hackwood, 1981), the MEMS display (Miles et al., 2003; Taii et al., 2006), the liquid powder display (Hattori et al., 2004; Sakurai et al., 2007) and the cholesteric liquid crystal display (Berreman and Heffner, 1980; Tamaoki, 2001). The twisting-ball display (Howard et al., 1998; Sheridon et al., 1999) is one of them. It was first invented in the 1970s by Dr. Sheridon. The structure of a twisting-ball display is shown in Figure 1. "Janus" particles, which have a black and a white hemisphere, are sandwiched between two electrodes. The two hemispheres carry different charges, so the orientation of the particles can be controlled by an applied voltage. Displaying grayscale images on a twisting-ball display is difficult, but a quasi-grayscale image can be produced by "dithering", a common method for obtaining quasi-grayscale images on a monochrome display. The Janus particles are suspended in silicone-oil-filled cavities in a silicone rubber sheet so that they do not migrate and aggregate. The size of the Janus particles influences several aspects of display performance, such as resolution, contrast and driving voltage. With small particles the resolution improves and the driving voltage is reduced, but contrast suffers, because a small amount of light penetrates the front-side hemisphere and the color of the back-side hemisphere shows through. Balancing these factors, the typical size of Janus particles is around 100 µm, which results in low resolution (254 ppi at the theoretical maximum). However, large twisting-ball displays are easy to fabricate thanks to the simple structure and fabrication process, so large sign boards and digital signage are suitable applications for this display. Commercial products are scarce at present, but an outdoor timer is sold by molten, Japan (Figure 2).
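The quoted 254 ppi ceiling follows directly from the particle size: one inch is 25,400 µm, and 25,400 µm / 100 µm per ball = 254 balls per inch. The "dithering" mentioned above can be illustrated with a minimal ordered (Bayer) dithering sketch in Python; this is a generic rendition of the technique, not the driving scheme of any actual product, and all names in it are ours:

```python
import numpy as np

# 4x4 Bayer matrix: thresholds 0..15 scaled to [0, 1).
BAYER4 = np.array([[ 0,  8,  2, 10],
                   [12,  4, 14,  6],
                   [ 3, 11,  1,  9],
                   [15,  7, 13,  5]]) / 16.0

def ordered_dither(gray):
    """Map a grayscale image (values in [0, 1]) to a binary black/white
    image by tiling the Bayer threshold matrix over it."""
    h, w = gray.shape
    thresh = np.tile(BAYER4, (h // 4 + 1, w // 4 + 1))[:h, :w]
    return (gray > thresh).astype(np.uint8)   # 1 = white hemisphere up

# Example: a horizontal gradient rendered as a quasi-grayscale pattern.
gradient = np.linspace(0.0, 1.0, 64).reshape(1, -1).repeat(16, axis=0)
binary = ordered_dither(gradient)
```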
Although writing and drawing are an important function of paper, studies on the writing function of e-"paper" are limited. Some e-reader products offer writing functions, but these are implemented with the same touch-sensor technology as smartphones. On such devices, latency is unavoidable while writing or drawing, because the refresh rate of e-paper is generally low and processing time is needed. To eliminate latency and realize smooth writing, driving the pixels directly with the writing stimulus, such as writing pressure or the magnetic field of a magnetic pen, is effective (Figure 3). With this direct-drive configuration, product cost can also be reduced, because touch sensors and processors are not required. For example, the "Boogie Board" (Kent Displays, US) achieves very smooth writing by using a pressure-sensitive cholesteric liquid crystal display without touch sensors (Lee et al., 2007; Montbach, 2008). Figure 4 compares handwriting on an e-reader (PRS-350, SONY) and on the Boogie Board: a latency of 0.2-0.5 s was observed on the e-reader, while the Boogie Board showed no latency. Following a similar concept, we developed a handwriting-enabled twisting-ball display simply by adding magnetic nanoparticles to one hemisphere of the Janus particles (Komazaki and Torii, 2013, 2014; Komazaki et al., 2015). The structure of this display is shown in Figure 5. The color of the display (black/white) can be controlled by an applied voltage, as in a conventional twisting-ball display, and handwriting with a magnetic pen in the absence of voltage is also possible. Because the magnetic nanoparticles are superparamagnetic and thus have no remanent magnetization, there is no magnetic interaction between particles when the magnetic pen is not applied, so the particles maintain their orientation (Ghosh et al., 2008). Therefore, this display is bistable not only under electric control but also under magnetic handwriting. We describe the fabrication method, performance and applications of this display below.
Figure 2. Outdoor timer using a twisting-ball display (blue/white), which achieves high contrast under strong sunlight.
Figure 3. Difference between the touch-sensor and direct-drive handwriting systems.
cavities for particle rotation were formed. The sheet was sandwiched between two ITO-coated glass plates, and the twisting-ball display was thus fabricated.
Figure 6. Setup for microfluidic Janus particle synthesis (Komazaki and Torii, 2013).
Figure 9. Color control of the display by applying ±80 V (Komazaki et al., 2015).
ticles was effective. For Janus particles containing 4 wt% magnetic nanoparticles in the black hemisphere, handwriting was possible even while 80 V was applied (Figure 12). Therefore,
Figure 10. Handwriting with a small magnet without voltage: (a) with a φ6 × 3 mm neodymium magnet; (b) with a φ3 × 1.5 mm neodymium magnet (Komazaki et al., 2015).
Figure 11. Impossibility of handwriting right after voltage is applied (Komazaki and Torii, 2014). Handwriting was started when the voltage was turned off. The upper region of the display remained white.
Figure 12. Handwriting while 80 V is applied to the display (Komazaki et al., 2015). For Janus particles containing 4 wt% magnetic nanoparticles in the black hemisphere, handwriting was possible even while 80 V was applied.
Figure 13. Connection of a resistor to discharge the charges on the electrodes immediately after the voltage is removed.
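The effect of such a bleed resistor follows the usual RC relaxation law; the component values below are purely illustrative, since the chapter does not report the panel capacitance:

V(t) = V_0 e^{-t/RC}, with time constant τ = RC.

For an assumed panel capacitance C = 1 nF and a resistor R = 1 MΩ, τ = 1 ms, so the residual electrode voltage falls below 1% of V_0 roughly 5τ = 5 ms after the drive voltage is removed.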
optical scanner can be placed on the back side of the display (Figure 14-b). In this case, only offline handwriting recognition is possible. To achieve online recognition, an optical sensor matrix is required (Figure 14-c). In addition, if the slight current induced on the electrodes by the rotation of the polarized Janus particles can be detected, handwritten strokes could be acquired even without optical sensors. With these methods, large handwriting input devices could be produced at low cost.
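The chapter does not specify a reconstruction algorithm for the sensor-matrix case, but one plausible sketch is to track newly darkened pixels frame by frame and to cut a stroke whenever no new ink appears for a few frames; everything below (names, parameters) is our own illustration:

```python
import numpy as np

def frames_to_strokes(frames, gap_frames=3):
    """Recover an online stroke sequence from successive binary frames
    of a back-side optical sensor matrix (1 = dark pixel).
    Newly darkened pixels between frames approximate the pen tip;
    a run of frames with no new ink ends the current stroke."""
    strokes, current, idle = [], [], 0
    prev = np.zeros_like(frames[0])
    for t, frame in enumerate(frames):
        new_ink = (frame == 1) & (prev == 0)
        if new_ink.any():
            ys, xs = np.nonzero(new_ink)
            current.append((t, xs.mean(), ys.mean()))  # (time, x, y)
            idle = 0
        else:
            idle += 1
            if idle >= gap_frames and current:
                strokes.append(current)
                current = []
        prev = frame
    if current:
        strokes.append(current)
    return strokes
```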
Conclusion
In this chapter, we discussed the background, structure, fabrication method, performance and applications of the handwriting-enabled twisting-ball display. The display achieves smooth handwriting, in addition to the conventional electronic display function, by a quite simple method. Although the contrast is not yet good, it can be improved by inhibiting the mixing of the black and white monomers and by increasing the pigment concentration. As for applications, small mobile devices will not be suitable because of the low resolution. However, as noted above, large-area applications such as sign boards and digital signage are well suited to this display.
Figure 14. Written character sensing from the back-side image: (a) images on the front and back sides of the handwriting-enabled twisting-ball display; (b) offline handwriting recognition with an optical scanner; (c) online handwriting recognition with an optical sensor matrix.
References
Beni, G. and Hackwood, S. (1981). Electro-wetting displays. Applied Physics Letters,
38(4):207–209.
Ghosh, A., Sheridon, N. K., and Fischer, P. (2008). Voltage-controllable magnetic compos-
ite based on multifunctional polyethylene microparticles. Small, 4(11):1956–1958.
Hattori, R., Yamada, S., Masuda, Y., Nihei, N., and Sakurai, R. (2004). A quick-response liquid-powder display (QR-LPD®) with plastic substrate. Journal of the Society for Information Display, 12(4):405.
Howard, M. E., Richley, E. A., Sprague, R., and Sheridon, N. K. (1998). Gyricon electric
paper. Journal of the Society for Information Display, 6(4):215.
Jacobson, J., Comiskey, B., Albert, J. D., and Yoshizawa, H. (1998). An electrophoretic ink for all-printed reflective electronic displays. Nature, 394(6690):253–255.
Kobayashi, N., Miura, S., Nishimura, M., and Urano, H. (2008). Organic electrochromism
for a new color electronic paper. Solar Energy Materials and Solar Cells, 92(2):136–139.
Komazaki, Y., Hirama, H., and Torii, T. (2015). Electrically and magnetically dual-driven Janus particles for handwriting-enabled electronic paper. Journal of Applied Physics, 117(15):154506.
Komazaki, Y. and Torii, T. (2013). Writable electronic paper based on twisting ball type
electronic paper. In The 20th International Display Workshops, pages 1340–1343.
Komazaki, Y. and Torii, T. (2014). Power-saving bar indicator for applied voltage utilizing
twisting ball technology. In The 20th International Display Workshops, pages 1192–
1193.
Lee, D. W., Shiu, J. W., Sha, Y. A., and Chang, Y. P. (2007). 6.3: Writable cholesteric
liquid crystal display and the algorithm used to detect its image. SID Symposium Digest
of Technical Papers, 38(1):61–64.
Miles, M., Larson, E., Chui, C., Kothari, M., Gally, B., and Batey, J. (2003). Digital Paper™ for reflective displays. Journal of the Society for Information Display, 11(1):209.
Montbach, E. (2008). Flexible electronic flat-panel displays find novel uses. SPIE News-
room.
Nisisako, T., Torii, T., Takahashi, T., and Takizawa, Y. (2006). Synthesis of monodisperse bicolored Janus particles with electrical anisotropy using a microfluidic co-flow system. Advanced Materials, 18(9):1152–1156.
Ota, I., Ohnishi, J., and Yoshiyama, M. (1973). Electrophoretic image display (EPID) panel. Proceedings of the IEEE, 61(7):832–836.
Sakurai, R., Ohno, S., Kita, S.-i., Masuda, Y., and Hattori, R. (2007). Color displays and flexible displays using quick-response liquid-powder technology for electronic paper. Journal of the Society for Information Display, 15(2):127.
Schoot, C. J., Ponjee, J. J., van Dam, H. T., van Doorn, R. A., and Bolwijn, P. T. (1973).
New electrochromic memory display. Applied Physics Letters, 23(2):64–65.
Sheridon, N. K., Richley, E. A., Mikkelsen, J. C., Tsuda, D., Crowley, J. M., Oraha, K. A.,
Howard, M. E., Rodkin, M. A., Swidler, R., and Sprague, R. (1999). The gyricon rotating
ball display. Journal of the Society for Information Display, 7(2):141.
Taii, Y., Higo, A., Fujita, H., and Toshiyoshi, H. (2006). A transparent sheet display by
plastic MEMS. Journal of the Society for Information Display, 14(8):735.
Tamaoki, N. (2001). Cholesteric liquid crystals for color information technology. Advanced
Materials, 13(15):1135–1147.
In: Handwriting: Recognition, Development and Analysis ISBN: 978-1-53611-937-4
Editors: Byron L. D. Bezerra et al.
© 2017 Nova Science Publishers, Inc.

Chapter 13

Speed and Legibility

Monique Herrera Cardoso and Simone Aparecida Capellini
1. Introduction
The domain of writing skills remains an important goal for school children (McCarney et al., 2013) and therefore deserves greater attention from educators and health professionals (Feder and Majnemer, 2007), because it plays an important role in school performance, with relevant implications for motor and cognitive development (Accardo et al., 2013).
When it comes to academic performance, the academic skills of students from elementary to high school are measured by tests involving proficient writing, whether bimonthly tests or the Secondary Education National Examination (Enem). However, poor handwriting may mask the academic ability of a student, because teachers tend to give higher grades to legible handwriting tasks than to illegible ones (McCarney et al., 2013).
This negative perception tends to become more pronounced as schooling increases, especially when handwriting is confusing and difficult to read (Piek et al., 2008). When the child is competent in other areas, difficulties with writing may be attributed to laziness or lack of motivation rather than to a specific learning difficulty that affects language production in its written form (Berninger et al., 2009).
According to international studies, handwriting ability affects the quality of students' texts, from the first year of primary school (Graham et al., 1997) to the undergraduate level (Peverly, 2006). Medwell and Wray (2008) reported in their study that the cognitive processes involved in writing compete with other processes, such as planning and generating ideas. Thus, when the child has not yet reached an automated level of writing, he or she starts to use working memory resources to focus attention on forming letters and words, with a negative effect on the quality of his or her textual composition.
This finding was replicated in a study of adult writers (Tucha et al., 2008), whose fluency was shown to be impaired when they were required to pay close attention to their writing. While the child still has to consciously focus on the writing mechanics, the amount of working memory available to concentrate on the content being written is reduced (McCarney et al., 2013; Prunty et al., 2013).
However, in Brazil there are two problems: (1) the methodology for teaching calligraphy is not standardized, i.e., each school defines for itself how much the student should practice; and (2) specific assessment instruments, based on appropriate criteria for school children, that measure performance speed and observe aspects of writing legibility are not available.
Brazilian researchers (Cardoso et al., 2014) translated and adapted the Detailed Assessment of Speed of Handwriting - DASH (Barnett et al., 2007) for the Brazilian population. DASH is a standardized procedure, normed on a representative stratified sample of the UK population, that identifies children from 9 to 16 years of age with handwriting difficulties.
It comprises five tasks: four writing tasks (Task 1 - Copy Best; Task 2 - Alphabet Writing; Task 3 - Copy Fast; Task 5 - Free Writing) and one perceptual-motor measure that does not involve linguistic aspects (Task 4 - Graphic Speed).
The thematic Free Writing Task is clearly the one most similar to an exam/test environment for the student. Therefore, knowing what is expected of proficient writers in terms of legibility and writing speed makes it possible to help professionals working in education to identify students with difficulties in developing writing skills and to quantify underperformance relative to chronological age and school grade.
Given the above, this chapter presents the performance of Brazilian students in the thematic Free Writing Task, according to the Brazilian version of the DASH.
2. Method
This study was approved by the Ethics Committee of the Faculty of Philosophy and Sciences, São Paulo State University "Júlio de Mesquita Filho", UNESP - Marília - São Paulo, Brazil, under protocol number 0444/2012.
The study comprised five public schools located in the city of Marília (SP) and its region: one state school and one municipal school located in a rural area; one state and one municipal school located in the south area; and one municipal school located in the north area. These schools take part in training conducted by the research group of the Research Laboratory of Learning Deviations - LIDA - Department of Speech Pathology, UNESP Marília (SP); therefore, the students are regularly accompanied by the entire staff (pedagogues, speech-language therapists, occupational therapists, neuropsychologists and child neurologists) and are diagnosed and referred for intervention when necessary.
The students selected to participate in this study ranged in age from 9 years to 16 years and 11 months, the Informed Consent was signed by parents or guardians, and the students' school records could not contain notes of sensory, motor or cognitive impairments and/or hearing, visual or motor complaints. Failure to comply with at least one of these criteria automatically excluded the student.
As a further selection criterion, a cognitive assessment test (Raven Progressive Matrices) was employed to exclude cases of intellectual disability, and a survey was conducted with the pedagogical coordinators; students who had school complaints, psycho-affective problems or other diagnoses (e.g., autism, attention deficit and hyperactivity disorder, dyslexia and others) were excluded from the sample.
According to these criteria, 658 students of both genders, from 9 years to 16 years and 11 months old, took part in this study and were divided into age groups (Table 1).
Table 1. Distribution of the students by group, age and gender

Group   Ages                            N     Males N (%)    Females N (%)
GI      09 to 09 years and 11 months    102   58 (57)        44 (43)
GII     10 to 10 years and 11 months    99    44 (44)        55 (56)
GIII    11 to 11 years and 11 months    86    41 (47)        45 (53)
GIV     12 to 12 years and 11 months    90    43 (48)        47 (52)
GV      13 to 13 years and 11 months    87    39 (45)        48 (55)
GVI     14 to 14 years and 11 months    68    28 (41)        40 (59)
GVII    15 to 15 years and 11 months    61    27 (44)        34 (56)
GVIII   16 to 16 years and 11 months    65    34 (52)        31 (48)
Total                                   658   314 (47.65)    344 (52.35)
Data collection was held in groups of 15 to 20 students, in a single session not exceeding 50 minutes. The students used lined sheets and their own pencils and were allowed to use erasers.
To perform Task 5 (Thematic Free Writing), the student was asked to write an essay on "My Life" for 10 minutes; every two minutes the student had to make a mark in the text, which allowed us to monitor the student's production rate over different periods of time.
For educational and clinical purposes, these data can be very informative. For example, they allow us to distinguish the child who is consistently slow throughout the 10-minute period from the child who writes a lot for a minute and then simply runs out of ideas. There is no doubt that a task like this closely resembles an examination environment for the student.
To calculate the student's writing speed while taking writing legibility into account, it was first necessary to identify the legible words (step 1). After this identification, the writing speed could be calculated (step 2).
Step 1 - Identification of the legible words
In order to obtain reliable and trustworthy results, the students' writing samples were judged for legibility by four professionals who work with calligraphy and the analysis of school spelling: a speech therapist, two occupational therapists and a pedagogue, i.e., professionals who share the same concept of calligraphy but work in different areas. As a precaution, four judges were selected, in case any of them could not perform the analysis within the time proposed by the researcher.
A meeting was held between the judges and the researcher, at which a guidance script explaining how the samples should be analyzed was handed out, a 2-hour training session was given and, finally, questions and doubts were resolved. At the end of the meeting, the researcher provided her contacts (e-mail and phone) in case the judges had any questions and/or difficulties during the analysis of the samples.
The writing samples were scanned, the names of the students were omitted, and the samples were delivered to each judge on a flash drive containing 8 files in Microsoft Office PowerPoint format (version 97-2003 or later). Each file contained the samples for a specific age, but only the researcher knew which age corresponded to a particular file. This measure ensured that the judges did not know the age of the student who produced a given writing sample and consequently did not make comparisons or judgments related to the student's age.
The judges were to read every word written by the student only once. If they did not understand a word, they were not to insist on re-reading it or to "try" to understand it from the context of the phrase. Each word read by the judge was classified as follows:
(A1) Legible: a word the judge decoded easily, regardless of the context of the sentence. Sometimes a word contained poorly formed letters that, taken out of the word, would not be legible; however, if the word as a whole could still be read, it was classified as legible. When the judge classified a word as legible, no mark was made.
(A2) Partially legible: a word the judge was able to read but had difficulty decoding. When the judge classified a word as partially legible, he drew a blue rectangle around it.
(A3) Totally illegible: a word the judge failed to read because of the difficulty of decoding it. When the judge classified a word as totally illegible, he drew a red rectangle around it.
Cronbach's alpha was employed to check the level of reliability, in terms of the so-called 'internal consistency', of the observed values. The alpha varied between 0.700 and 1.000 for all variables analyzed by the judges, which allowed us to consider the sample as having a 'high' level of reliability, providing an unbiased sample for this task.
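For reference, Cronbach's alpha for k judges is alpha = k/(k-1) * (1 - sum(var_i)/var_total), where var_i is the variance of judge i's ratings and var_total the variance of the summed ratings. A minimal sketch of the computation, with hypothetical scores, follows:

```python
import numpy as np

def cronbach_alpha(ratings):
    """ratings: 2-D array, one row per writing sample, one column per
    judge (e.g., 0 = illegible, 1 = partially legible, 2 = legible)."""
    ratings = np.asarray(ratings, dtype=float)
    k = ratings.shape[1]                      # number of judges
    item_vars = ratings.var(axis=0, ddof=1)   # variance per judge
    total_var = ratings.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Four judges rating the same ten samples (hypothetical scores):
rng = np.random.default_rng(0)
scores = rng.integers(0, 3, size=(10, 4))
print(cronbach_alpha(scores))
```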
After this stage, it could be inferred that the concept of legibility did not diverge among the professionals; therefore, the analysis of the pedagogue judge was randomly selected to be employed in the speed calculation (step 2).
Step 2 - Calculation of the writing speed
The writing speed was calculated according to the performance presented by each study group in the thematic writing task. It was measured as legible written words per minute (wpm), that is, the number of words written by each group minus the number of words considered unreadable.
For the Thematic Writing Task, the speed was calculated for every two minutes and for the activity as a whole (10 minutes).
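A minimal sketch of this two-step calculation, assuming one legibility label (A1/A2/A3) per written word and the word index recorded at each 2-minute mark; per the definition above, only totally illegible (A3) words are discounted:

```python
LEGIBLE, PARTIAL, ILLEGIBLE = "A1", "A2", "A3"

def words_per_minute(labels, marks, interval_min=2):
    """Writing speed per 2-minute interval and over the whole task.
    labels: legibility class (A1/A2/A3) of each written word, in order.
    marks:  word index recorded at each 2-minute mark (the last mark
            closes the 10-minute task)."""
    speeds, start = [], 0
    for end in marks:
        readable = sum(1 for lab in labels[start:end] if lab != ILLEGIBLE)
        speeds.append(readable / interval_min)   # wpm for this interval
        start = end
    total_min = interval_min * len(marks)
    overall = sum(1 for lab in labels if lab != ILLEGIBLE) / total_min
    return speeds, overall

# Example: 5 marks over 10 minutes, one label per word (hypothetical).
labels = [LEGIBLE] * 40 + [PARTIAL] * 5 + [ILLEGIBLE] * 5
speeds, overall = words_per_minute(labels, marks=[12, 24, 34, 43, 50])
```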
From the analyses of the samples, the data were tabulated in a Microsoft Office Excel spreadsheet (version 2010), and then descriptive and statistical analyses were conducted.
Taking into account only the legible words, the writing speed was calculated for each study group (GI, GII, GIII, GIV, GV, GVI, GVII, GVIII), for every two minutes of the task and for the whole 10 minutes (Table 3).
These results show that students tend to write faster (in words per minute) during the first two minutes of the task, except for the students aged 14, 15 and 16 (GVI, GVII and GVIII), who presented higher speed from the second to the fourth minute. These findings show that students easily conceive the beginning of their textual composition, freeing most of their attentional resources and working memory for forming letters and words (Graham et al., 2006). However, as task exposure time passes, the student must focus on the content he or she is writing, which decreases the speed of spelling the sequences of letters and words and, finally, of constructing phrases.
It can also be observed that writing speed decreases as task exposure time increases, except for the 12-year-old students (GIV), whose speed increases in the "4 to 6 minutes" interval. This reduction in writing speed can be explained by the students' tiredness/fatigue during the task. This finding supports the study of O'Mahony et al. (2008), which reported that writing speed is affected by several cognitive, motivational and physiological factors, such as muscle fatigue due to a prolonged writing period.
Table 3. Writing speed, according to age, for thematic Free Writing Task
Then, the Jonckheere-Terpstra test was applied to verify possible differences among the eight groups when compared simultaneously. There were statistically significant differences in the variables "up to 2 minutes", "from 2 to 4 minutes", "from 4 to 6 minutes" and "during the 10 minutes". That is, from the 6th minute of the task onward, there was no difference in writing speed when the groups were compared.
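The Jonckheere-Terpstra statistic is not part of the common Python statistics libraries, but it reduces to a sum of pairwise Mann-Whitney counts over the age-ordered groups. A sketch using the large-sample normal approximation (with ties handled only crudely) might look like:

```python
import numpy as np
from scipy.stats import norm

def jonckheere_terpstra(groups):
    """JT test for an ordered alternative across groups (here GI..GVIII
    in increasing age). Returns the JT statistic and a two-sided
    p-value via the normal approximation (tie correction omitted)."""
    J = 0.0
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            x, y = np.asarray(groups[i]), np.asarray(groups[j])
            # Count pairs where the later (older) group scores higher.
            J += (y[:, None] > x[None, :]).sum()
            J += 0.5 * (y[:, None] == x[None, :]).sum()
    n = np.array([len(g) for g in groups])
    N = n.sum()
    mean = (N**2 - (n**2).sum()) / 4.0
    var = (N**2 * (2 * N + 3) - (n**2 * (2 * n + 3)).sum()) / 72.0
    z = (J - mean) / np.sqrt(var)
    return J, 2 * norm.sf(abs(z))

# Usage (per-student wpm scores for the eight groups, GI first):
# J, p = jonckheere_terpstra([wpm_GI, wpm_GII, ..., wpm_GVIII])
```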
This finding raises the question of whether the task should really take 10 minutes, considering that fatigue may have impaired the performance of the students in all eight groups. Another factor that could explain this finding is the students' pauses while writing, that is, a time variable (Benbow, 1995); it would be necessary to investigate whether, during the final 4 minutes of the task, the students pause more, either because of fatigue or because of difficulty in continuing the suggested writing theme.
According to the literature (Olive, 2010; Sumner et al., 2013), writing is a continuous movement interrupted by 'breaks', i.e., temporary interruptions in the written trace. Some breaks are "normal", being simply imposed by the text to be written, as in the case of the spaces between words and between letters (Paz-Villagrán et al., 2014). However, studies have shown that dyslexic students take longer to write the same text not because their movements are slower, but because they make more, or longer, breaks than proficient writers (Rosenblum et al., 2003a,b).
In addition to investigating breaks, one should also consider ergonomic factors, such as posture during writing and/or pencil grip, as they may change over the course of the task (Rosenblum et al., 2006; Tomchek and Schneck, 2006) and can hinder proper writing performance, causing reduced legibility, pain and fatigue in the upper limbs (de Almeida et al., 2013; Sassoon, 2004) and, hence, decreased writing speed.
Figure 1 shows that writing speed increases from 9 to 13 years; from 14 to 16 years, however, the speed declines, with little variation among these ages.
Figure 1. Box plot of the writing speed score in Task 5, per group. The box represents 50% of the results (its bottom line is the 25th percentile and its top line the 75th percentile), the line inside the box represents the mean found, and the external lines represent the maximum and minimum values found.
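A figure of this kind can be reproduced with standard plotting tools; note that matplotlib's boxplot draws the median line by default, so the mean is requested explicitly here. The group scores below are hypothetical:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical words-per-minute scores for the eight groups.
speeds = [rng.normal(m, 3.0, 80) for m in (8, 10, 12, 13, 14, 13.5, 13, 13)]
plt.boxplot(speeds, showmeans=True)   # box = interquartile range
plt.xticks(range(1, 9), ["GI", "GII", "GIII", "GIV", "GV", "GVI", "GVII", "GVIII"])
plt.ylabel("Writing speed (words per minute)")
plt.xlabel("Group")
plt.show()
```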
This increase in speed with increasing age can be explained by the fact that, in early learning, movements are slow and guided by visual and kinesthetic feedback (Chartrel and Vinter, 2006). That is, during handwriting, information such as the pressure on pencil and paper, the positioning of the fingers and hand, the direction of pencil movements and mistakes is stored in memory, to be recalled when writing is repeated (Almeida, 2013). With practice, writing becomes automatic, and the control of coordinated writing movements improves with age and education, favoring increased writing speed (Sovik, 1993).
However, between 14 and 16 years (from the ninth grade of basic education to the second year of high school), speed tends to decrease, because the students' concern goes beyond calligraphy: the requirements now involve planning and revising the written text (Sampaio and Cardoso, 2015; Kim et al., 2015), which are the most critical aspects of writing performance (Bourdin and Fayol, 2002) and ensure a better written text - cohesive, coherent and respecting the standard language and communication (Alves et al., 2008; Pontart et al., 2013).
Conclusion
This chapter describes what is expected, in terms of writing proficiency, for each age group in the thematic Free Writing Task of the DASH, a task similar to a test/examination environment in that it requires legible writing with sequential ideas within a set period of time. The main results showed that:
• Age is not a determining factor for calligraphic quality, but it influences writing speed;
• Students tend to write faster during the first two minutes of the task;
• Speed decreases as task exposure time increases;
• From the 6th minute of the task onward, the students' performance is similar, that is, writing speed no longer differs when the groups are compared;
• From 9 to 13 years, writing speed increases; however, from 14 to 16 years it declines.
These findings emphasize the important role of education professionals, not only in handwriting development but also in knowing what is expected in terms of legibility and writing speed; with this knowledge, professionals will be able to monitor and identify the difficulties presented by students and to observe underperformance relative to chronological age.
Among the limitations of this study, new studies are needed to compare the performance of students by gender and to validate the instrument, as these investigations will provide greater credibility and confidence in the findings presented here.
Acknowledgment
We thank the São Paulo Research Foundation (FAPESP) and the Coordination for the Improvement of Higher Education Personnel (CAPES) for financial support.
References
Accardo, A. P., Genna, M., and Borean, M. (2013). Development, maturation and learning
influence on handwriting kinematics. Human movement science, 32(1):136–146.
Almeida, I. C. (2013). Avaliação do processo de escrita manual em crianças com pobre qualidade da caligrafia e boa qualidade da caligrafia. Master Diss., Escola Superior de Saúde do Alcoitão, Santa Casa da Misericórdia de Lisboa.
Alves, R. A., Castro, S. L., and Olive, T. (2008). Execution and pauses in writing narratives:
Processing time, cognitive effort and typing skill. International journal of psychology,
43(6):969–979.
Barnett, A., Henderson, S., Scheib, B., and Schulz, J. (2007). The Detailed Assessment of
Speed of Handwriting (DASH). Manual. Pearson Education.
Benbow, M. (1995). Principles and practices of teaching handwriting. Hand function in the
child, pages 255–281.
Berninger, V. W., Abbott, R. D., Augsburger, A., and Garcia, N. (2009). Comparison of
pen and keyboard transcription modes in children with and without learning disabilities.
Learning Disability Quarterly, 32(3):123–141.
Bourdin, B. and Fayol, M. (2002). Even in adults, written production is still more costly
than oral production. International Journal of Psychology, 37(4):219–227.
Cardoso, M. H., Henderson, S., and Capellini, S. A. (2014). Translation and cultural adaptation of the Brazilian Detailed Assessment of Speed of Handwriting: conceptual and semantic equivalence. Audiology-Communication Research, 19(4):321–326.
Chartrel, E. and Vinter, A. (2006). Rôle des informations visuelles dans la production de lettres cursives chez l'enfant et l'adulte. L'Année psychologique, 106(1):43–63.
de Almeida, P. H. T. Q., Sorensen, C. B. S., Magna, L. A., Cruz, D. M. C., and Ferrigno,
I. S. V. (2013). Avaliação da escrita através da fotogrametria–estudo da preensão trı́pode
dinâmica. Revista de Terapia Ocupacional da Universidade de São Paulo, 24(1):38–47.
Graham, S., Berninger, V. W., Abbott, R. D., Abbott, S. P., and Whitaker, D. (1997). Role
of mechanics in composing of elementary school students: A new methodological ap-
proach. Journal of educational psychology, 89(1):170.
Graham, S., Struck, M., Santoro, J., and Berninger, V. W. (2006). Dimensions of good and
poor handwriting legibility in first and second graders: Motor programs, visual–spatial
arrangement, and letter formation parameter setting. Developmental neuropsychology,
29(1):43–60.
Kim, Y.-S., Al Otaiba, S., and Wanzek, J. (2015). Kindergarten predictors of third grade
writing. Learning and individual differences, 37:27–37.
McCarney, D., Peters, L., Jackson, S., Thomas, M., and Kirby, A. (2013). Does poor
handwriting conceal literacy potential in primary school children? International Journal
of Disability, Development and Education, 60(2):105–118.
Medwell, J. and Wray, D. (2008). Handwriting–a forgotten language skill? Language and
Education, 22(1):34–47.
Olive, T. (2010). Methods, techniques, and tools for the on-line study of the writing process.
Writing: processes, tools and techniques, pages 1–18.
O’Mahony, P., Dempsey, M., and Killeen, H. (2008). Handwriting speed: duration of test-
ing period and relation to socio-economic disadvantage and handedness. Occupational
Therapy International, 15(3):165–177.
Overvelde, A. and Hulstijn, W. (2011). Handwriting development in grade 2 and grade 3
primary school children with normal, at risk, or dysgraphic characteristics. Research in
developmental disabilities, 32(2):540–548.
Paz-Villagrán, V., Danna, J., and Velay, J.-L. (2014). Lifts and stops in proficient and
dysgraphic handwriting. Human movement science, 33:381–394.
Peverly, S. T. (2006). The importance of handwriting speed in adult writing. Developmental
Neuropsychology, 29(1):197–216.
Piek, J. P., Dawson, L., Smith, L. M., and Gasson, N. (2008). The role of early fine and
gross motor development on later motor and cognitive ability. Human movement science,
27(5):668–681.
Pontart, V., Bidet-Ildei, C., Lambert, E., Morisset, P., Flouret, L., and Alamargot, D. (2013).
Influence of handwriting skills during spelling in primary and lower secondary grades.
Frontiers in psychology, 4:818.
Prunty, M. M., Barnett, A. L., Wilmut, K., and Plumb, M. S. (2013). Handwriting speed in
children with developmental coordination disorder: Are they really slower? Research in
developmental disabilities, 34(9):2927–2936.
Rosenblum, S., Goldstand, S., and Parush, S. (2006). Relationships among biomechanical
ergonomic factors, handwriting product quality, handwriting efficiency, and computer-
ized handwriting process measures in children with and without handwriting difficulties.
American Journal of Occupational Therapy, 60(1):28–39.
Rosenblum, S., Parush, S., and Weiss, P. L. (2003a). Computerized temporal handwriting
characteristics of proficient and non-proficient handwriters. American Journal of Occu-
pational Therapy, 57(2):129–138.
Rosenblum, S., Weiss, P. L., and Parush, S. (2003b). Product and process evaluation of
handwriting difficulties. Educational Psychology Review, 15(1):41–81.
Sampaio, M. N. and Cardoso, M. H. (2015). A produção textual e a interferência da legibilidade e velocidade. In: Olga Valéria Campana dos Anjos Andrade, Paola Matiko Martins Okuda, and Simone Aparecida Capellini (Org.), Tópicos em Transtornos de Aprendizagem - Parte IV, 1st ed., Marília: Fundepe, IV(1):73–88.
Sassoon, R. (2004). The art and science of handwriting. Intellect Books.
Sovik, N. (1993). Development of children’s writing performance: Some educational im-
plications. Motor development in early and later childhood: Longitudinal approaches,
pages 229–246.
Sumner, E., Connelly, V., and Barnett, A. L. (2013). Children with dyslexia are slow writers
because they pause more often and not because they are slow at handwriting execution.
Reading and Writing, 26(6):991–1008.
Tucha, O., Tucha, L., and Lange, K. W. (2008). Graphonomics, automaticity and handwrit-
ing assessment. Literacy, 42(3):145–155.
In: Handwriting: Recognition, Development and Analysis ISBN: 978-1-53611-937-4
Editors: Byron L. D. Bezerra et al.
© 2017 Nova Science Publishers, Inc.

Chapter 14

Datasets for Handwritten Signature Verification

V. K. S. L. Melo, B. L. D. Bezerra, R. H. S. N. Do Nascimento et al.
1. Introduction
In modern society, biometric technology is used in several security applications for personal authentication. The aim of such systems is to confirm a person's identity based on physiological or behavioral traits. In the first case, recognition is based on biological characteristics such as fingerprint, palmprint, iris, face, etc.; the latter relies on behavioral traits such as voice pattern and handwritten signature (Jain et al., 2004).
The handwritten signature remains one of the main approaches to identity authentication. One of the reasons for its widespread use is that acquisition is easy and non-invasive, and most individuals are familiar with signatures in their daily life (Impedovo and Pirlo, 2008). Due to this convenience, signatures are employed as a sign of confirmation on a wide range of important documents, especially bank checks, credit card transactions, identification documents and a variety of business certificates and contracts.
However, as a behavioral trait, signatures are susceptible to spoofing attacks, i.e., attempts to imitate the signature of an enrolled user in order to fool the system (Jain et al., 2004). Two types of impostors are considered: casual impostors (producing random forgeries), who have no information about the authentic writer's signature, and real impostors (producing skilled forgeries), who use some information about the signature (Fierrez and Ortega-Garcia, 2008).
If a signature on a document is false, the document itself is considered invalid. Thus, preventing fraud in the signature verification process has been a challenge for researchers around the world. However, manual signature-based authentication of a large set of documents is a difficult, time-consuming and labor-intensive task. Hence, several Automatic Handwritten Signature Verification Systems (AHSVS) have been proposed. These systems aim to decide automatically whether a query signature does in fact belong to a particular person.
An AHSVS is essentially a pattern recognition application that receives a signature as input, extracts a feature set from the data and classifies the sample using a template database as a reference. Like any pattern recognition system, AHSVS are learning-based, which requires datasets that can be used to assess their performance and to create accurate signature verification methods. These datasets contain signatures digitized either with an optical scanner, which obtains the signature directly from the paper, or with an acquisition device such as a digitizing tablet or an electronic pen with digital ink. The two approaches are known as offline (static) and online (dynamic), respectively. In the online modality, data is recorded during the writing process and consists of a temporal sequence of the two-dimensional coordinates (x, y) of consecutive points, whereas in the offline case only a static representation of the completed writing is available, as an image. Moreover, each representation has specific attributes not present in the other (Viard-Gaudin et al., 1999). For instance, online data does not include information about the width of the strokes or the texture of the ink on the paper, while the offline representation has lost all dynamic information about the writing process. As a result, features such as the pen trajectory, which can easily be computed in the online domain, can only be inferred from a static image (Nel et al., 2005).
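The pipeline just described (input signature, feature extraction, comparison against enrolled templates) can be sketched as follows. The grid-density feature and the distance threshold are illustrative stand-ins chosen by us, not a method proposed in this chapter:

```python
import numpy as np

def extract_features(signature_img):
    """Toy feature vector for an offline sample: ink density in each
    cell of a 4x4 grid over the binarized signature image."""
    img = (np.asarray(signature_img) < 128).astype(float)  # 1 = ink pixel
    h, w = img.shape
    return np.array([img[i*h//4:(i+1)*h//4, j*w//4:(j+1)*w//4].mean()
                     for i in range(4) for j in range(4)])

def verify(query_img, enrolled_templates, threshold=0.15):
    """Accept the query if its features lie close enough to the nearest
    enrolled template of the claimed writer (toy decision rule)."""
    q = extract_features(query_img)
    distance = min(np.linalg.norm(q - t) for t in enrolled_templates)
    return distance <= threshold
```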
In the last few years, several handwritten signature datasets have been created, and some have been made publicly available. The general corpus consists of a set of genuine and forged signatures for each writer and can be categorized along different dimensions, including modality (online or offline), script and size.
Although many datasets containing samples for the offline or online modality, or both combined, have been proposed, these datasets normally do not convey some important real-world challenges and thus do not assess the robustness of the systems in real-world scenarios. Consequently, such systems often fail to deliver the expected results when employed in practice (Ahmed et al., 2013).
In practical scenarios, signatures are acquired under a wide range of conditions and in both modalities. Online acquisition conditions vary with the type of device, e.g., smartphones or different models of digitizing tablets. Moreover, when dealing with offline signatures, most samples appear on documents with complex backgrounds and with different signing-area constraints; examples of such documents include bank checks, contracts, identification documents, forms, etc. (Ahmed et al., 2013, 2012). These distinct types of signatures often need to be integrated into the same system in an interoperable manner.
With regard to signature verification interoperability, many research problems are open to investigation, such as (i) the development of complete document authentication systems involving both signature segmentation and the verification process, taking into account different signing-area constraints; (ii) the analysis of the implications for AHSVS of combining signatures acquired on smartphones and on conventional digitizing tablets; and (iii) the development of systems capable of integrating online and offline samples interchangeably, towards a unified signature biometry. With the currently available datasets, investigation of these research problems is limited to samples acquired in controlled environments or cannot be carried out at all.
Work has been done on topics related to signature verification interoperability. Qiao et al. (2007) proposed an offline signature verification system that uses online handwritten signatures instead of images in the registration phase; however, in their experiments the authors used synthetic offline images generated by interpolating online signature samples. Uppalapati (2007) proposed a system to integrate both modalities of handwritten signatures, not only providing a method to match offline signatures against online ones and vice versa, but also using both static and dynamic features, when available, to improve system performance. Ahmed et al. (2012) proposed a method for signature extraction from documents; notably, its segmentation accuracy was evaluated only at the patch level, which, according to the authors, is due to the lack of publicly available datasets containing stroke-level ground truth for signatures. Ahmed et al. (2013) discuss the current non-applicability of most signature verification systems and the lack of complete document authentication systems involving signature segmentation and verification; according to the authors, this is due to the absence of datasets suitable for the development of such systems, containing both patch- and stroke-level ground truth. Pirlo et al. (2015) investigated the effects of signing-area constraints on geometric features of online signatures. Diaz-Cabrera et al. (2014) proposed several approaches to synthetically generate offline signatures, simulating the deposition of pen ink on paper based on dynamic information from online signatures. Zareen and Jabin (2016) presented a comprehensive survey of mobile biometric systems and proposed a method for online signature verification; the approach was evaluated on the SVC dataset (Yeung et al., 2004) and on a database acquired with a mobile device (Martinez-Diaz et al., 2014).
Aiming to overcome the limitations of the current state of handwritten signature datasets, we present RPPDI-SigData, an evaluation dataset for AHSVS that includes signatures captured in both the online and offline modalities and under different signing conditions. Samples for the online modality were acquired on smartphones and digitizing tablets, and samples for the offline domain were acquired on documents with complex backgrounds (including stroke-level ground truth) and different signing-area constraints.
Alongside the description of RPPDI-SigData, this chapter also summarizes 17 publicly available handwritten signature datasets. Our goal is to give the reader an overview of the existing evaluation datasets and their main characteristics, such as the number of samples, protocols, type of forgeries and script.
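One convenient way to hold such an overview is as a list of metadata records keyed by the dimensions just mentioned; the figures below are taken from the dataset descriptions in Section 2, while the record layout itself is our own:

```python
# Metadata records for three of the datasets surveyed in Section 2.
DATASETS = [
    {"name": "GPDS-960", "modality": "offline", "script": "Western",
     "writers": 960, "genuine_per_writer": 24, "forgeries_per_writer": 30},
    {"name": "MCYT-100", "modality": "online", "script": "Western",
     "writers": 100, "genuine_per_writer": 25, "forgeries_per_writer": 25},
    {"name": "UTSig", "modality": "offline", "script": "Persian",
     "writers": 115, "genuine_per_writer": 27, "forgeries_per_writer": 45},
]

def filter_datasets(modality=None, script=None):
    """Select datasets along the comparison dimensions used in this chapter."""
    return [d for d in DATASETS
            if (modality is None or d["modality"] == modality)
            and (script is None or d["script"] == script)]

print([d["name"] for d in filter_datasets(modality="offline")])
```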
The rest of the chapter is structured as follows: Section 2 presents an overview of the existing evaluation datasets for handwritten signature verification, and Section 3 describes the proposed RPPDI-SigData dataset.
2.1. GPDS960
The GPDS-960 signature corpus (Vargas et al., 2007) is a Spanish offline handwritten signature database containing genuine and forged samples for 960 individuals. For each subject there are 24 genuine signatures and 30 forgeries, giving 960 × 24 = 23,040 genuine signatures and 960 × 30 = 28,800 forgeries. The genuine signatures were collected in a single session in which each subject was asked to sign on a form with 24 signing boxes, half of the boxes measuring 5 × 3.5 cm and the other half 5.5 × 2.5 cm. The forgeries were made by a different set of 1,920 individuals; each forger filled in a form containing 15 boxes for 5 randomly selected genuine signatures, each of which was to be forged 3 times. Each form was scanned with an HP4400 device at 256-level grayscale and 300 dpi resolution. There are black-and-white and grayscale versions of the dataset; however, due to a problem during a move, the authors lost the grayscale signatures of 79 users, leaving 881 sets of writers in the grayscale version.
2.2. MCYT
The MCYT bimodal biometric database consists of a collection of fingerprint and signature
samples for 330 contributors from 4 different Spanish sites. Specifically for the MCYT-
Signature subcorpus, the samples were acquired using a pen tablet, model INTUOS A6
USB. During the acquisition, users provided their signatures using a pen and a paper tem-
plate over the tablet enabling both online as the offline sample to be captured. The creators
made available two subsets of the MCYT-Signature, namely the MCYT-100 (Ortega-Garcia
et al., 2003) and MCYT-75 (Fierrez-Aguilar et al., 2004). The first contains 25 genuine and
25 forgeries online samples of 100 authors, and the latter 15 genuine and 15 forgeries offline
signatures of 75 writers.
2.3. UTSig
The UTSig (University of Tehran Persian Signature) dataset (Soleimani et al., 2016) is a
Persian offline handwritten signature dataset consisting of samples for 115 male subjects.
Each subject has 27 genuine and 45 forged specimens of his signature, we can see in Fig-
ure 1 samples of 4 writers. Genuine signatures were acquired using a form containing 10
signing areas of 6 different sizes. The acquisition was made in 3 days, on each day the
writers signed 9 genuine signatures and the last the subject signed with his opposite hand.
They collected 3105 genuine and 345 opposite-hand signatures. The forged samples of the
dataset are divided into 3 categories.
Datasets for Handwritten Signature Verification 349
The creators consider as the first category of forgeries the 345 opposite-hand signatures
made by the authentic writers. The second category contains forged samples obtained from
a different set of 230 subjects. They were asked to fill 3 forms each containing 6 signing
boxes for 3 different writers, the forgers were free to practice as much as they want and the
observable sample varied from one to three genuine signatures of the authentic writer. The
last category was made by a more skilled forger, the form was the same as the second cat-
egory and the observable sample was only one genuine signature. All forms were scanned
with 600 dpi resolution and stored as 8-bit grayscale TIF files.
Figure 1. Four genuine samples and their respective forgeries of the UTSig dataset.
2.4. MyIDea
MyIDea (Dumas et al., 2005) is a multimodal biometric dataset that includes traits such as
talking face, audio, fingerprints, signature, handwriting and hand geometry. In particular,
the handwritten signatures were acquired using an A4 Intuos2 graphic tablet from Wacom,
the tablet records 5 parameters: x-y coordinates, pressure, azimuth, and altitude. An Intuos
350 V. K. S. L. Melo, B. L. D. Bezerra, R. H. S. N. Do Nascimento, et. al.
InkPen is used to write on a paper over the tablet, the main advantage of this procedure is
to record both online and offline data at the same time, ensuring a natural writing using an
almost standard pen and paper. The dataset contains data from 73 users from whom 46 users
are French and 27 English. Each subject is represented with 18 genuine signatures and 36
skilled forgeries. The forgeries were captured in two conditions. In the first condition, the
forgers were asked to produce 18 samples using only a paper-copy of the authentic signature
as a reference. For the second condition, the forgers were able to use a dedicated software
to study the signature dynamics on screen. In both conditions, they were allowed to train as
long as they want to gain confidence on the authentic signature format.
2.5. SIGMA
SIGMA (Ahmad et al., 2008) is a dataset containing signatures for 213 individuals. For
each subject, 30 genuine signatures were collected, and 10 forgeries were produced. Dur-
ing the acquisition of the forgeries, the subjects were given as much time and number of
genuine samples as they want. The creators also checked the resemblance of the forgeries
to the original signatures before including the sample in the database. The acquisition was
made using an A4 Intuos3 graphic tablet from Wacom, the online data stored was the x-y
coordinates along with the pen pressure at each coordinate. The offline sample was also
captured using an electronic ink pen on a paper form shown on Figure 2. Offline forms
were scanned as grayscale images using a 600 dpi resolution scanner.
2.6. SigComp2009
ICDAR 2009 Signature Verification Competition (SigComp2009) (van den Heuvel et al.,
2009) is a signature verification competition, which provided an online and offline dataset
containing training and evaluation sets. For training, NISDCC signature collection acquired
in WANDA project (Franke et al., 2003) was used. The training dataset consists of 12 users,
for each subject 5 genuine signatures were collected and 5 forged samples were made. The
forgeries samples were made by 31 forgers. The test set was collected in the Netherlands
Forensic Institute (NFI). It contains data for 100 writers and each represented by 12 genuine
and 6 forged samples made by the other participants.
2.7. 4NSigComp2010
The 4NSigComp2010 comprises two scenarios, the dataset for the first scenario was col-
lected by La Trobe University and was made available with the competition. whereas the
second scenario is a subset of the GPDS-960 database, previously reported on this chapter.
Signatures for the first scenario were collected using a ball-point pen on paper and scanned
at 600 dpi.
The training set of the first scenario consists of 9 reference signatures by one author and
200 questioned signatures including 76 genuine, 104 simulated (forged) signatures made
by 27 freehand forgers, and 20 disguised samples. Genuine and disguised samples were
made over a week. In addition, the writer signed another 81 genuine signatures to be used
as a reference for the forgery acquisition. The forgeries were made with the contribution of
27 volunteers, each subject had 3 out of 81 genuine samples, and imitated without tracing
in two ways: inspect the genuine signature and forge 3 times without practice and practice
simulating the genuine signature 15 times then forge the signature 3 times. The test set has
25 signatures of another person written during 5 days and 100 questioned samples including
3 genuine, 7 disguise, and 90 simulated (forged) signatures written by 34 freehand forgers
whom were either laypersons or calligraphers.
2.8. SigComp11
SigComp2011 (Liwicki et al., 2011) includes a dataset containing Chinese and Dutch sig-
natures samples. The collection contains both offline and online signature samples. The
acquisition of the signatures was made using a Wacom Intuos3 A3 Wide USB Pen Tablet,
the paper placed over the tablet had 12 signing boxes of size 59 x 23 mm. The paper was
scanned at a 400 dpi, RGB color and stored as eps images. Online data stored includes x-z
coordinates and pressure. Due to some problems during the signature acquisition, a number
of signatures in the online data sets are different from those in the offline set, an overview
of the number of samples on both datasets and both modes are provided in Table 1
2.9. 4NSigComp2012
In 4NSigComp2012 (Liwicki et al., 2012) the dataset introduced is similar to the database
of Scenario 1 of 4NSigComp2010. The data collected contains 3 sets of authentic authors,
A1, A2, and A3 respectively. Table 2 shows the number of specimens for each category of
signature samples.
Genuine and disguised samples were collected over 10 to 15 days. The number of forgers varied from 2 to 31, and each forger was provided with 3 to 6 authentic samples. The forgers used pen and pencil and forged both with and without practice.
2.10. SigWiComp13
In the SigWiComp13 (Malik et al., 2013) an offline Dutch and an online and offline
Japanese signature datasets were introduced. The offline Japanese dataset was converted
from online signatures that contain 30 classes with 42 genuine samples per class, made in 4
days, and 36 forgeries made by 4 forgers. Dutch dataset has 27 authentic writers who made
10 signatures with arbitrary writing instruments during 5 days. For forgeries, 9 subjects
used any to all of the supplied specimen signatures as a reference. In average, there are 36
forgeries for each authentic writer.
2.11. SigWiComp2015
With the SigWiComp2015 (Malik et al., 2015) three datasets were made available, namely
Bengali, Italian and German datasets. The Italian dataset consists of offline binarized signa-
tures divided into two separated sets: training and testing set. The training set is composed
of 50 writers with 5 reference signatures each. The testing set contains samples for the same
50 authors with 10 questioned signatures, each being either a genuine or a skilled forgery
signature. During the acquisition of the forgeries, the subjects were allowed to practice as
much as they want.
The Bengali offline signatures were collected from 10 contributors, each providing 24
genuine signatures. The imitators made 30 forgeries for each authentic writer and during
the process they were allowed to practice. All images were captured at 300 dpi resolution
with 256 levels of gray.
The German online signatures were collected with the Anoto digitizing pen (Malik et al., 2012) rather than a tablet. The data stored in this dataset includes x-y coordinates and pressure. For the training set, 30 genuine authors provided 10 genuine signatures each; for the evaluation, data for the same 30 authors contains 15 questioned specimens each, including genuine and skilled forgery samples.
2.12. SUSIG
The SUSIG (Sabanci University Signature) (Kholmatov and Yanikoglu, 2009) is divided
into two subcorpora: visual and blind. Signatures in the Visual Subcorpus were acquired
using an Interlink Electronics’s ePad-ink tablet which has a pressure-sensitive LCD screen
such that subjects could see their signatures while signing, whereas the Blind subcorpus
was collected using Wacom’s Graphire2 tablet and pen which provides no visual feedback.
Data stored for both subcorpora contains the x-y coordinates, time stamp, and pressure level
for each point.
The Visual Subcorpus consists of signatures of 100 individuals acquired over two ses-
sions that were approximately one week apart. Each subject provided 10 samples of his/her
signature in each session, for a total of 20 genuine samples. Each individual was also asked
to forge a randomly selected signature. Two types of forgeries were considered, skilled and
very skilled, with 5 of each type. For the first type, the forger could watch the signature's
animation on a monitor, whereas for the second type the animation of the genuine reference
was also mapped onto the LCD screen of the tablet in such a way that the forger could trace
over it.
The Blind Subcorpus contains signatures of a different set of 100 individuals: a group
of 30 subjects provided 8 genuine signatures each, while the remaining 70 supplied 10
genuine signatures each. Each user provided 10 forgeries for a randomly selected writer,
and the forgery acquisition process was the same as for the skilled forgeries of the Visual
Subcorpus.
2.13. SVC2004
SVC2004 (Yeung et al., 2004) was the first international signature verification competition.
The competition was divided into two separate tasks using two different signature databases
of online handwritten signatures. The difference between the two tasks is the information
collected for the dataset. While in the first task the data collected contains only coordinate
information, data for the second task also contains additional information such as pressure
and pen orientation. The aim of the different tasks is to simulate acquisition on devices such
as personal digital assistants (PDAs) in the case of the first task, and on digitizing tablets in
the second. Each dataset contains a set of signatures for 100 subjects; each set is composed
of 20 genuine signatures and 20 skilled forgeries made by at least four other contributors.
2.14. CEDAR
The CEDAR (Kalera et al., 2004) dataset is an offline handwritten signature dataset with 55
signers and a total of 2640 signature samples. For each signer, 24 genuine signatures were
collected and 20 arbitrary signers were chosen to skillfully forge the genuine specimens,
producing 24 skilled forgeries for each subject of the database. The signing area of the
form used to collect the signatures is a predefined space of 2 x 2 inches. The forms were
scanned at 300 dpi in 256 levels of gray and stored in EPS format.
2.15. Tobacco800
Tobacco800 (tob, 2007) is a dataset composed of 1290 complex document images which
contain information about signatures on printed text documents. Resolutions of documents
in Tobacco800 vary significantly from 150 to 300 DPI, and the dimensions of images range
from 1200 by 1600 to 2500 by 3200 pixels. The dataset also includes patch-level ground
truth for the signatures. Figure 3 shows a document containing a signature extracted from
the dataset.
3. RPPDI-SigData Dataset
In this section, we describe our proposed dataset, RPPDI-SigData. The main contribution
of this new dataset is the effort to complement existing handwritten signature verification
datasets, working towards closing the gap between AHSVS research and real-world
applications. This evaluation dataset provides signature samples acquired in both signature
modalities (online and offline).
During the acquisition of the online samples, two devices were used, namely a Wacom
STU-530 digitizing tablet and a Samsung Galaxy Note, in both cases using the specific
stylus for each device. Signatures of the offline modality were collected on 4 different
documents and on a form containing 4 different signing area sizes. Alongside the signed
documents, a stroke-level ground truth is also included.
The purpose of providing signature samples acquired with different devices and in
different document formats is to support the performance assessment of signature
verification interoperability. For instance, researchers can, but are not restricted to, use this
dataset to evaluate (i) complete document authentication systems that employ
both signature segmentation and verification methods, (ii) the integration of signature
verification across different acquisition devices, using signatures acquired on a smartphone
and on a conventional digitizing tablet, and (iii) systems capable of using online and offline
samples interchangeably. These open research problems are hardly addressed by the
publicly available datasets (see Section 2).
Regarding the composition of the dataset, it consists of signature samples from 15
contributors; for each contributor, an offline set and an online set were collected. The
offline set includes 1 signature on each of the ID, voter ID, driver's license and check, plus
a filled signature frame of 8 signatures, hence 12 offline samples. The online set includes 6
smartphone signatures and 4 digitizing-tablet signatures, thus 10 online signatures. The
forgeries were made on 2 checks and one filled signature frame (8 signatures), plus 6 on a
smartphone and 4 on a digitizing tablet, hence 10 offline and 10 online forgeries. Table 3
summarizes the number of samples collected in each category of the dataset.
The dataset was collected at the Polytechnic School of the University of Pernambuco;
the majority of the contributors were students. Participants were briefly informed about the
purpose of the acquisition but were not given details on how a signature verification system
works.
The genuine offline signatures were collected using personal documents such as ID,
voter ID and driver’s license, a check and a preprinted form containing 8 signing boxes
of 4 different sizes. All the documents were scanned at 300 dpi and stored as RGB files.
Figure 4 shows an example of these documents filled in by a contributor. The online samples
were acquired using the Wacom STU-530 pressure-sensitive tablet with its dedicated pen,
and on a Samsung Galaxy Note smartphone using an appropriate stylus. The data stored
from both devices comprises x-y coordinates, pressure and time.
The forgeries collected for the dataset were made by the participants themselves. Forgers
could take as long as they wanted to practice the signature and could use all samples of the
authentic writer. For the evaluation of the extraction process, the dataset contains a stroke-
level ground truth of the signatures on the documents. We followed a semi-automatic
procedure to create these images. Figure 5 shows a sample of the ground truth in the dataset
for a signed ID document.
Conclusion
In this chapter, we provided an overview of 17 publicly available handwritten signature
datasets, which are summarized in Table 4. Based on our review, we found a lack of
datasets containing signatures that were not pre-segmented from documents and that enable
multi-domain investigations. This motivated us to build a new dataset, RPPDI-SigData,
which allows the integrated process of signature extraction and verification on samples
acquired from different types of documents with complex backgrounds. The dataset also
includes online signatures acquired from a mobile device and a conventional digitizing
tablet.
With the availability of our proposed dataset, signature verification interoperability can
now be addressed. In future work, we plan to use this dataset to evaluate a system that
verifies signatures across different documents and acquisition sources.
Acknowledgments
The authors would like to thank CNPq for supporting the development of this chapter
through the research projects granted by “Edital Universal” (Process 444745/2014-9) and
“Bolsa de Produtividade DT” (Process 311912/2015-0). In addition, the authors
acknowledge Document Solutions for providing some of the devices used in this research.
References
(2007). The Legacy Tobacco Document Library (LTDL). University of California, San
Francisco.
Ahmad, S. M. S., Shakil, A., Ahmad, A. R., Balbed, M. A. M., and Anwar, R. M. (2008).
SIGMA-A Malaysian signatures database. In 2008 IEEE/ACS International Conference
on Computer Systems and Applications, pages 919–920. IEEE.
Ahmed, S., Malik, M. I., Liwicki, M., and Dengel, A. (2012). Signature segmentation from
document images. In Frontiers in Handwriting Recognition (ICFHR), 2012 International
Conference on, pages 425–429. IEEE.
Ahmed, S., Malik, M. I., Liwicki, M., and Dengel, A. (2013). Towards Signature Segmen-
tation & Verification in Real World Applications. In Proceedings of the 16th Biennial
Conference of the International Graphonomics Society, pages 139–142.
Diaz-Cabrera, M., Gomez-Barrero, M., Morales, A., Ferrer, M. A., and Galbally, J. (2014).
Generation of enhanced synthetic off-line signatures based on real on-line data. In
Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on,
pages 482–487. IEEE.
Dumas, B., Pugin, C., Hennebert, J., Petrovska-Delacrétaz, D., Humm, A., Evéquoz, F., In-
gold, R., and Von Rotz, D. (2005). MyIDea-multimodal biometrics database, description
of acquisition protocols. Proc. Third COST, 275:59–62.
Franke, K., Schomaker, L., Veenhuis, C., Taubenheim, C., Guyon, I., Vuurpijl, L., van
Erp, M., and Zwarts, G. (2003). WANDA: A generic Framework applied in Forensic
Handwriting Analysis and Writer Identification. HIS, 105:927–938.
Garcia-Salicetti, S., Beumier, C., Chollet, G., Dorizzi, B., Les Jardins, J. L., Lunter, J., Ni,
Y., and Petrovska-Delacrétaz, D. (2003). Biomet: a multimodal person authentication
database including face, voice, fingerprint, hand and signature modalities. In Interna-
tional Conference on Audio-and Video-based Biometric Person Authentication, pages
845–853. Springer.
Impedovo, D. and Pirlo, G. (2008). Automatic signature verification: the state of the art.
IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Re-
views), 38(5):609–635.
Jain, A. K., Ross, A., and Prabhakar, S. (2004). An introduction to biometric recognition.
IEEE Transactions on circuits and systems for video technology, 14(1):4–20.
Kalera, M. K., Srihari, S., and Xu, A. (2004). Offline signature verification and identifica-
tion using distance statistics. International Journal of Pattern Recognition and Artificial
Intelligence, 18(07):1339–1360.
Kholmatov, A. and Yanikoglu, B. (2009). SUSIG: an on-line signature database, associated
protocols and benchmark results. Pattern Analysis and Applications, 12(3):227–236.
Liwicki, M., Malik, M. I., Alewijnse, L., van den Heuvel, E., and Found, B. (2012).
ICFHR 2012 competition on automatic forensic signature verification (4NSIGCOMP
2012). In Frontiers in Handwriting Recognition (ICFHR), 2012 International Confer-
ence on, pages 823–828. IEEE.
Liwicki, M., Malik, M. I., van den Heuvel, C. E., Chen, X., Berger, C., Stoel, R., Blu-
menstein, M., and Found, B. (2011). Signature verification competition for online and
offline skilled forgeries (SigComp2011). In 2011 International Conference on Document
Analysis and Recognition, pages 1480–1484. IEEE.
Liwicki, M., van den Heuvel, C. E., Found, B., and Malik, M. I. (2010). Forensic signature
verification competition 4NSigComp2010-detection of simulated and disguised signa-
tures. In Frontiers in Handwriting Recognition (ICFHR), 2010 International Conference
on, pages 715–720. IEEE.
Malik, M. I., Ahmed, S., Dengel, A., and Liwicki, M. (2012). A signature verification
framework for digital pen applications. In Document Analysis Systems (DAS), 2012 10th
IAPR International Workshop on, pages 419–423. IEEE.
Malik, M. I., Ahmed, S., Marcelli, A., Pal, U., Blumenstein, M., Alewijns, L., and Li-
wicki, M. (2015). ICDAR2015 competition on signature verification and writer identifi-
cation for on-and off-line skilled forgeries (SigWIcomp2015). In Document Analysis and
Recognition (ICDAR), 2015 13th International Conference on, pages 1186–1190. IEEE.
Malik, M. I., Liwicki, M., Alewijnse, L., Ohyama, W., Blumenstein, M., and Found, B.
(2013). ICDAR 2013 Competitions on Signature Verification and Writer Identification
for On- and Offline Skilled Forgeries (SigWiComp 2013). In 2013 12th International
Conference on Document Analysis and Recognition, pages 1477–1483. IEEE.
Martinez-Diaz, M., Fierrez, J., Krish, R. P., and Galbally, J. (2014). Mobile signature
verification: feature robustness and performance comparison. IET Biometrics, 3(4):267–
277.
Nel, E.-M., Du Preez, J. A., and Herbst, B. M. (2005). Estimating the pen trajectories of
static signatures using Hidden Markov models. IEEE transactions on pattern analysis
and machine intelligence, 27(11):1733–1746.
Ortega-Garcia, J., Fierrez-Aguilar, J., Simon, D., Gonzalez, J., Faundez-Zanuy, M., Es-
pinosa, V., Satue, A., Hernaez, I., Igarza, J.-J., Vivaracho, C., et al. (2003). MCYT base-
line corpus: a bimodal biometric database. IEE Proceedings-Vision, Image and Signal
Processing, 150(6):395–401.
Pirlo, G., Rizzi, F., Vacca, A., and Impedovo, D. (2015). Interoperability of biometric sys-
tems: Analysis of geometric characteristics of handwritten signatures. In International
Conference on Image Analysis and Processing, pages 242–249. Springer.
Qiao, Y., Liu, J., and Tang, X. (2007). Offline signature verification using online handwrit-
ing registration. In 2007 IEEE Conference on Computer Vision and Pattern Recognition,
pages 1–8. IEEE.
Soleimani, A., Fouladi, K., and Araabi, B. N. (2016). UTSig: A Persian Offline Signature
Dataset. arXiv preprint arXiv:1603.03235.
van den Heuvel, C., Franke, K., and Vuurpijl, L. (2009). The ICDAR 2009 signature
verification competition. ICDAR2009 proceedings.
Vargas, J. F., Ferrer, M. A., Travieso, C. M., and Alonso, J. B. (2007). Off-line handwritten
signature GPDS-960 corpus. In Ninth International Conference on Document Analysis
and Recognition (ICDAR 2007).
Viard-Gaudin, C., Lallican, P. M., Knerr, S., and Binter, P. (1999). The ireste on/off (ironoff)
dual handwriting database. In Document Analysis and Recognition, 1999. ICDAR’99.
Proceedings of the Fifth International Conference on, pages 455–458. IEEE.
Yeung, D.-Y., Chang, H., Xiong, Y., George, S., Kashi, R., Matsumoto, T., and Rigoll, G.
(2004). SVC2004: First international signature verification competition. In Biometric
Authentication, pages 16–22. Springer.
Chapter 15

Processing of Handwritten Online Signatures: An Overview and Future Trends

Alessandro Balestrucci, Donato Impedovo and Giuseppe Pirlo
1. Introduction
In modern society, along with the growing need for secure personal recognition, there is
an increasing interest in biometric systems. In fact, while traditional techniques perform
personal recognition based on the possession of a token or the knowledge of something,
biometric techniques use physiological or behavioural traits of the individual. Physiological
traits are based on the measurement of physical characteristics of users, like fingerprint,
retina and iris, hand geometry and face. Behavioural traits are related to behavioural
characteristics of users, like voice, keystroke dynamics and handwritten signature (Boyer
et al., 2007; Phillips et al., 2000).
Today, in the era of the networked society, the handwritten signature is rightly considered
a very special biometric trait. The verification of a person's identity by signature analysis
does not involve an invasive measurement procedure, and people are accustomed to using
handwritten signatures as a means of personal verification in their daily lives. In addition,
handwritten signatures have long been an established means of personal identification, and
their use is well recognized by administrative and financial institutions (Plamondon and
Srihari, 2000; Vielhauer, 2005). On the basis of the data acquisition method, two categories
of systems for handwritten signature verification can be identified: static (off-line) systems
and dynamic (on-line) systems. Static systems use off-line acquisition devices, which
perform data acquisition after the writing process has been completed. Dynamic systems
use on-line acquisition devices that generate electronic signals representative of the
signature during the writing process (Impedovo and Pirlo, 2008; Plamondon and Lorette,
1989; Plamondon, 1994). Of course, the integration of low-cost systems for on-line
handwriting acquisition into a multitude of personal devices, like tablets, smartphones and
PDAs, makes on-line signature verification a relevant opportunity for a multitude of daily
activities (Plamondon et al., 2014). The interest in online signatures is also demonstrated
by the standardization of signature data interchange formats, which has been supported by
several national associations and international bodies, as well as by the definition of specific
regulations on signature-based personal identity verification systems and procedures
(ANSI, 2005; ISO, 2007).
Therefore, it is not surprising that in recent years more and more researchers from academia
and industry have devoted their efforts to the field. In fact, automatic signature verification
involves aspects of disciplines ranging from neuroscience to engineering, and from
computer science to human anatomy (Impedovo and Pirlo, 2008), since a handwritten
signature is the result of a complex process originating in the signer's brain as a motor
control “program” and implemented through the neuromuscular system (Plamondon, 1995;
Plamondon and Djioua, 2006).
This chapter provides an overview of the field of on-line signature verification. The chapter
is organised as follows: Section “Signature Verification System” describes the structure of
a generic system for automatic signature verification. Section “Data Acquisition and
Preprocessing” deals with the data acquisition process. The feature extraction phase is
discussed in Section “Feature Extraction”. Section “Verification” presents the main
classification techniques used in the field of automatic signature verification. In Section
“On-line Signature Verification Systems”, a comparative analysis of state-of-the-art systems
for on-line signature verification is presented. Section “Discussion and Conclusion”
concludes the chapter and provides some considerations on the most valuable directions for
future research in the field.
of the distance between the test signature and one or more reference signatures. Model-
based approaches verify the test signature by estimating how much it fits the signature
reference model of the user (Impedovo and Pirlo, 2008; Plamondon and Lorette, 1989).
two or more signatures can be segmented into the same number of segments that correspond
more or less perfectly (Dimauro et al., 1993; Lee et al., 2004; Yue and Wijesoma, 2000).
A model-guided segmentation technique has also been proposed, which uses DTW to seg-
ment an input signature based on its correspondence to the reference model (Impedovo and
Pirlo, 2008).
4. Feature Extraction
Two types of feature can be considered for signature verification: functions or parameters.
When function features are used, the signature is characterized by a time function, the
values of which constitute the feature set. When parameter features are used, the signature
is characterized as a vector of parameters, each of which represents the value of one feature
(Plamondon and Lorette, 1989).
Typical function features are the position, velocity and acceleration functions, as well as
the pressure, force and direction of pen movements. The position function is provided
directly by the acquisition device, whereas the velocity and acceleration functions can be
derived numerically from position (Di Lecce et al., 1999; Wu et al., 1997b). Specific
devices are now available to capture the pressure and force functions directly during the
signing process (Wang et al., 2010), and the pressure function has been used in several
systems (Hongo et al., 2005; Huang and Yan, 2003; Komiya et al., 2001; Ortega-Garcia
et al., 2003a; Qu et al., 2003; Schmidt and Kraiss, 1997). The direction of pen movements
and the pen inclination have also been considered for automatic signature verification
(Igarza et al., 2003; Ortega-Garcia et al., 2003a).
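Since velocity and acceleration are typically obtained by numerical differentiation of the sampled position, a minimal sketch of this derivation is given below; the function name and the use of numpy are assumptions made for the example:

import numpy as np

def kinematics(x, y, t):
    # velocity and acceleration derived numerically from sampled positions;
    # np.gradient handles possibly non-uniform time stamps t
    vx = np.gradient(x, t)
    vy = np.gradient(y, t)
    ax = np.gradient(vx, t)
    ay = np.gradient(vy, t)
    speed = np.hypot(vx, vy)  # tangential speed |v|
    return vx, vy, ax, ay, speed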
Concerning the consistency of features for on-line signature verification, a comparative
study has demonstrated that the position, velocity and pen inclination functions can be
considered among the most consistent when a distance-based consistency model is applied
(Lei and Govindaraju, 2005). A robustness analysis of function features by personal
entropy, with respect to short-term and long-term variability, demonstrated that position is
the most robust feature (Houmani et al., 2008).
Typical parameter features are the total signature duration (Kashi et al., 1998; Lee et al.,
1996; Qu et al., 2003), the number of pen lifts (pen-downs, pen-ups) (Lee et al., 1996; Qu
et al., 2003), the pen-down time ratio (Kashi et al., 1998; Nelson et al., 1994), and
parameters derived from the analysis of the direction, curvature and moments of the
signature trace. Another wide set of parameters can be derived numerically from the
representative time functions of a signature: the average (AVE), root mean square (RMS),
maximum (MAX) and minimum (MIN) values of position, displacement, speed and
acceleration (Lee et al., 1996; Nelson et al., 1994; Qu et al., 2003). Correlations and time-
dependent relations between function features can also be considered as parameter features
(Kashi et al., 1998; Lee et al., 1996; Nelson et al., 1994), as can coefficients derived from
Fourier (Dimauro et al., 1994; Wu et al., 1997a, 1998) and wavelet (Lejtman and George,
2001; Ortega-Garcia et al., 2003a; Nakanishi et al., 2004) transforms.
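As an illustration of how such parameter features reduce a signature to a fixed-length vector, the following sketch computes a few of the parameters mentioned above; the pen_down indicator and the exact feature set are assumptions made for the example, not a prescribed feature list:

import numpy as np

def parameter_features(x, y, t, pen_down):
    # pen_down: boolean array, True while the pen touches the surface
    x, y, t = map(np.asarray, (x, y, t))
    pen_down = np.asarray(pen_down, dtype=bool)

    duration = t[-1] - t[0]  # total signature duration
    pen_ups = int(np.sum(pen_down[:-1] & ~pen_down[1:]))  # True -> False transitions
    pen_down_ratio = float(pen_down.mean())  # fraction of time on the surface

    speed = np.hypot(np.gradient(x, t), np.gradient(y, t))
    return {
        "duration": duration,
        "pen_ups": pen_ups,
        "pen_down_ratio": pen_down_ratio,
        "speed_ave": speed.mean(),
        "speed_rms": float(np.sqrt(np.mean(speed ** 2))),
        "speed_max": speed.max(),
        "speed_min": speed.min(),
    }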
Some major function and parameter features for on-line signature verification are listed
in Table 1.
Of course, features can be considered at the global or local level. At the global level,
features reflect the holistic characteristics of the signature; at the local level, features
describe very specific characteristics of a signature region.
Table 1. Feature Types

Function Features
Position: (Hongo et al., 2005), (Komiya et al., 2001), (Ortega-Garcia et al., 2003a), (Wu et al., 1997b).
Velocity: (Di Lecce et al., 2000, 1999), (Fuentes et al., 2002), (Huang and Yan, 1995), (Jain et al., 2002), (Ortega-Garcia et al., 2003a), (Qu et al., 2003), (Schmidt and Kraiss, 1997), (Wu et al., 1997a).
Acceleration: (Schmidt and Kraiss, 1997).
Pressure: (Hongo et al., 2005), (Huang and Yan, 1995), (Komiya et al., 2001), (Ortega-Garcia et al., 2003a), (Qu et al., 2003), (Schmidt and Kraiss, 1997).
Direction of pen movement: (Fuentes et al., 2002), (Igarza et al., 2003).
Pen inclination: (Igarza et al., 2003), (Komiya et al., 2001), (Martens and Claesen, 1998), (Ortega-Garcia et al., 2003a).

Parameter Features
Total signature duration: (Kashi et al., 1998), (Lee et al., 2004), (Lee et al., 1996), (Nelson et al., 1994), (Qu et al., 2003).
Pen-down time ratio: (Kashi et al., 1998), (Nelson et al., 1994).
Number of pen-ups/pen-downs: (Lee et al., 2004), (Lee et al., 1996), (Qu et al., 2003).
Direction-based: (Kashi et al., 1998), (Lee et al., 2004), (Nelson et al., 1994), (Qu et al., 2003), (Zou et al., 2003).
Curvature-based: (Jain et al., 2002).
Moment-based: (Kashi et al., 1998).
AVE/RMS/MAX/MIN of position, displacement, speed, acceleration: (Fuentes et al., 2002), (Kashi et al., 1998), (Khan et al., 2006), (Lee et al., 2004), (Lee et al., 1996), (Nelson et al., 1994), (Qu et al., 2003).
Duration of positive/negative position, displacement, speed, acceleration: (Kashi et al., 1998), (Lee et al., 2004), (Lee et al., 1996), (Nelson et al., 1994).
X-Y correlation of position, displacement, speed, acceleration: (Fuentes et al., 2002), (Kashi et al., 1998), (Nelson et al., 1994).
Fourier transform: (Dimauro et al., 1994), (Wu et al., 1997a, 1998).
Wavelet transform: (Lejtman and George, 2001), (Martens and Claesen, 1998), (Nakanishi et al., 2004).
For instance, function features can be considered at the global level, i.e., for the whole
signature, or at the local level, i.e., for individual signature segments. Concerning
parameters, typical global parameters are the total duration, the number of pen lifts and the
number of components of the signature. Typical local parameters are related to direction-
based, curvature-based and moment-based features estimated at the regional level of a
signature (Impedovo and Pirlo, 2008).
5. Verification
The aim of the verification phase is to evaluate the authenticity of a test signature by
matching its features against those stored in the knowledge base developed during the
enrolment stage. The result concerning the authenticity of a signature can be provided as a
Boolean value or as a real value, when a confidence level for the decision is required. In
the literature, two main types of comparison techniques have been considered: distance-
based and model-based techniques (Impedovo and Pirlo, 2008; Plamondon and Lorette, 1989).
When function features are considered, Dynamic Time Warping (DTW) has been the most
widely exploited technique for signature comparison, since it allows the time axes of two
time sequences representing a signature to be compressed or expanded locally, in order to
minimize a given distance value (Impedovo and Pirlo, 2008; Parizeau and Plamondon,
1990). In order to reduce the computational cost of the comparison, several advanced DTW
strategies for data reduction have been proposed, based on Genetic Algorithms (GA),
Principal or Minor Component Analysis (PCA/MCA) and Linear Regression (LR)
(Impedovo and Pirlo, 2008). When parameters are considered as features, both the
Euclidean (Khan et al., 2006) and the Mahalanobis (Martens and Claesen, 1998; Nyssen
et al., 2002) distance have been used for matching the target specimen to the parameter-
based model of the declared signer's signatures. Similarity measures (Wu et al., 1997a),
split-and-merge strategies (Wu et al., 1997b) and string matching (Chen and Ding, 2002)
have also been considered for signature comparison. Another effective approach for on-line
signature verification uses Support Vector Machines (SVMs), which map input vectors to
a higher-dimensional space in which clusters may be separated by a maximal-margin
hyperplane (Fuentes et al., 2002; Kholmatov and Yanikoglu, 2005).
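A minimal sketch of the basic DTW recurrence, together with the common minimum-distance acceptance rule, is shown below; the Euclidean local distance and the threshold-based decision are illustrative choices, not the specific configuration of any of the systems cited above:

import numpy as np

def dtw_distance(a, b):
    # a, b: arrays of shape (n, d) and (m, d), e.g. per-point (x, y, pressure)
    # rows of two signatures; plain O(n*m) dynamic programming
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])  # local distance
            # extend the cheapest of the three allowed warping steps
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def accept(test, references, threshold):
    # a test signature is accepted when its minimum DTW distance to the
    # reference set falls below a (possibly writer-dependent) threshold
    return min(dtw_distance(test, r) for r in references) <= threshold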
Some of the most valuable model-based techniques for signature comparison concern
neural networks: multi-layer perceptrons (MLP), time-delay neural networks,
backpropagation networks and self-organizing maps (Fuentes et al., 2002; Huang and Yan,
1995; Lejtman and George, 2001; Wessels and Omlin, 2000). Of course, the use of neural
networks for on-line signature verification can be limited by the fact that they generally
require large amounts of learning data, which are not available in many applications
(Impedovo and Pirlo, 2008; Leclerc and Plamondon, 1994). Another model-based
comparison technique that is very effective for signature modelling uses Hidden Markov
Models (HMMs). Studies have found that HMMs are highly adaptable to personal
variability (Fuentes et al., 2002; Martínez Díaz et al., 2008; Van et al., 2007), and therefore
they can support effective signature modelling techniques (Huang and Yan, 2003). Although
both Left-to-Right (LR) and Ergodic (E) topologies have been considered in the literature
for on-line signature verification, the LR topology is generally considered best suited to
signature modelling (Igarza et al., 2003; Woch et al., 2011; Zou et al., 2003).
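The Left-to-Right constraint simply forbids backward transitions, so each state can only persist or advance; a small sketch of such a transition matrix (with an assumed self-loop probability) is given below:

import numpy as np

def left_to_right_transitions(n_states, p_stay=0.6):
    # each state may only persist (self-loop) or advance to the next state,
    # matching the strictly ordered production of signature strokes
    A = np.zeros((n_states, n_states))
    for i in range(n_states - 1):
        A[i, i] = p_stay          # remain in the current stroke model
        A[i, i + 1] = 1 - p_stay  # move forward; no backward transitions
    A[-1, -1] = 1.0               # absorbing final state
    return A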
Table 2 shows some of the most valuable distance-based and model-based techniques
for signature comparison.
Multi-expert approaches based on abstract-level, ranked-level and measurement-level
combination methods have also been considered in order to improve verification
performance (Hongo et al., 2005; Nanni et al., 2010; Nanni, 2006). In particular, multi-
expert approaches have been used to implement top-down and bottom-up signature
verification strategies (Impedovo and Pirlo, 2007; Pirlo, 1994).
Table 2. Comparison Techniques
When bottom-up strategies are considered, a signature is verified starting with the analysis
of its basic elements, like strokes or components. This approach can lead to lower error
rates compared to global approaches, since a large amount of personal information is
conveyed in specific parts of the signature and cannot be detected when the signature is
viewed as a whole (Brault and Plamondon, 1993b; Dimauro et al., 1993, 1994; Schmidt and
Kraiss, 1997). When top-down verification strategies are used, hybrid topologies have been
shown to combine the performance advantages of serial approaches in quickly rejecting
very poor forgeries with the reliability of parallel combination schemes. For example,
multi-level verification systems first verify the structural organization of a target signature
and then analyse in detail the authenticity of its individual elements (Dimauro et al., 1994;
Huang and Yan, 2003).
Table 3 reports some on-line signature verification systems using distance-based
classification techniques. Arora et al. (2014) used both the discrete fractional cosine
transform (DFrCT) and the discrete cosine transform (DCT) for feature extraction; the
experimental results demonstrate the superiority of DFrCT-based features with respect to
DCT-based features. DTW was used by Bovino et al. (2003), who presented a multi-expert
system based on a stroke-oriented description of signatures. Each stroke was analysed in
the position, velocity and acceleration domains, and a two-level scheme was used for
decision combination: for each stroke, soft- and hard-combination rules were used at the
first level to combine decisions obtained by DTW from the different representation
domains, and simple and weighted averaging was used at the second level to combine
decisions from different parts of the signature. Di Lecce et al. (2000) performed signature
verification by combining the efforts of three experts: the first expert used shape-based
features and performed signature verification by means of a global analysis, while the
second and third experts used speed-based features and a regional analysis. The expert
decisions were combined by a majority voting strategy. The system proposed by Gomez-
Barrero et al. (2015) combined the high performance of DTW-based systems in verification
tasks with the high potential of the features of the Sigma-LogNormal model for skilled
forgery detection.
Griechisch et al. (2014) used position-based, velocity-based and pressure features and
exploited the potential of the Kolmogorov-Smirnov statistic in on-line signature
verification: signature comparison is performed by a method based on the distribution
distance determined by applying the Kolmogorov-Smirnov test. Guru and Prakash (2009)
represented on-line signatures by interval-valued symbolic features. They used parameter-
based features derived from a global analysis of signatures and achieved the best verification
results when a writer-dependent threshold was adopted for distance-based classification.
Jain et al. (2002) used a set of local parameters describing both spatial and temporal
information. In the verification process, string matching was used to compare the test
signature to all the signatures in the reference set. Three methods were investigated to
combine the individual dissimilarity values into one value: the minimum, the average and
the maximum of all the dissimilarity values. Common and personalized (signer-dependent)
thresholds were also considered; the best results were achieved by taking the minimum of
all the dissimilarity values combined with personalized threshold values.
In the system by Nakanishi et al. (2004), the position signals of an on-line signature
are decomposed into sub-band signals using the Discrete Wavelet Transform (DWT), and
Dynamic Programming is used for signature matching. The system by Pirlo et al. (2015a)
considered four domains of representation of an on-line signature: position, velocity,
acceleration and pressure. They used stability information to detect the most profitable
domain of representation of a signature for verification purposes, according to a local
analysis strategy (Impedovo and Pirlo, 2010). Subsequently, local verification decisions
obtained by DTW were combined to provide the verification decision for the entire
signature. Wibowo et al. (2014) considered stable features of the signature, namely the K-L
coefficients of the forward and backward variances between the reference signature and the
signature to be verified. Fourier analysis was applied by Wu et al. (1998) for on-line
signature verification: they extracted and used cepstrum coefficients for verification,
according to a dynamic similarity measure. Yeung et al. (2004) reported the results of the
First International Signature Verification Competition (SVC2004), in which teams from all
over the world participated. SVC2004 considered two separate signature verification tasks
using two different signature databases: the signature data for the first task contained
position information only, while the signature data for the second task contained position,
pen inclination and pressure information. In both cases, DTW-based approaches provided
the best results.
The system presented by Nyssen et al. (2002) used global, local and function features.
In the first verification stage, a parameter-based method was implemented, in which the
Mahalanobis distance was used as a measure of dissimilarity between signatures. The
second verification stage involved corner extraction and corner matching, as well as
signature segmentation. In the third verification stage, an elastic matching algorithm was
used, and a point-to-point correspondence was established between the compared
signatures. By combining the three different types of verification, a high security level was
reached. Zou et al. (2003) used local shape analysis for on-line signature verification.
Specifically, the FFT was used to derive spectral and tremor features from well-defined
segments of the signature, and a weighted distance was finally used to combine the
similarity values derived from the various feature sets.
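For the Kolmogorov-Smirnov comparison described above (Griechisch et al., 2014), the dissimilarity between two signatures can be expressed as the two-sample KS statistic computed over a feature distribution; a minimal sketch using scipy, with speed chosen as the illustrative feature, is:

import numpy as np
from scipy.stats import ks_2samp

def ks_dissimilarity(test_speed, ref_speed):
    # two-sample Kolmogorov-Smirnov statistic between the empirical
    # distributions of a feature (here: speed); small values indicate
    # similar distributions for the test and reference signatures
    return ks_2samp(np.asarray(test_speed), np.asarray(ref_speed)).statistic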
Table 4 reports some on-line signature verification systems using model-based
classification techniques. Igarza et al. (2003) used a Left-to-Right HMM for on-line
signature verification and demonstrated its superiority over Ergodic HMMs. The superiority
of Principal Component Analysis and Minor Component Analysis over DTW and
Euclidean-based verification for on-line signature verification was also investigated and
demonstrated by Igarza et al. (2004). The on-line signature verification system proposed by
Kashi et al. (1997) used a Fourier-transform-based normalization technique and both global
and local features for signature modelling. The global features captured the spatial and
temporal information of the signature, while the local features, extracted by a Left-to-Right
HMM, captured the dynamics of the signature production process. The verification result
was achieved by combining the information derived from both global and local features.
Lee et al. (2004) performed signature verification by means of a back-propagation neural
network, which integrated verification results at segment level using a bottom-up strategy.
They performed signature segmentation by means of a DP technique based on geometric
extrema; segment matching was performed using global features and a DP approach.
Martínez Díaz et al. (2008) presented a Left-to-Right HMM-based signature verification
system for handheld devices. Signatures were captured by a PDA and described by the
position function only; the best results were achieved by user-dependent HMM modelling.
Muramatsu and Matsumoto (2003) used an LR-HMM to incorporate signature trajectories
for on-line signature verification. Individual features were extracted as high-frequency
signals in sub-bands, and the total decision for verification was obtained by averaging the
verification results achieved in each sub-band. Ortega-Garcia et al. (2003a) presented an
investigation of the ability of LR-HMMs to model the signing process, based on a set of 24
function features (8 basic function features and their first and second derivatives). In Shafiei
and Rabiee's system (Shafiei and Rabiee, 2003), each signature was segmented using its
perceptually important points; for every segment, four dynamic and three static parameters
were extracted, which are scale- and displacement-invariant, and an HMM was used for
signature verification. Swanepoel and Coetzer (2014) used pen positions, pressure and pen
tilt as features and adopted a dynamic-time-warping-based dichotomy transformation and a
writer-specific dissimilarity normalisation technique for signature representation. Signature
comparison is performed by a support vector machine with linear and radial basis function
kernels; the results demonstrate that the non-linear kernel significantly outperforms the
linear one.
Table 3. System Performances: distance-based techniques
Abbreviations: Full Database (FD), Signature (S), Genuine Signatures (G), Forgeries (F), Random Forgeries (RF), Simple Forgeries (SF), Skilled Forgeries (SK), Number of Authors (A).

Matching technique | Main features | Database | Results | Reference
Euclidean distance | Test 1: DFrCT; Test 2: DCT | FD (SVC2004): 100(G) (20(G)x5(A)), 100(F) (20(F)x5(A)) | Test 1: EER 5%; Test 2: EER 7.04% | (Arora et al., 2014)
DTW (ME by simple averaging) | Position, velocity, acceleration | Training: 45(G) (3(G)x15(A)); Test: 750(G) (50(G)x15(A)), 750(F) (50(F)x15(A)) | EER 0.4% | (Bovino et al., 2003)
DTW (ME by majority voting) | Shape-based features (segmentation-dependent), velocity | Training: 45(G) (3(G)x15(A)); Test: 750(G) (50(G)x15(A)), 750(F) (50(F)x15(A)) | FRR 3.2%; FAR 0.55% | (Di Lecce et al., 2000)
Test 1: DTW; Test 2: DTW + SL | Log-normal-based features | Training set: 16(G)x50(A), 12(SF)x50(A); Development set: 16(G)x100(A), 12(SF)x100(A); Test set: 16(G)x250(A), 12(SF)x250(A) | Test 1: EER 5.80% (SF), 1.07% (RF); Test 2: EER 4.77% (SF), 0.50% (RF) | (Gomez-Barrero et al., 2015)
Kolmogorov-Smirnov distance | Position-based, velocity-based, pressure | FD: 12(G)x10(A), 24(F)x10(A) | EER 13% | (Griechisch et al., 2014)
Distance-based | Total signature duration, number of pen-ups, STD of velocity and acceleration in x and y directions, number of local maxima in x direction, etc. | FD1: Training 2000(G) (20(G)x100(A)); Test 500(G) (5(G)x100(A)), 9900(RF) (99(RF)x100(A)), 500(SK) (5(SK)x100(A)). FD2: Training 6600(G) (20(G)x330(A)); Test 1650(G) (5(G)x330(A)), 75570(RF) (229(RF)x330(A)), 8250(SK) (25(SK)x330(A)) | FD1: EER 3.80% (SK), 1.65% (RF); FD2: EER 4.70% (SK), 1.67% (RF) | (Guru and Prakash, 2009)
DTW | Velocity, pressure | FD: 4600(S) | EER 4% | (Huang and Yan, 1995)
String matching | Velocity, curvature-based | FD: 1232(S) (from 102(A)) | FRR 3.3%, FAR 2.7% (common threshold); FRR 2.8%, FAR 1.6% (writer-dependent threshold) | (Jain et al., 2002)
DTW (PCA, MCA) | Position, velocity | Training: 405(G) (5(G)x81(A)); Test: 405(G) (5(G)x81(A)), 405(F) (5(F)x81(A)) | EER 5.00% | (Li et al., 2004)
Dynamic programming | Wavelet transform | Training: 20(G) (from 4(A)); Test: 98(G) (from 4(A)), 200(F) (from 5(A)) | EER 4% | (Nakanishi et al., 2004)
DTW | Position, velocity, acceleration, pressure | Training: 15(G)x100(A), 15(SF)x100(A); Test: 15(G)x100(A), 5(G)x100(A) | FRR 2.15%; FAR 2.10% | (Pirlo et al., 2015a)
Euclidean norm | K-L coefficients | FD: 5000(S) = 2500(G) (25(G)x100(A)) + 2500(F) (25(F)x100(A)); signatures are from MCYT-100, with 5 signatures per signer used as reference | EER 4.49% | (Wibowo et al., 2014)
Membership function | Total signature time, AVE/RMS speed, pressure, direction-based, number of pen-ups/pen-downs, etc. | Test: 60(G), 60(F) | FRR 6.67%; FAR 1.67% | (Qu et al., 2003)
Dynamic similarity measure | Fourier transform (cepstrum coefficients) | Training: 270(G) (from 27(A)); Test: 560(G) (from 27(A)), 650(F) | FRR 1.4%; FAR 2.8% | (Wu et al., 1997a)
DTW (best result) | Task 1: position; Task 2: position, pen inclination (azimuth), pressure, etc. | Training: Task 1: 800(G) (20(G)x40(A)), 800(SK) (20(SK)x40(A)); Test 1: 600(G) (10(G)x60(A)), 1200(SK) (20(SK)x60(A)); Test 2: 600(G) (10(G)x60(A)), 1200(RF) (20(RF)x60(A)) | Test 1: EER 2.84% (Task 1), 2.89% (Task 2); Test 2: EER 2.79% (Task 1), 2.51% (Task 2) | (Yeung et al., 2004) (1st Signature Verification Competition)
Euclidean distance, Mahalanobis distance, DTW | Geometric-based, curvature-based | FD: 306(G), 302(F) | FRR 5.8%; FAR 0% | (Nyssen et al., 2002)
Membership function | Speed, pressure, direction-based, Fourier transform | FD: 1000(G), 10000(F) | FRR 11.30%; FAR 2.00% | (Zou et al., 2003)
Wessels and Omlin (2000) combined a Kohonen self-organizing feature map and an HMM;
both Left-to-Right and Ergodic models were considered. Yang et al. (1995) used directional
features along with several HMM topologies for signature modelling. The results
demonstrated that the LR-HMM is superior to other topologies in capturing the individual
features of the signatures, while at the same time accepting variability in signing. A polar
coordinate system was considered for signature representation by Yoon et al. (2002), whose
aim was to reduce normalization error and computing time. Signature modelling and
verification were performed by HMMs, which demonstrated their ability to capture the
local characteristics in the time-sequence data, as well as their flexibility in modelling
signature variability.
time the point is a DMP when the signature is matched against other genuine signatures.
Following this procedure, low- and high-stability regions can be identified and used to
adapt system behavior to the specific characteristics of the signature (Congedo et al., 1994,
1995; Dimauro et al., 2002). The variability of handwritten patterns has also been estimated
using the Kinematic Theory of human movements (Elliott, 2004) or through a client-entropy
measure based on local density estimation by an HMM (Salicetti et al., 2008). In particular,
the entropy-based measure is used to assess whether or not a signature contains enough
information to be successfully used for personal verification (Salicetti et al., 2008; Houmani
et al., 2008). Stability can also be estimated by the analysis of common features extracted
from the signature, in order to obtain global information on signature repeatability (Galbally
et al., 2009a; Guest, 2004). These approaches show that a set of features exists that remains
stable over long time periods, while other features change significantly over time (Houmani
et al., 2008, 2009; Lei and Govindaraju, 2005). This information can be very useful for
improving verification performance over time.
In general, stability analysis is used to estimate both short-term and long-term
modifications of a signature. Short-term modifications depend on the psychological
condition of the writer and on the writing conditions. Information about short-term
modifications can be used to select the best subset of reference signatures and the most
effective feature functions for signature verification, while also providing useful
information to weight the verification decisions obtained at the stroke level. Long-term
modifications depend on the alteration of the physical writing system of the signer (arm,
hand, etc.), as well as on the modification of the motor control “program” in his/her brain
(Plamondon et al., 2014).
Concerning complexity, it is worth noting that no common definition of handwriting
complexity has been established yet. Nevertheless, it is generally argued that the complexity
of a signature can be critical to the reliability of the examination process (Huber and
Headrick, 1999). In general, in signature analysis, signature complexity can be regarded as
an estimate of the difficulty of imitating the signature. Signature complexity can be seen as
the result of the difficulty in perceiving, preparing and executing each stroke of the signature
itself (Brault and Plamondon, 1993a). A complexity theory based on the theoretical
relationship between the complexity of features of the handwriting process and the number
of concatenated strokes has also been considered for complexity estimation. According to
this theory, signature complexity can be estimated by analyzing variables that are indirectly
related to the number of concatenated strokes, such as the number of turning points, the
number of feathering points, and the number of intersections and retraces (Found and
Rogers, 1995).
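As a small illustration of this idea, the number of turning points, one of the variables mentioned above, can be approximated by counting direction reversals of the pen along one axis; the following sketch is an assumption-laden simplification rather than the procedure of Found and Rogers (1995):

import numpy as np

def count_turning_points(x):
    # count direction reversals of the pen along the x axis;
    # zero displacements are ignored
    dx = np.diff(np.asarray(x, dtype=float))
    signs = np.sign(dx[dx != 0])
    return int(np.sum(signs[:-1] != signs[1:]))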
In addition, it is worth noting that, as the number of devices available for signature
acquisition continuously grows, device interoperability is an increasingly relevant issue that
needs specific research. In fact, signature signals can change significantly depending on the
type of acquisition device, the writing area and the stylus type, as well as on the basis of
modifications of personal characteristics (Guest, 2006; Guest and Fairhurst, 2006; Impedovo
and Pirlo, 2008; Maiorana et al., 2010).
Finally, new and interesting research directions have been devoted to the development of
handwriting-based cryptosystems (Uludag et al., 2004), as well as to the use of handwritten
signatures to discriminate the health conditions of the subject. Recent studies have been
devoted to the analysis of handwriting (Djioua and Plamondon, 2009; Longstaff and Heath,
2006; O'Reilly and Plamondon, 2009, 2012a; Woch et al., 2011), but some research is
devoted to the use of signatures to detect brain stroke risk factors (O'Reilly and Plamondon,
2011, 2012b) or to the diagnosis of Parkinson's (Rosenblum et al., 2013; Van Gemmert
et al., 2003) and Alzheimer's (Impedovo et al., 2013; Pirlo et al., 2015b; Yan et al., 2008)
diseases.
References
Alonso-Fernandez, F., Fierrez, J., Gilperez, A., Galbally, J., and Ortega-Garcia, J. (2009).
Robustness of signature verification systems to imitators with increasing skills. In Doc-
ument Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on,
pages 728–732. IEEE.
Arora, M., Singh, K., and Mander, G. (2014). Discrete fractional cosine transform based
online handwritten signature verification. In Engineering and Computational Sciences
(RAECS), 2014 Recent Advances in, pages 1–6. IEEE.
Ballard, L., Lopresti, D., and Monrose, F. (2007). Forgery quality and its implications for
behavioral biometric security. IEEE Transactions on Systems, Man, and Cybernetics,
Part B (Cybernetics), 37(5):1107–1118.
Bovino, L., Impedovo, S., Pirlo, G., and Sarcinella, L. (2003). Multi-expert verification of
hand-written signatures. In ICDAR, volume 3, pages 932–936.
Boyer, K. W., Govindaraju, V., and Ratha, N. K. (2007). Special issue on recent advances in
biometric systems. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 37(5).
Brault, J.-J. and Plamondon, R. (1993b). Segmenting handwritten signatures at their per-
ceptually important points. IEEE Transactions on Pattern Analysis and Machine Intelli-
gence, 15(9):953–957.
Bunke, H., Von Siebenthal, T., Yamasaki, T., and Schenkel, M. (1999). Online handwriting
data acquisition using a video camera. In Document Analysis and Recognition, 1999.
ICDAR’99. Proceedings of the Fifth International Conference on, pages 573–576. IEEE.
Chen, Y. and Ding, X. (2002). On-line signature verification using direction sequence string
matching. In Second International Conference on Image and Graphics, pages 744–749.
International Society for Optics and Photonics.
Congedo, G., Dimauro, G., Forte, A., Impedovo, S., and Pirlo, G. (1995). Selecting refer-
ence signatures for on-line signature verification. In International Conference on Image
Analysis and Processing, pages 521–526. Springer.
Congedo, G., Dimauro, G., Impedovo, S., and Pirlo, G. (1994). A new methodology for the
measurement of local stability in dynamical signatures. In 4th International Workshop
on Frontiers in Handwriting Recognition, pages 135–144.
Di Lecce, V., Dimauro, G., Guerriero, A., Impedovo, S., Pirlo, G., and Salzo, A. (1999).
Image basic features indexing techniques for video skimming. In Image Analysis and
Processing, 1999. Proceedings. International Conference on, pages 715–720. IEEE.
Di Lecce, V., Dimauro, G., Guerriero, A., Impedovo, S., Pirlo, G., and Salzo, A. (2000).
A multi-expert system for dynamic signature verification. In International Workshop on
Multiple Classifier Systems, pages 320–329. Springer.
Dimauro, G., Impedovo, S., Modugno, R., Pirlo, G., and Sarcinella, L. (2002). Analysis of
stability in hand-written dynamic signatures. In Frontiers in Handwriting Recognition,
2002. Proceedings. Eighth International Workshop on, pages 259–263. IEEE.
Dimauro, G., Impedovo, S., and Pirlo, G. (1993). A signature verification system based
on dynamical segmentation technique. In Proceedings of the International Workshop on
Frontiers in Handwriting Recognition, pages 262–271.
Dimauro, G., Impedovo, S., and Pirlo, G. (1994). Component-oriented algorithms for signa-
ture verification. International Journal of Pattern Recognition and Artificial Intelligence,
8(03):771–793.
Djioua, M. and Plamondon, R. (2009). A new algorithm and system for the characterization
of handwriting strokes with delta-lognormal parameters. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 31(11):2060–2072.
Dolfing, J., Aarts, E., and Van Oosterhout, J. (1998). On-line signature verification with
hidden Markov models. In Proc. 14th International Conference on Pattern Recognition,
pages 1–309.
Feng, H. and Wah, C. C. (2003). Online signature verification using a new extreme points
warping technique. Pattern Recognition Letters, 24(16):2943–2951.
Fuentes, M., Garcia-Salicetti, S., and Dorizzi, B. (2002). On line signature verification:
Fusion of a hidden markov model and a neural network via a support vector machine. In
Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth International Work-
shop on, pages 253–258. IEEE.
Galbally, J., Fierrez, J., Martinez-Diaz, M., and Ortega-Garcia, J. (2009a). Evaluation of
brute-force attack to dynamic signature verification using synthetic samples. In Doc-
ument Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on,
pages 131–135. IEEE.
Galbally, J., Fierrez, J., Martinez-Diaz, M., and Ortega-Garcia, J. (2009b). Improving
the enrollment in dynamic signature verfication with synthetic samples. In Document
Analysis and Recognition, 2009. ICDAR’09. 10th International Conference on, pages
1295–1299. IEEE.
Galbally, J., Fierrez, J., Martinez-Diaz, M., Ortega-Garcia, J., Plamondon, R., and O’Reilly,
C. (2010). Kinematical analysis of synthetic dynamic signatures using the sigma-
lognormal model. In Frontiers in Handwriting Recognition (ICFHR), 2010 International
Conference on, pages 113–118. IEEE.
Galbally, J., Fierrez, J., Ortega-Garcia, J., and Plamondon, R. (2012a). Synthetic on-line
signature generation. part ii: Experimental validation. Pattern Recognition, 45(7):2622–
2632.
Galbally, J., Plamondon, R., Fierrez, J., and Ortega-Garcia, J. (2012b). Synthetic on-
line signature generation. part i: Methodology and algorithms. Pattern Recognition,
45(7):2610–2621.
Garcia-Salicetti, S., Beumier, C., Chollet, G., Dorizzi, B., Les Jardins, J. L., Lunter, J., Ni,
Y., and Petrovska-Delacrétaz, D. (2003). Biomet: A multimodal person authentication
database including face, voice, fingerprint, hand and signature modalities. In Interna-
tional Conference on Audio-and Video-based Biometric Person Authentication, pages
845–853. Springer.
Gomez-Barrero, M., Galbally, J., Fierrez, J., Ortega-Garcia, J., and Plamondon, R. (2015).
Enhanced on-line signature verification based on skilled forgery detection using sigma-
lognormal features. In Biometrics (ICB), 2015 International Conference on, pages 501–
506. IEEE.
Griechisch, E., Malik, M. I., and Liwicki, M. (2014). Online signature verification based
on kolmogorov-smirnov distribution distance. In Frontiers in Handwriting Recognition
(ICFHR), 2014 14th International Conference on, pages 738–742. IEEE.
Guest, R. (2006). Age dependency in handwritten dynamic signature verification systems.
Pattern Recognition Letters, 27(10):1098–1104.
Guest, R. and Fairhurst, M. (2006). Sample selection for optimising signature enrolment.
In Tenth International Workshop on Frontiers in Handwriting Recognition. Suvisoft.
Guest, R. M. (2004). The repeatability of signatures. In Frontiers in Handwriting Recog-
nition, 2004. IWFHR-9 2004. Ninth International Workshop on, pages 492–497. IEEE.
Guru, D. and Prakash, H. (2009). Online signature verification and recognition: An ap-
proach based on symbolic representation. IEEE transactions on pattern analysis and
machine intelligence, 31(6):1059–1073.
Hongo, Y., Muramatsu, D., and Matsumoto, T. (2005). Adaboost-based on-line signature
verifier. In Defense and Security, pages 373–380. International Society for Optics and
Photonics.
Houmani, N., Garcia-Salicetti, S., and Dorizzi, B. (2008). A novel personal entropy mea-
sure confronted with online signature verification systems’ performance. In Biometrics:
Theory, Applications and Systems, 2008. BTAS 2008. 2nd IEEE International Confer-
ence on, pages 1–6. IEEE.
Houmani, N., Garcia-Salicetti, S., and Dorizzi, B. (2009). On assessing the robustness
of pen coordinates, pen pressure and pen inclination to time variability with personal
entropy. In Biometrics: Theory, Applications, and Systems, 2009. BTAS’09. IEEE 3rd
International Conference on, pages 1–6. IEEE.
Huang, K. and Yan, H. (1995). On-line signature verification based on dynamic segmenta-
tion and global and local matching. Optical Engineering, 34(12):3480–3487.
Huang, K. and Yan, H. (2003). Stability and style-variation modeling for on-line signature
verification. Pattern Recognition, 36(10):2253–2270.
Ibrahim, M. T., Kyan, M., Khan, M. A., and Guan, L. (2010). On-line signature verification
using 1-d velocity-based directional analysis. In Pattern Recognition (ICPR), 2010 20th
International Conference on, pages 3830–3833. IEEE.
Igarza, J. J., Goirizelaia, I., Espinosa, K., Hernáez, I., Méndez, R., and Sánchez, J. (2003).
Online handwritten signature verification using hidden markov models. In Iberoameri-
can Congress on Pattern Recognition, pages 391–399. Springer.
Igarza, J. J., Gómez, L., Hernáez, I., and Goirizelaia, I. (2004). Searching for an optimal
reference system for on-line signature verification based on (x, y) alignment. In Biometric
Authentication, pages 519–525. Springer.
Impedovo, D. and Pirlo, G. (2008). Automatic signature verification: the state of the art.
IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Re-
views), 38(5):609–635.
Impedovo, D., Pirlo, G., Mangini, F. M., Barbuzzi, D., Rollo, A., Balestrucci, A., Impe-
dovo, S., Sarcinella, L., O’Reilly, C., and Plamondon, R. (2013). Writing generation
model for health care neuromuscular system investigation. In International Meeting
on Computational Intelligence Methods for Bioinformatics and Biostatistics, pages 137–
148. Springer.
ISO (2007). Information technology - Biometric data interchange formats - Part 7:
Signature/sign time series data. ISO/IEC FCD 19794-7.
Jain, A. K., Griess, F. D., and Connell, S. D. (2002). On-line signature verification. Pattern
recognition, 35(12):2963–2972.
Kamel, N. S., Sayeed, S., and Ellis, G. A. (2008). Glove-based approach to online sig-
nature verification. IEEE Transactions on Pattern Analysis and Machine Intelligence,
30(6):1109–1113.
Kashi, R., Hu, J., Nelson, W., and Turin, W. (1998). A hidden markov model approach to
online handwritten signature verification. International Journal on Document Analysis
and Recognition, 1(2):102–109.
Kashi, R. S., Hu, J., Nelson, W., and Turin, W. (1997). On-line handwritten signature ver-
ification using hidden markov model features. In Document Analysis and Recognition,
1997., Proceedings of the Fourth International conference on, volume 1, pages 253–257.
IEEE.
Khan, M. K., Khan, M. A., Khan, M. A., and Ahmad, I. (2006). On-line signature verifica-
tion by exploiting inter-feature dependencies. In Pattern Recognition, 2006. ICPR 2006.
18th International Conference on, volume 2, pages 796–799. IEEE.
Komiya, Y., Ohishi, T., and Matsumoto, T. (2001). A pen input on-line signature ver-
ifier integrating position, pressure and inclination trajectories. IEICE Transactions on
Information and Systems, 84(7):833–838.
Leclerc, F. and Plamondon, R. (1994). Automatic signature verification: The state of the
art1989–1993. International Journal of Pattern Recognition and Artificial Intelligence,
8(03):643–660.
Lee, J., Yoon, H.-S., Soh, J., Chun, B. T., and Chung, Y. K. (2004). Using geometric extrema for segment-to-segment characteristics comparison in online signature verification. Pattern Recognition, 37(1):93–103.
Lee, L. L., Berger, T., and Aviczer, E. (1996). Reliable online human signature verification systems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 18(6):643–647.
Li, B., Wang, K., and Zhang, D. (2004). On-line signature verification based on PCA (principal component analysis) and MCA (minor component analysis). In Biometric Authentication, pages 540–546. Springer.
Maiorana, E., Campisi, P., Fierrez, J., Ortega-Garcia, J., and Neri, A. (2010). Cancelable templates for sequence-based biometrics with application to on-line signature recognition. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 40(3):525–538.
Martens, R. and Claesen, L. (1998). Incorporating local consistency information into the
online signature verification process. International Journal on Document Analysis and
Recognition, 1(2):110–115.
Martínez Díaz, M., Fiérrez, J., and Ortega-García, J. (2008). Incorporating signature verification on handheld devices with user-dependent hidden Markov models.
Munich, M. E. and Perona, P. (2002). Visual input for pen-based computers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(3):313–328.
Nabeshima, S., Yamamoto, S., Agusa, K., and Taguchi, T. (1995). Memo-pen: a new
input device. In Conference companion on Human factors in computing systems, pages
256–257. ACM.
Nakanishi, I., Nishiguchi, N., Itoh, Y., and Fukui, Y. (2004). On-line signature verification based on discrete wavelet domain adaptive signal processing. In Biometric Authentication, pages 584–591. Springer.
Nanni, L., Maiorana, E., Lumini, A., and Campisi, P. (2010). Combining local, regional
and global matchers for a template protected on-line signature verification system. Expert
Systems with Applications, 37(5):3676–3684.
Nelson, W., Turin, W., and Hastie, T. (1994). Statistical methods for on-line signature
verification. International Journal of Pattern Recognition and Artificial Intelligence,
8(03):749–770.
Nyssen, E., Sahli, H., and Zhang, K. (2002). A multi-stage online signature verification
system. Pattern Analysis & Applications, 5(3):288–295.
O'Reilly, C. and Plamondon, R. (2011). Impact of the principal stroke risk factors on human movements. Human Movement Science, 30(4):792–806.
O’Reilly, C. and Plamondon, R. (2012b). Looking for the brain stroke signature. In Pattern
Recognition (ICPR), 2012 21st International Conference on, pages 1811–1814. IEEE.
Ortega-Garcia, J., Fierrez, J., Alonso-Fernandez, F., Galbally, J., Freire, M. R., Gonzalez-Rodriguez, J., Garcia-Mateo, C., Alba-Castro, J.-L., Gonzalez-Agulla, E., Otero-Muras, E., et al. (2010). The multiscenario multienvironment BioSecure multimodal database (BMDB). IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(6):1097–1111.
Ortega-Garcia, J., Fierrez-Aguilar, J., Simon, D., Gonzalez, J., Faundez-Zanuy, M., Espinosa, V., Satue, A., Hernaez, I., Igarza, J.-J., Vivaracho, C., et al. (2003b). MCYT baseline corpus: a bimodal biometric database. IEE Proceedings-Vision, Image and Signal Processing, 150(6):395–401.
Parizeau, M. and Plamondon, R. (1989). What types of scripts can be used for personal
identity verification? Computer Recognition and Human Production of Handwriting,
pages 77–90.
Phillips, P. J., Martin, A., Wilson, C. L., and Przybocki, M. (2000). An introduction to evaluating biometric systems. Computer, 33(2):56–63.
Pirlo, G., Cuccovillo, V., Diaz-Cabrera, M., Impedovo, D., and Mignone, P. (2015a). Multidomain verification of dynamic signatures using local stability analysis. IEEE Transactions on Human-Machine Systems, 45(6):805–810.
Pirlo, G., Diaz, M., Ferrer, M. A., Impedovo, D., Occhionero, F., and Zurlo, U. (2015b). Early diagnosis of neurodegenerative diseases by handwritten signature analysis. In International Conference on Image Analysis and Processing, pages 290–297. Springer.
Plamondon, R. (1995). A kinematic theory of rapid human movements: Part III. Kinetic outcomes. Biological Cybernetics, 72(4):295–307.
Plamondon, R. and Lorette, G. (1989). Automatic signature verification and writer identification: the state of the art. Pattern Recognition, 22(2):107–131.
Plamondon, R., Pirlo, G., and Impedovo, D. (2014). Online signature verification. In
Handbook of Document Image Processing and Recognition, pages 917–947. Springer.
Plimmer, B., Grundy, J., Hosking, J., and Priest, R. (2006). Inking in the IDE: Experiences with pen-based design and annotation. In Visual Languages and Human-Centric Computing, 2006. VL/HCC 2006. IEEE Symposium on, pages 111–115. IEEE.
Qu, T., El Saddik, A., and Adler, A. (2003). Dynamic signature verification system using stroke-based features. In Haptic, Audio and Visual Environments and Their Applications, 2003. HAVE 2003. Proceedings. The 2nd IEEE International Workshop on, pages 83–88. IEEE.
Rabasse, C., Guest, R., and Fairhurst, M. (2007). A method for the synthesis of dynamic
biometric signature data. In Document Analysis and Recognition, 2007. ICDAR 2007.
Ninth International Conference on, volume 1, pages 168–172. IEEE.
Rosenblum, S., Samuel, M., Zlotnik, S., Erikh, I., and Schlesinger, I. (2013). Handwriting as an objective tool for Parkinson's disease diagnosis. Journal of Neurology, 260(9):2357–2361.
Salicetti, S. G., Houmani, N., and Dorizzi, B. (2008). A client-entropy measure for on-line
signatures. In Biometrics Symposium, 2008. BSYM’08, pages 83–88. IEEE.
Schmidt, C. and Kraiss, K.-F. (1997). Establishment of personalized templates for automatic signature verification. In Document Analysis and Recognition, 1997. Proceedings of the Fourth International Conference on, volume 1, pages 263–267. IEEE.
Schomaker, L. R. and Plamondon, R. (1990). The relation between pen force and pen-point
kinematics in handwriting. Biological Cybernetics, 63(4):277–289.
Swanepoel, J. and Coetzer, J. (2014). Feature weighted support vector machines for writer-
independent on-line signature verification. In Frontiers in Handwriting Recognition
(ICFHR), 2014 14th International Conference on, pages 434–439. IEEE.
Tolosana, R., Vera-Rodriguez, R., Ortega-Garcia, J., and Fierrez, J. (2015). Preprocessing
and feature selection for improved sensor interoperability in online biometric signature
verification. IEEE Access, 3:478–489.
Uludag, U., Pankanti, S., Prabhakar, S., and Jain, A. K. (2004). Biometric cryptosystems:
issues and challenges. Proceedings of the IEEE, 92(6):948–960.
Van, B. L., Garcia-Salicetti, S., and Dorizzi, B. (2007). On using the Viterbi path along with HMM likelihood information for online signature verification. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 37(5):1237–1247.
Van Gemmert, A., Adler, C. H., and Stelmach, G. (2003). Parkinson's disease patients undershoot target size in handwriting and similar tasks. Journal of Neurology, Neurosurgery & Psychiatry, 74(11):1502–1508.
Wang, D., Zhang, Y., Yao, C., Wu, J., Jiao, H., and Liu, M. (2010). Toward force-based
signature verification: A pen-type sensor and preliminary validation. IEEE Transactions
on Instrumentation and Measurement, 59(4):752–762.
Wessels, T. and Omlin, C. W. (2000). A hybrid system for signature verification. In Neural
Networks, 2000. IJCNN 2000, Proceedings of the IEEE-INNS-ENNS International Joint
Conference on, volume 5, pages 509–514. IEEE.
Wibowo, C. P., Thumwarin, P., and Matsuura, T. (2014). On-line signature verification based on forward and backward variances of signature. In Information and Communication Technology, Electronic and Electrical Engineering (JICTEE), 2014 4th Joint International Conference on, pages 1–5. IEEE.
Wirotius, M., Ramel, J.-Y., and Vincent, N. (2005). Comparison of point selection for
characterizing on-line signature. In Defense and Security, pages 307–313. International
Society for Optics and Photonics.
Woch, A., Plamondon, R., and O'Reilly, C. (2011). Kinematic characteristics of successful movement primitives in young and older subjects: a delta-lognormal comparison. Human Movement Science, 30:1–17.
Wolf, F., Basu, T., Dutta, P. K., Vielhauer, C., Oermann, A., and Yegnanarayana, B. (2006). A cross-cultural evaluation framework for behavioral biometric user authentication. In From Data and Information Analysis to Knowledge Engineering, pages 654–661. Springer.
Wu, Q.-Z., Jou, I.-C., and Lee, S.-Y. (1997a). On-line signature verification using LPC cepstrum and neural networks. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 27(1):148–153.
Wu, Q.-Z., Lee, S.-Y., and Jou, I.-C. (1997b). On-line signature verification based on split-
and-merge matching mechanism. Pattern Recognition Letters, 18(7):665–673.
Wu, Q.-Z., Lee, S.-Y., Jou, I.-C., et al. (1998). On-line signature verification based on
logarithmic spectrum. Pattern Recognition, 31(12):1865–1871.
Yan, J. H., Rountree, S., Massman, P., Doody, R. S., and Li, H. (2008). Alzheimer’s disease
and mild cognitive impairment deteriorate fine movement control. Journal of Psychiatric
Research, 42(14):1203–1212.
Yang, L., Widjaja, B., and Prasad, R. (1995). Application of hidden Markov models for signature verification. Pattern Recognition, 28(2):161–170.
Yeung, D.-Y., Chang, H., Xiong, Y., George, S., Kashi, R., Matsumoto, T., and Rigoll, G. (2004). SVC2004: First international signature verification competition. In Biometric Authentication, pages 16–22. Springer.
Yoon, H., Lee, J., and Yang, H. (2002). An online signature verification system using hidden Markov model in polar space. In Frontiers in Handwriting Recognition, 2002. Proceedings. Eighth International Workshop on, pages 329–333. IEEE.
Yue, K. and Wijesoma, W. (2000). Improved segmentation and segment association for on-
line signature verification. In Systems, Man, and Cybernetics, 2000 IEEE International
Conference on, volume 4, pages 2752–2756. IEEE.
Zou, M., Tong, J., Liu, C., and Lou, Z. (2003). On-line signature verification using local
shape analysis. In Document Analysis and Recognition, 2003. Proceedings. Seventh
International Conference on, pages 314–318. IEEE.
ABOUT THE EDITORS
Byron Leite completed his Ph.D. with an emphasis in Artificial Intelligence at the Federal University of Pernambuco (Brazil) in 2008. He began his activities as a Professor of the Polytechnic School at the University of Pernambuco in 2009. He is currently an Associate Professor, Researcher, and Coordinator of the Graduate Program in Computer Engineering at the University of Pernambuco. As leader of the Pattern Recognition Research Group, Byron has developed dozens of research and technological innovation projects, bringing together students and researchers from his network of collaborators. He has experience in Computer Science, working mainly on the following topics: document processing, handwriting recognition, gesture recognition, recommendation systems, filtering, and information retrieval. Since 2010 he has maintained important collaborations with internationally recognized researchers and research groups. Through fruitful collaborations with companies, Byron has developed several research and development projects in his research areas, notably technologies for processing digital document images for the private and public sectors. Through these partnerships with high-tech companies, Byron has contributed to the design and improvement of document capture and imaging systems, forms processing, handwriting recognition, and signature verification software, which, after 10 years of use, have already processed more than 1 billion documents.
Cleber Zanchettin
Universidade Federal de Pernambuco (Brazil)
Adjunct Professor
Cleber Zanchettin received the Ph.D. degree in computer science from the Universidade Federal de Pernambuco, Recife, Brazil, in 2008. He is currently a Professor and a Technical Reviewer with the Center of Informatics, Universidade Federal de Pernambuco. He has authored over 60 papers in international refereed journals and conferences on pattern recognition, artificial neural networks (ANNs), and intelligent systems. He serves as a reviewer for many international journals, including IEEE Transactions on Systems, Man and Cybernetics; IEEE Transactions on Neural Networks and Learning Systems; Talanta (Elsevier); Applied Soft Computing (Elsevier); Neurocomputing (Elsevier); Engineering Applications of Artificial Intelligence (Elsevier); Neural Computing & Applications (Springer); International Journal of Machine Learning and Cybernetics (Springer); International Journal of Adaptive Control and Signal Processing (Wiley); and International Journal of Computational Intelligence and Applications (World Scientific). He has served on the scientific and program committees of many international conferences, including NCTA, IJCCI, IJCNN, ICONIP, IBERAMIA, BRACIS, ENIA, SBC, and SBRN. He was an editor of the special issue "Advances in Intelligent Systems: Selected Papers from the 2012 Brazilian Symposium on Neural Networks (SBRN 2012)" of Neurocomputing (2014) and of the special issue "Feature and Algorithm Selection with Hybrid Intelligent Techniques" of the International Journal of Hybrid Intelligent Systems (2012). His current research interests include hybrid neural systems and applications of ANNs.
Giuseppe Pirlo
Università degli Studi di Bari Aldo Moro (Italy)
Associate Professor
INDEX

A

academic performance, 333
accuracy, vii, 14, 17, 18, 20, 25, 27, 35, 64, 66, 67, 68, 70, 71, 72, 74, 75, 77, 79, 88, 97, 120, 155, 158, 182, 193, 211, 216, 221, 222, 268, 279, 283, 288, 347, 375
alphabet, 334
ancestors, 196
anisotropy, 332
annotation, 198, 259, 261, 262, 264, 265
assessment, 49, 156, 278, 334, 335, 341, 343, 354, 370
authentication, 345, 346, 347, 354, 359, 379, 381, 386
authenticity, 364, 368, 369
automate, 192
automaticity, 343
automation, 273, 334

B

basic education, 339
Bentham, 66, 73, 75, 76, 82, 83, 84, 97, 103, 104, 105, 106, 107, 108, 109, 115, 116, 121, 122, 124, 125, 126, 133, 152, 310, 311, 312
binarization, 5, 33, 35, 36, 37, 38, 42, 44, 45, 46, 47, 49, 52, 53, 54, 55, 56, 57, 58, 59, 62, 63, 64, 65, 69, 84, 86, 87, 89, 90, 91, 92, 152, 157, 158, 159, 163, 164, 194, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241, 244, 259, 262, 265, 267, 271, 272, 273, 274, 299, 301, 302, 303, 304, 316
binary decision, 78

C

CDF 9/7 Wavelet Transform, 5, 97, 102, 108
comparative analysis, 364, 383
complexity, 19, 20, 26, 95, 114, 150, 189, 193, 194, 198, 199, 201, 284, 375, 376, 377
comprehension, 27, 34
computation, 30, 60, 63, 96, 120, 125, 145, 162, 175, 192, 196, 201, 220, 303, 314
computer, 3, 11, 24, 30, 45, 81, 95, 113, 114, 144, 169, 273, 278, 364
computer systems, 3, 113
construction, 19, 47, 49, 149, 184, 209, 230, 232, 233, 234, 235, 236, 237, 238, 239, 252, 272, 337
cultural heritage, vii, 227, 277

D

database, 25, 70, 75, 78, 79, 80, 81, 83, 84, 88, 103, 105, 106, 115, 116, 124, 126, 140, 141, 146, 149, 150, 151, 152, 153, 163, 220, 233, 251, 269, 270, 273, 287, 294, 311, 313, 346, 347, 348, 349, 350, 351, 353, 359, 360, 361, 370, 379, 383
datasets, vii, 5, 22, 23, 24, 25, 57, 62, 65, 71, 76, 82, 91, 97, 149, 150, 151, 153, 154, 158, 160, 182, 190, 200, 213, 297, 298, 302, 306, 310, 314, 346, 347, 348, 351, 352, 354, 356, 358
decoding, 82, 96, 99, 118, 133, 139, 141, 148, 280, 282, 283, 284, 287, 289, 290, 291, 292, 293, 306, 336
deformation, 19, 20, 21, 22, 25, 27, 31, 193, 303
degradation, 4, 35, 46, 53, 58, 63, 64, 85, 95, 171, 297
Dense SIFT (DSIFT), 78
detection, 12, 36, 47, 53, 54, 57, 63, 66, 67, 68, 70, 86, 99, 175, 205, 206, 213, 223, 234, 251, 254, 271, 306, 319, 329, 360, 371, 379
dimensionality, 102, 211
Discrete Wavelet Transform (DWT), 100, 371
disorder, 335, 342
displacement, 13, 19, 20, 366, 372
distortions, 19, 21, 27, 58, 286
diversity, 5, 98, 160
dominance, 97, 142, 175, 209
drawing, 46, 79, 86, 322, 382
dyes, 227, 230
dyslexia, 335, 343
E

economic disadvantage, 342
education, 331, 339, 340
educational psychology, 341
elementary school, 242, 341
encoding, 81, 191, 192, 195, 199, 213, 288, 300
equalization, 36, 37, 43, 44, 55
ergonomics, 282, 290
evolution, 6, 27, 135, 136
expert decisions, 371
expert systems, 375
extraction, vii, 5, 33, 39, 47, 54, 64, 70, 71, 75, 85, 88, 89, 91, 92, 96, 97, 98, 100, 105, 106, 122, 125, 126, 142, 153, 182, 201, 211, 212, 213, 224, 230, 246, 247, 249, 250, 254, 263, 267, 270, 271, 272, 273, 274, 275, 286, 300, 302, 304, 305, 307, 347, 356, 358, 364, 371, 372
extraction method, 71, 246, 272, 275
extracts, 71, 141, 298

F

feature detectors, 145
feature selection, 385
feature(s) extraction, vii, 5, 39, 64, 75, 96, 97, 100, 105, 106, 122, 123, 125, 126, 142, 182, 211, 212, 213, 224, 230, 246, 247, 249, 250, 251, 254, 263, 267, 270, 272, 273, 274, 286, 299, 300, 305, 314, 364, 366, 371
filters, 21, 101, 102, 103, 194
fingerprints, 349
formation, 214, 216, 222, 326, 341
formula, 170, 174, 207, 208, 257
Fourier analysis, 371

G

grades, 333, 342
graph, 11, 61, 67, 76, 77, 90, 120, 125, 174, 175, 181, 196, 197, 198, 199, 208, 209, 280, 300, 303, 319
grouping, 69, 70, 181

H

handheld devices, 365, 372, 382
handwriting, vii, 4, 5, 6, 9, 12, 14, 17, 27, 28, 29, 71, 82, 86, 89, 90, 96, 97, 98, 99, 105, 109, 110, 113, 114, 115, 117, 121, 142, 143, 144, 145, 146, 147, 148, 149, 150, 153, 154, 155, 156, 160, 161, 162, 163, 164, 171, 182, 191, 207, 208, 209, 211, 212, 213, 214, 216, 222, 224, 225, 270, 279, 290, 313, 316, 319, 322, 323, 324, 325, 326, 327, 328, 329, 330, 331, 333, 334, 339, 340, 341, 342, 343, 347, 349, 361, 363, 365, 376, 377, 378, 384, 385
handwriting analysis, vii
handwriting synthesis, 153, 162
handwritten character recognition, 31, 97, 230, 245
handwritten signature, vii, 345, 346, 347, 348, 349, 353, 354, 358, 361, 363, 364, 365, 375, 377, 380, 381, 382, 384
handwritten signature verification systems, vii
handwritten text recognition (HTR), vii, 57, 73, 95, 97, 99, 101, 103, 105, 107, 109, 110, 111, 144, 147, 277, 294, 295
hardware design for on-line HTR, vii
HCC, 384
high school, 242, 333, 339
histogram, 34, 36, 40, 43, 55, 212, 220, 246, 248, 249, 252, 254, 301, 304, 305
historical archives, vii
historical data, 57, 67, 70, 71, 73, 152
historical handwritten documents, 57, 59, 62, 66, 67, 73, 74, 87, 89, 92, 95, 161, 279, 315
history, 307
HTR systems, vii, 98, 279, 286
HTR workflow, vii
HTR-related applications, vii
HTR-related topics, vii

I

IAM, 71, 97, 115, 121, 122, 124, 125, 126, 131, 132, 133, 139, 140, 141, 142, 143, 146, 151, 310, 312, 313
image analysis, 228, 265, 274, 297
image thresholding, 35, 52, 53, 55, 164
individual character, 78, 286
individual differences, 341
individuals, 16, 211, 345, 348, 349, 350, 353, 370, 375
international competition, 199, 311
interoperability, 347, 354, 358, 365, 376, 377, 385
issues, 4, 15, 16, 25, 28, 33, 41, 48, 68, 171, 190, 191, 227, 333, 365, 378, 385

K

keyword spotting, vii, 6, 57, 79, 80, 81, 82, 83, 84, 85, 87, 90, 91, 93, 152, 297, 314, 315, 316, 317, 318, 319

L

languages, 12, 79, 163, 169, 184, 212, 214, 246, 251, 305, 313
latency, 322, 326
latent semantic indexing (LSI), 251, 252, 263, 268, 269, 306
R

rules, 28, 66, 73, 175, 176, 184, 185, 189, 201, 204, 212, 216, 218, 300, 301, 371

S

school, 333, 334, 335
school performance, 333
science, 3, 144, 271, 315, 340, 342, 364, 382, 383, 384
scripting language recognition, vii
scripts, 5, 77, 83, 169, 172, 211, 212, 222, 224, 228, 245, 246, 251, 273, 383
short-term memory, 30, 73, 87, 145, 162, 213
SIGMA, 350, 358, 359
skimming, 378
society, 3, 271, 273, 315, 363, 374
spatial information, 62, 176, 179, 180, 188, 300, 303
speech, vii, 6, 12, 31, 96, 97, 111, 114, 121, 125, 135, 143, 144, 145, 146, 147, 148, 212, 225, 278, 279, 280, 281, 282, 283, 284, 286, 290, 291, 293, 294, 295, 301, 305, 334, 335
speech processing, 301
spelling, 73, 304, 335, 337, 342
statistic test, 336
statistical, 30, 31, 32, 109, 110, 154, 177, 208, 211, 224, 225, 294, 383
statistics, 54, 150, 289, 360, 371
stroke, 4, 34, 35, 36, 39, 40, 43, 44, 45, 48, 54, 62, 63, 64, 81, 89, 96, 156, 170, 171, 177, 180, 181, 182, 189, 196, 197, 198, 202, 203, 204, 205, 210, 212, 213, 214, 215, 216, 217, 218, 219, 220, 222, 223, 234, 239, 289, 303, 347, 354, 356, 371, 376, 377, 380, 383, 384
stroke width, 34, 39, 40, 43, 44, 45, 48
structural contrast, 39, 40, 41, 42, 53
structure, 6, 11, 13, 15, 21, 24, 70, 79, 97, 98, 103, 104, 108, 122, 150, 153, 157, 164, 172, 173, 174, 175, 176, 177, 178, 179, 184, 187, 191, 192, 197, 199, 201, 205, 207, 252, 315, 321, 322, 329, 331, 364
structuring, 44, 237
style(s), 4, 16, 22, 66, 73, 76, 77, 95, 98, 104, 170, 211, 214, 246, 251, 297, 304, 313, 319, 375, 380
symbols and connections, 153

T

techniques, vii, 4, 5, 6, 9, 12, 16, 17, 23, 26, 27, 28, 53, 55, 57, 58, 59, 60, 61, 62, 63, 64, 65, 71, 73, 74, 81, 88, 95, 96, 98, 99, 105, 150, 153, 159, 160, 164, 224, 270, 271, 273, 278, 298, 305, 342, 363, 364, 365, 368, 371, 372, 373, 374, 375, 378
technological advances, 4
technologies, 5, 57, 95, 270
technology, vii, 6, 95, 153, 277, 279, 322, 331, 332, 345, 360, 365, 377, 381
test data, 80, 82, 288, 289
testing, 11, 22, 23, 27, 75, 77, 78, 82, 84, 97, 105, 179, 220, 222, 259, 263, 300, 342, 352
text image preprocessing, vii
texture, 5, 35, 48, 62, 67, 86, 109, 234, 314, 346
thresholding, vii, 5, 33, 34, 35, 36, 38, 39, 40, 42, 44, 47, 48, 52, 53, 54, 55, 62, 63, 64, 69, 86, 102, 122, 158, 164, 236, 274, 301, 303
time constraints, 27
time periods, 13, 376
time series, 381
time warp, 90, 317, 372, 383
traits, 345, 349, 363, 378
transactions, 29, 55, 305, 345, 360, 377, 379, 381
transcription, 5, 49, 61, 73, 78, 80, 87, 95, 98, 104, 107, 113, 133, 141, 142, 143, 148, 151, 152, 162, 229, 234, 236, 237, 259, 277, 278, 279, 280, 281, 282, 283, 289, 290, 291, 293, 294, 314, 319, 341
transcripts, 107
transformation, 10, 17, 19, 100, 103, 106, 108, 210, 371, 372

V

validation, 11, 21, 24, 26, 75, 77, 78, 105, 107, 115, 116, 122, 124, 125, 140, 141, 200, 201, 203, 217, 218, 259, 267, 269, 282, 340, 379, 385
valuation, 143, 146, 239, 240, 241, 244
variables, 45, 119, 179, 289, 336, 337, 338, 376
variations, 15, 22, 24, 26, 27, 39, 42, 45, 46, 63, 124, 216, 230, 257, 287, 297, 298, 304, 354
visual acuity, 43
visual perception, 42, 53, 55
visual system, 44
vocabulary, 74, 75, 77, 90, 96, 105, 109, 114, 122, 124, 125, 126, 135, 139, 140, 141, 142, 146, 147, 155, 277, 279, 287, 289

W

wavelet, 60, 92, 97, 98, 100, 101, 102, 104, 106, 111, 112, 382
wavelet analysis, 112
wavelet transform, 5, 97, 98, 100, 101, 102, 108, 109, 367, 371
word recognition, 16, 75, 77, 88, 90, 96, 110, 144, 152, 212, 223, 318
writing process, 4, 6, 342, 346, 363, 364
writing tasks, 334