Joint Recognition of Handwritten Text and Named Entities With a Neural End-to-End Model

Abstract—When extracting information from handwritten documents, text transcription and named entity recognition are usually faced as separate subsequent tasks. This has the disadvantage that errors in the first module heavily affect the performance of the second module. In this work we propose to do both tasks jointly, using a single neural network with a common architecture used for plain text recognition. Experimentally, the work has been tested on a collection of historical marriage records. Results of experiments are presented to show the effect on the performance for different configurations: different ways of encoding the information, doing or not transfer learning, and processing at text line or multi-line region level. The results are comparable to the state of the art reported in the ICDAR 2017 Information Extraction competition, even though the proposed technique does not use any dictionaries, language modeling or post processing.

Keywords-Named entity recognition; handwritten text recognition; neural networks

I. INTRODUCTION

Extracting information from historical handwritten text documents in an optimal and efficient way is still an open challenge, since the text in these kinds of documents is not as simple to read as printed characters or modern handwritten calligraphies [1], [2]. Historical manuscripts contain information that gives an interpretation of the past of societies. Systems designed to search and retrieve information from historical documents must go beyond literal transcription of sources. Indeed, it is necessary to shorten the semantic gap and extract semantic meaning from the contents; thus the extraction of the relevant information carried by named entities (e.g. names of persons, organizations, locations, dates, quantities, monetary values, etc.) is a key component of such systems. Semantic annotation of documents, and in particular automatic named entity recognition, is not a perfectly solved problem either [3].

Many existing solutions use Artificial Neural Networks (ANNs) to transcribe handwritten text lines and then parse the transcribed text with a Named Entity Recognition model, but the precision of those solutions still leaves room for improvement [1], [2], [4]. One possible approach is to start with already segmented words, obtained by an automatic or manual process, and predict the semantic category using visual descriptors (cf. [5]). This has the benefit that when the named entity prediction is correct, the transcription becomes much easier to predict correctly, since the category restricts the language model. The downside is that we rarely have large amounts of word-level segmented data, which is key for the proper performance of most ANNs. If automatic word segmentation is needed, the whole information extraction process involves three steps, each of which will likely accumulate errors. Another, and the most common, option is to perform handwritten text recognition (HTR) first and then named entity recognition (NER). An advantage of this approach is that it has one step less than the previously explained approach, but it has the drawback that if the transcription is wrong, the NER part is affected.

Recent work in ANNs suggests that models that solve tasks as general as possible may give similar or better performance than a concatenation of subprocesses, due to the error propagation in the different steps, as shown in [6], [7]. This is the main motivation of this work, and consequently we propose a single convolutional-sequential model to jointly perform transcription and semantic annotation. By adding a language model, the transcription can be restricted to each semantic category and therefore improved. The contribution of this work is to show the improvement obtained when joining a sequence of processes into a single one, thus avoiding the accumulation of errors and achieving a generalization that emulates human-like intelligence.

Some examples of historical handwritten text documents include birth, marriage and death records, which provide very meaningful information to reconstruct genealogical trees and track locations of family ancestors, as well as interesting macro-indicators for scholars in social sciences and humanities. The interpretation of such types of documents unavoidably requires the identification of named entities. As experimental scenario we illustrate the performance of the proposed method on a collection of handwritten marriage records.

The rest of the paper is organized as follows: the next section explains the task being considered. In Section III we review the state of the art in HTR and NER. In Section IV we explain our model architecture, ground truth setup and training details. In Section V we analyze the results for the different configurations, and finally in Section VI we give the conclusions.
Figure 1. An example of a document line annotation from [4].
Table I: Semantic and person categories in the IEHHR competition

    Semantic        Person
    ------------    ----------------
    Name            Wife
    Surname         Husband
    Occupation      Wife's father
    Location        Wife's mother
    Civil State     Husband's father
    Other           Husband's mother
                    Other person
                    None

Table II: Marriage Records dataset distribution

                Train    Validation    Test
    Pages       90       10            25
    Records     872      96            253
    Lines       2759     311           757
    Words       28346    3155          8026

    Out of vocabulary words: 5.57 %
II. THE TASK: INFORMATION EXTRACTION IN MARRIAGE RECORDS

The approach presented in this paper is general enough to be applied to many information extraction tasks, but due to time constraints and our access to a particular dataset, the approach is evaluated on the task of information extraction in a system for the analysis of population records, in particular handwritten marriage records. The task consists of transcribing the text and assigning to each word a semantic and a person category, i.e. knowing which kind of word has been transcribed (name, surname, location, etc.) and to which person it refers. The dataset and evaluation protocol are exactly the ones proposed in the ICDAR 2017 Information Extraction from Historical Handwritten Records (IEHHR) competition [4]. The semantic and person categories to identify in the IEHHR competition are listed in Table I.

Two tracks were proposed. In the basic track the goal is to assign the semantic class to each word, whereas in the complete track it is also necessary to identify the person. An example of both tracks is shown in Figure 1.

The dataset for this competition contains 125 pages with 1221 marriage records (paragraphs), where each record contains several text lines giving information on the wife, the husband and their parents' names, occupations, locations and civil states. The text images are provided at word and line level, the increased difficulty of word segmentation naturally arising when choosing to work with line images. More details of the dataset can be found in Table II.

III. STATE OF THE ART

Recent work shows that neural models allow the generalization of problems that were earlier solved separately [7]. This idea can also be applied to information extraction from handwritten text documents, which consists of HTR followed by NER. On the HTR side there is still a long way to go until human-level transcription is achieved [8]. Attention models have helped to understand the inner behavior of neural networks when reading document images, but they still have lower accuracy than Recurrent Neural Network with Connectionist Temporal Classification (RNN+CTC) approaches [9].

Named entity recognition is the problem of detecting and assigning a category to each word in a text, either at part-of-speech level or in pre-defined categories such as names of persons, organizations, locations, expressions of time, quantities, monetary values, percentages, etc. The goal is to select and parse relevant information from the text and the relationships within it. One could think that it would be sufficient to keep a list of locations, common names and organizations, but such lists are rarely complete, and a single name can refer to different kinds of entities. It is also not easy to detect the properties of a named entity and how different named entities relate to each other. The most widely used models for this task are conditional random fields (CRFs), which were the state-of-the-art technique for some time [10], [11].

In the area of Natural Language Processing, Lample et al. [3] proposed a combination of Long Short-Term Memory networks (LSTMs) and CRFs, obtaining good results on the CoNLL 2003 task. That problem is similar to the one we are facing, except that it starts from raw text. In this work the inputs to the system are images of handwritten text lines, for which it is not even known how many characters or words are present. This undoubtedly introduces a higher difficulty.
In Adak's work [12] a similar end-to-end approach from image to semantically annotated text is proposed, but in that case the key relies on identifying capital letters to detect possible named entities. The problem is that in many cases, such as in the IEHHR competition dataset [4], named entities do not always have capital letters; moreover, it is a task-specific approach that could not be used in many other cases.

Finally, another concept that can help to improve the quality of our models' predictions is curriculum learning [13]: letting the model look at the data in a meaningful and ordered way, such that the difficulty of prediction goes from easy to hard, can make the training evolve with much better performance.
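The paper does not specify here which ordering criterion defines "easy", so the following minimal Python sketch is only an illustration of the idea, assuming target-sequence length as the easy-to-hard proxy; the helper name is ours:

    # Minimal curriculum-learning sketch: present easier samples first.
    # Assumption (not from the paper): shorter target sequences are easier.
    def curriculum_order(samples):
        # samples: list of (line_image, target_symbol_sequence) pairs
        return sorted(samples, key=lambda pair: len(pair[1]))

After the first epochs over the ordered data, training could fall back to ordinary random shuffling.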
IV. METHODOLOGY

The main goal of this work is to explore a few possibilities for a single end-to-end trainable ANN model that receives text images as input and outputs transcripts already labeled with their corresponding semantic information. One possibility would be to propose an ANN with two sequence outputs, one for the transcript and the other for the semantic labels. However, keeping an alignment between these two independent outputs complicates a solution. An alternative is to have a single sequence output that combines the transcript and the semantic information, which is the approach taken here. There are several ways in which this information can be encoded such that a model learns to predict it. The next subsection describes the different ways of encoding it that were tested in this work. The following subsections then describe the architecture chosen for the neural network, the image input and the characteristics of the learning.

A. Semantic encoding

The first variable we explored is the way in which the ground truth transcript and semantic labels are encoded so that the model learns to predict them. To allow the model to recognize words not observed during training (out-of-vocabulary), the symbols that the model learns are the individual characters plus a space to identify the separation between words. For the semantic labels, special tags are added to the list of symbols for the recognizer. The different possibilities are explained below; a code sketch of the encoding follows the last variant.

1) Open & close separate tags: In the first approach, the words are enclosed between opening and closing tags that encode the semantic information. Both the category and the person have independent tags. Thus, each word is encoded by starting with the opening category and person symbols, followed by a symbol for each character, and ends with the closing person and category symbols. The "other" and "none" semantics are not encoded. For example, the ground truth of the image shown in Figure 1 would be encoded as:

    h a b i t a t {space} e n {space} <location> <husband> B a r a </husband> </location> {space} a b {space} <name> <wife> E l i s a b e t h </wife> </name> ...

This kind of encoding is not expected to perform well in the IEHHR task: since tags are assigned to only one word at a time, it is redundant to have two tags for each word. However, in other tasks it could make sense to have opening and closing tags, and this is why it has been considered in this work.

2) Single separate tags: Similar to the previous approach, in this case both category and person tags are independent symbols, but only one of each is added before the word. Thus, the ground truth of the previous example would be encoded as:

    h a b i t a t {space} e n {space} <location/> <husband/> B a r a {space} a b {space} <name/> <wife/> E l i s a b e t h {space} J u a n a {space} <state/> <wife/> {space} d o n s e l l a ...

3) Change of person tag: In this variation of the semantic encoding, the person label is only given when there is a change of person, i.e. the person label indicates that all the upcoming words refer to that person until another person label comes, in contrast to the previous approaches, where the person label is given for each word. This approach is possible due to the structured form of the sentences in the dataset: as we can see in Figure 2, the marriage records give the information of all the family members without mixing them.

    <wife/> <name/> E l i s a b e t h {space} <name/> J u a n a {space} <state/> d o n s e l l a ...
4) Single combined tags: The final possibility tested for encoding the named entity information is to combine the category and person labels into a single tag. The example would then be encoded as:

    h a b i t a t {space} e n {space} <location_husband/> B a r a {space} a b {space} <name_wife/> E l i s a b e t h {space} <name_wife/> J u a n a {space} <state_wife/> d o n s e l l a ...
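To make the tag encodings concrete, the following Python sketch generates the single-combined-tags ground truth from (word, category, person) triplets. The triplet input format and the function name are our own assumptions; the output follows the {space} convention of the examples above:

    # Sketch of the "single combined tags" encoding; "other"/"none" labels
    # are skipped, mirroring the untagged words in the examples above.
    def encode_combined(words):
        symbols = []
        for word, category, person in words:
            if category not in ("other", "none"):
                symbols.append(f"<{category}_{person}/>")
            symbols.extend(word)          # one symbol per character
            symbols.append("{space}")     # word-separator symbol
        return " ".join(symbols[:-1])     # drop the trailing separator

    encode_combined([("habitat", "other", "none"),
                     ("en", "other", "none"),
                     ("Bara", "location", "husband")])
    # -> 'h a b i t a t {space} e n {space} <location_husband/> B a r a'

The other three encodings only differ in which tag symbols are emitted and when, so the same loop structure can produce all four variants.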
B. Level of input images: lines or records

The IEHHR competition dataset includes manually segmented images at word level. However, to lower the ground-truthing cost, or to avoid the need for a word segmenter, we assume that only images at line level are available. Given text line images, the obvious approach is to feed the system individual line images for recognition. However, some semantic labels would be very difficult to predict if only a single line image is observed, due to the lack of context. For example, it might be hard to know whether a name corresponds to the husband or to the father of the wife if the full record is not given. Because of this, in the experiments we have explored having as input both text line images and full marriage record images, the latter obtained by concatenating all the lines of a record one after the other.
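One way to build such a record image is to pad every line image of the record to a common height and chain them one after the other into a single long line. The sketch below uses Pillow and NumPy on grayscale images; the tooling and the white background value are our assumptions, not details given in the paper:

    # Sketch: concatenate the line images of one record horizontally.
    import numpy as np
    from PIL import Image

    def concatenate_record(line_paths, background=255):
        lines = [np.array(Image.open(p).convert("L")) for p in line_paths]
        height = max(img.shape[0] for img in lines)
        padded = [np.pad(img, ((0, height - img.shape[0]), (0, 0)),
                         constant_values=background)  # pad bottom with white
                  for img in lines]
        return np.concatenate(padded, axis=1)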
…is a recurrent neural network with m inputs, n outputs and weight vector w. The probabilities of a labeling of an input sequence are calculated with a dynamic programming algorithm called "forward-backward".

Some special features of our model are that the activation function of the convolutional layers is the leaky ReLU, f(x) = x if x > 0 and f(x) = 0.01x otherwise, and that we use batch normalization to reduce internal covariate shift [19].
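The excerpt does not give the exact layer configuration, so the following PyTorch sketch only illustrates the ingredients named above: convolutional blocks with batch normalization and leaky ReLU, a bidirectional recurrent layer, and CTC training (whose label probabilities are computed with the forward-backward algorithm). Every size below is a placeholder assumption:

    # Illustrative sketch of a convolutional-recurrent model trained with CTC.
    import torch.nn as nn

    class ConvBlock(nn.Module):
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
            self.bn = nn.BatchNorm2d(out_ch)  # reduces internal covariate shift
            self.act = nn.LeakyReLU(0.01)     # f(x) = x if x > 0, else 0.01x

        def forward(self, x):
            return self.act(self.bn(self.conv(x)))

    class CRNN(nn.Module):
        def __init__(self, num_symbols, hidden=256):
            super().__init__()
            self.features = nn.Sequential(ConvBlock(1, 32), nn.MaxPool2d(2),
                                          ConvBlock(32, 64), nn.MaxPool2d(2))
            # input images assumed 64 px high -> 16 px after two poolings
            self.rnn = nn.LSTM(64 * 16, hidden, bidirectional=True,
                               batch_first=True)
            self.fc = nn.Linear(2 * hidden, num_symbols + 1)  # +1: CTC blank

        def forward(self, images):                 # (batch, 1, 64, width)
            f = self.features(images)              # (batch, 64, 16, width/4)
            f = f.permute(0, 3, 1, 2).flatten(2)   # (batch, width/4, 64*16)
            out, _ = self.rnn(f)
            return self.fc(out).log_softmax(-1)    # per-frame symbol log-probs

    # Training applies nn.CTCLoss() to the (time, batch, symbols) log-probs.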
1 Scripts used for the experiments are available at http://doi.org/10.5281/zenodo.1174113
Figure 3. The model architecture used.