A Lexicon Verification Strategy in a BLSTM Cascade Framework
Abstract—Handwriting recognition has always been a difficult problem, with image-related problems on the one hand and language processing on the other. Significant improvements have been made in handwriting recognition thanks to new recurrent neural networks based on LSTM cells. The high character recognition performance of these networks is almost systematically combined with linguistic knowledge, that is to say a lexicon-driven decoding method, to correct character misrecognitions. Given such high performance, however, we question whether these networks can be used without lexical decoding for word recognition. In this article, we explore this idea by proposing a lexicon verification strategy that provides a very low error rate, at the cost of a substantial amount of rejects. This verification approach therefore fits perfectly into a cascade framework, where the rejects of one classifier are processed by the next classifier of the cascade. The resulting system is nearly insensitive to the lexicon size, while providing a much faster decoding process than a standard lexicon-driven decoding. Furthermore, when processing the final rejects of the cascade with a basic lexical decoding, our approach reaches state-of-the-art performance for isolated word recognition.

I. INTRODUCTION

State-of-the-art handwriting recognition systems are generally guided by linguistic resources such as a lexicon and/or a language model. Although a lexicon-driven recognition strategy allows the correction of character misrecognitions, it raises many problems, such as time and precision issues for large vocabularies. Indeed, in order to cover the largest proportion of a language, one has to consider very large lexicons that require a time-consuming decoding process¹. Using a lexicon also raises the problem of dealing with Out Of Vocabulary (OOV) words: when submitting an OOV word, a lexicon-driven system will output a wrong solution, even if all the characters are successfully recognized. Finally, using an unsuitable lexicon may deteriorate the system's performance for a given task.

¹ For example, the French dictionary Français-Gutenberg contains more than 300K words.

On the other hand, recurrent neural networks with Long Short-Term Memory (LSTM) cells [1] coupled with the Connectionist Temporal Classification (CTC) training algorithm [2] have recently allowed significant progress in handwriting recognition. Generally paired with a lexicon-driven approach, these systems achieved excellent results in previous offline handwriting recognition competitions [3], [4]. However, we believe that they are still sensitive to the undesirable lexicon effects described above.

In this article, we propose a new recognition strategy based on a lexicon-free decoding approach. It relies on a lexicon verification strategy that operates as an efficient rejection stage, coupled with the strength of LSTM neural classifiers. For that, we propose the following simple decision rule:

"If the sequence output of the BLSTM recognizer belongs to the lexicon, then it is accepted; if it is not in the lexicon, then it is rejected."

The advantage of such a strategy lies in its ability to easily generate rejects. Since rejection is a key element in any recognition system, this new approach allows a significant decrease of recognition errors without decreasing the recognition performance. The reasoning behind this decision rule is that a system is very unlikely to generate, by mistake, a hypothesis that belongs to the lexicon. This method also allows the design of a recognition system that has a low sensitivity to the lexicon size and is faster than a lexicon-driven decoding approach.

One of the key points in designing a recognition system that reaches state-of-the-art performance without a lexicon-driven strategy is to achieve a very high character recognition accuracy. For this purpose, we propose to use complementary BLSTM networks that can be trained and combined in a cascade-like architecture.

This article first recalls the main aspects of lexicon-driven recognition systems and recurrent neural networks. Then the cascade architecture, with its LSTM recurrent neural networks and its lexicon verification stage, is presented. Finally, the results of our experiments on the Rimes database are detailed in the last section.

II. RELATED WORKS

A. Lexicon driven handwriting recognition

Handwriting recognition models can be classified according to the character segmentation approach, which can be either explicit or implicit [5], but also according to the choice of the character recognition method (discriminant classifiers for hybrid approaches [6] or Gaussian mixtures for Hidden Markov Models (HMM) [7]). The common
each sequence's frame, and by removing all successive repetitions of each class (joker included), then the joker. Performances at the end of this lexicon-free decoding scheme on a recognition task, although below the state of the art, are interesting and have not, to our knowledge, motivated further studies. Considering these preliminary results, we investigate how the lexicon-free BLSTM decoding scheme can be enhanced. An interesting strategy to explore is the cascade, which provides a simple and effective classifier combination strategy in many application domains of pattern recognition.

III. LSTM NETWORKS CASCADE

A. Classifiers cascade

A cascade of classifiers is a particular combination method that combines classifier decisions sequentially, exploiting the complementary behavior of the classifiers in order to progressively refine the recognition decisions along the cascade. The core of a classifier cascade is the decision stage that allows rejection. Often the rejection criterion consists in applying a threshold on the classifier's confidence score at the current stage of the cascade. This rejection mechanism is essential for the cascade and enables a significant speed-up of the process when many classifiers are involved. To be efficient, however, the classifiers need to be complementary with each other. Many ways to achieve the cascade principle have been proposed in the literature.

The best-known contribution regarding cascades of classifiers is from Viola and Jones [18], dedicated to face detection. In their work, the cascade is based on a large ensemble of weak and diverse classifiers, allowing a quick process with strong rejection. The image is analyzed with a sliding window, and each window must pass through the entire set of classifiers in order to be considered as containing the object of interest.

In [19] and [20], the cascade principle is used to combine the results of a set of strong classifiers with different architectures or different input features. These sets recognize a large number of objects with a low error rate, while relying on a decision system that allows them to transfer rejects to the next classifiers.

This last principle is the one that caught our attention. Indeed, LSTM networks seem to possess all the characteristics of strong classifiers. We must therefore find a decision stage coherent with the cascade principle and with the properties of LSTM networks.

B. Decision stage by lexical verification

One of the most important aspects in the design of a cascade is the decision stage in charge of rejecting or accepting recognition hypotheses. In our case, where we try to recognize unconstrained character sequences, we adopt a lexicon verification strategy. This verification with a lexicon consists in accepting the character string if it belongs to the lexicon, or rejecting it otherwise. The character string is the result of the "best path decoding" [17] described previously (Section II-B), which does not rely on any lexicon. Notice that it is very unlikely that an erroneously recognized sequence belongs to the lexicon.

A first experiment was carried out on isolated word recognition on the Rimes database in order to evaluate such a verification strategy with a BLSTM. As shown in Table I, our lexicon verification system is able to reject efficiently, at the expense of accepting a small proportion of wrong hypotheses that are not detected because they belong to the lexicon. Most of these errors are case, accent and plural errors. This first experiment demonstrates the strength of a verification strategy in combination with BLSTM classifiers.

Network                Recognized (%)   Error (%)   Rejection (%)
BLSTM + verification       66.37           2.25        31.38

Table I. Results of our lexicon verification strategy applied to a BLSTM network on the Rimes database.
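To make this decision stage concrete, the following minimal Python sketch (our illustration, not the authors' code; the function names and the (T, C) posterior matrix are assumptions) implements the best path decoding described above, followed by the lexicon verification rule:

    import numpy as np

    JOKER = 0  # assumed index of the CTC joker (blank) class

    def best_path_decode(posteriors, alphabet):
        """Best path decoding: argmax per frame, collapse repeats, drop the joker.

        posteriors: (T, C) array of frame-wise class probabilities from the BLSTM.
        alphabet:   mapping from class index to character (index JOKER is the blank).
        """
        best = np.argmax(posteriors, axis=1)        # most probable class per frame
        collapsed = [k for i, k in enumerate(best)  # remove successive repetitions
                     if i == 0 or k != best[i - 1]]
        return "".join(alphabet[k] for k in collapsed if k != JOKER)

    def verify(hypothesis, lexicon):
        """Lexicon verification: accept the string iff it is a lexicon word."""
        return hypothesis if hypothesis in lexicon else None  # None means reject

With the lexicon stored as a hash set, the verification is a constant-time membership test, which is what makes the strategy nearly insensitive to the lexicon size.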
We now demonstrate how BLSTM networks can be efficiently combined into a cascade, by building complementary networks.

C. Cascade design

In our context of strong classifiers, the general idea behind the cascade is that each classifier should process the rejects of the previous classifier in the cascade. To be efficient, this method requires complementary classifiers.

A first idea for building complementary networks could have been to train each network on the rejects of the previous one. However, the more networks there are, the fewer training examples remain, leading to training difficulties and poor network convergence. Moreover, such networks would specialize and would no longer bring complementarity, which is why we turned to a simpler and equally efficient solution.

Previous experiments have shown that similar LSTM recurrent neural networks trained with different initial weights can be combined with success [21]. These networks have similar recognition rates, but their connection weights are different.

Inspired by these observations, which reveal a specific property of BLSTM architectures, we chose to exploit the complementarity of BLSTM networks trained with different initial weights. In the following experiments we propose a three-stage cascade architecture based on the complementarity of three identical networks, trained with different random weight initializations. Figure 2 shows the overall architecture and process.
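The overall process of the cascade can be summarized by the sketch below (again our illustration, with hypothetical names, reusing best_path_decode and verify from the previous sketch): each network only sees the inputs rejected by the previous stage, and the final rejects may optionally be handed to a lexicon-driven decoder.

    def cascade_recognize(image, networks, alphabet, lexicon, fallback_decoder=None):
        """BLSTM cascade with lexicon verification as the decision stage.

        networks: trained BLSTMs sharing one architecture but trained from
        different random initializations; each is assumed to map an image
        to a (T, C) matrix of frame-wise posteriors.
        """
        for net in networks:                       # stage 1, 2, 3, ...
            posteriors = net(image)                # hypothetical forward pass
            word = verify(best_path_decode(posteriors, alphabet), lexicon)
            if word is not None:
                return word                        # accepted: the cascade stops here
        if fallback_decoder is not None:           # every stage rejected the input
            return fallback_decoder(image, lexicon)
        return None                                # final reject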
Figure 2. The proposed BLSTM cascade. Each classifier processes the rejects from the previous layer.

IV. EXPERIMENTS AND RESULTS

A. Selected network architecture

As previously mentioned, our cascade is composed of 3 BLSTM networks based on the same architecture. The BLSTM architecture is identical to the one used in [22]: it is a two-layer network composed of 70 and 120 LSTM blocks respectively, separated by a subsampling layer of 100 hidden neurons without bias and with a hyperbolic tangent activation function. The network also has two layers that reduce the sequence length: the first one concatenates the input vectors in pairs, while the second one concatenates the output vectors of the first layer in pairs. For example, a sequence composed of 12 frames of 1-pixel width is transformed into a sequence of length 3, corresponding to 3 groups of 4 pixels width.
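As an illustration of this length reduction, the short sketch below (ours, and schematic: in the actual network the second concatenation operates on the outputs of the first LSTM layer rather than on raw frames) reproduces the 12-frames-to-3-groups example:

    import numpy as np

    def concat_pairs(seq):
        """Concatenate consecutive frames in pairs: (T, d) -> (T // 2, 2 * d)."""
        T, d = seq.shape
        return seq[: T - T % 2].reshape(T // 2, 2 * d)

    frames = np.random.rand(12, 1)                # 12 frames of 1-pixel width
    reduced = concat_pairs(concat_pairs(frames))  # two successive pairings
    print(reduced.shape)                          # (3, 4): 3 groups of 4 pixels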
As input features, we use histograms of oriented gradients (HOG) [23], which have proven their efficiency for handwriting recognition [24]. Images are normalized to a height of 64 pixels, and a sliding window of 8 pixels width extracts the HOG features at a 1-pixel pace.
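A possible implementation of this feature extraction is sketched below with scikit-image; the HOG parameters (orientations, cell and block sizes) are our assumptions, as they are not specified here:

    import numpy as np
    from skimage.feature import hog
    from skimage.transform import resize

    def extract_hog_sequence(image, height=64, window=8, step=1):
        """Turn a grayscale word image into a sequence of HOG feature vectors."""
        h, w = image.shape
        image = resize(image, (height, max(window, int(w * height / h))))
        features = []
        for x in range(0, image.shape[1] - window + 1, step):  # 1-pixel pace
            patch = image[:, x:x + window]                     # 64 x 8 window
            features.append(hog(patch, orientations=8,
                                pixels_per_cell=(8, 8), cells_per_block=(1, 1)))
        return np.array(features)                              # (T, feature_dim)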
This architecture has been selected for its balanced characteristics, performing slightly better than the reference architecture [1] while allowing a fast decoding, 30 milliseconds on average. Other experiments showed us that networks with the same order of parameters have similar performances.

Training is performed with RNNLIB [25]. We run 3 trainings and obtain 3 networks with highly similar performances (character error rates of 11.85%, 11.81% and 11.88%). The results being very close to each other, the order of the networks in the cascade does not matter.

B. Lexicon driven decoding

In order to compare our proposition to state-of-the-art methods, we apply a simple lexicon-driven decoding stage on the rejects at the end of the cascade. This lexicon-driven decoding is performed after removing the frames having the joker as the most probable class. Once the sequence is reduced, we apply a lexical decoding based on the Viterbi algorithm [26].
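A minimal sketch of this reject post-processing is given below, under our own simplifying assumptions: frames whose most probable class is the joker are removed, then every lexicon word is scored by a Viterbi-style dynamic programming alignment over the remaining frame posteriors (the actual word model used by the decoder may differ):

    import numpy as np

    def lexical_decode(posteriors, lexicon, char_to_index, joker=0):
        """Lexicon-driven decoding of a reject: best-scoring lexicon word."""
        keep = np.argmax(posteriors, axis=1) != joker  # drop joker-dominated frames
        logp = np.log(posteriors[keep] + 1e-12)        # (T, C) log-probabilities
        best_word, best_score = None, -np.inf
        for word in lexicon:
            idx = [char_to_index[c] for c in word]
            T, N = logp.shape[0], len(idx)
            if T < N:
                continue
            # dp[t, n]: best log-score aligning frames 0..t to characters 0..n,
            # monotonically, each character covering at least one frame.
            dp = np.full((T, N), -np.inf)
            dp[0, 0] = logp[0, idx[0]]
            for t in range(1, T):
                dp[t, 0] = dp[t - 1, 0] + logp[t, idx[0]]
                for n in range(1, N):
                    dp[t, n] = max(dp[t - 1, n], dp[t - 1, n - 1]) + logp[t, idx[n]]
            score = dp[T - 1, N - 1] / T               # length-normalized score
            if score > best_score:
                best_word, best_score = word, score
        return best_word

The loop over the whole lexicon is what makes such a decoding costly for large vocabularies, which is precisely why we apply it to the rejects only.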
C. The Rimes database

We use the public Rimes database of isolated words used during the ICDAR 2011 competition on French handwriting recognition [27]. This database is divided into three parts: training (51737 images), validation (7464 images) and test (7776 images). The word error rate is evaluated on two recognition tasks: the WR2 task, where the lexicon is composed of all the words of the test data set (1692 words), and the WR3 task, where the lexicon is composed of all the words of the database (5744 words).

D. Preliminary experiment without cascade

This preliminary experiment was performed in order to assess the effects of the verification strategy, which we compare to lexicon-driven decoding. The experiment involves one single BLSTM network, from which three strategies are derived:

• Lexical verification;
• Lexical decoding;
• Lexical verification, followed by lexicon-driven decoding of the rejects.

Network                           Recognition (%)   Error (%)   Rejection (%)
BLSTM + verification                   66.37           2.25        31.38
BLSTM + decoding                       88.35          11.65            0
BLSTM + verification + decoding        88.72          11.28            0

Table II. Preliminary results on a single BLSTM network.

The results are presented in Table II. The most remarkable element is the low word error rate (2.25%) obtained thanks to our lexicon verification strategy, an 80% error reduction compared to a standard lexicon-driven decoding approach. However, this system rejects 31.38% of the hypotheses.

Applying the lexicon-driven decoding after the verification stage barely affects the performance (a 0.37-point difference) compared to the standard lexicon-driven decoding scheme. This slight improvement comes from the verification stage, which validates some words before the decoding: on these words, the lexicon-driven decoding can make mistakes that the best path output does not make, because of the lexicon we use. The interest lies in the significant amount of hypotheses that do not require any further processing: only 31% of them require lexicon-driven decoding, thus dividing the decoding time by 3. This first experiment highlights the value of the lexicon verification strategy when combined with a BLSTM: a very low word error rate can be obtained, because very few wrong character sequence hypotheses belong to the lexicon, which allows a large amount of rejects while keeping good recognition results thanks to the BLSTM. The next section shows how the rejects can be processed using the cascade framework.
E. Cascade results

Following these results, we now test the cascade architecture on the Rimes database. We present 4 tables of results, with and without lexicon-driven decoding, for the WR2 and WR3 tasks. Every figure (word recognition rate, word error rate, or word rejection rate) is computed on the whole database, at each stage of the cascade. We also compute the word misclassification rate of each stage individually, that is to say the error made by a stage on the words it classifies.

Network    Recognition (%)   Error (%)   Rejection (%)   Misclassification (%)
Stage 1         66.37           2.25         31.38               3.3
Stage 2         74.05           2.59         23.37               4.2
Stage 3         78.07           2.82         19.11               5.4

Table III. Cascade results for the WR2 task.

Network    Recognition (%)   Error (%)   Rejection (%)
Stage 1         91.18           8.82             0
Stage 2         92.23           7.77             0
Stage 3         92.85           7.15             0

Table IV. Results for WR2 with lexical decoding applied at each stage.

Tables III and IV report on the WR2 task. First, we can notice the good complementarity of the networks: looking at Table III, we observe that the second stage of the cascade classifies 25% of the rejects of the first stage, of which 4.2% are misclassified, increasing the overall error by only 0.34 points. Of course, 4.2% is greater than the 3.3% misclassification rate of the first stage; however, it is low considering that this error occurs at a deeper stage of the cascade. Indeed, we can consider that recognition decisions taken at deeper levels of the cascade concern examples that are difficult to recognize. The complementarity of the BLSTMs is confirmed by the output of the third network, which classifies no less than 18% of the remaining rejects, for an additional error of 0.23 points and a 5.4% misclassification rate on the classified rejects, which remains consistent with our previous remark.
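The stage-wise figures quoted above can be recovered from the cumulative rates of Table III; the short computation below (our own rearrangement of the published numbers) reproduces the 25%/4.2% values of stage 2 and the 18%/5.4% values of stage 3:

    # Cumulative (recognition, error, rejection) rates from Table III, in %.
    stages = [(66.37, 2.25, 31.38), (74.05, 2.59, 23.37), (78.07, 2.82, 19.11)]

    for (r0, e0, j0), (r1, e1, _) in zip(stages, stages[1:]):
        classified = (r1 - r0) + (e1 - e0)  # share of the database newly decided
        print(f"classified {100 * classified / j0:.1f}% of the previous rejects, "
              f"local misclassification {100 * (e1 - e0) / classified:.1f}%")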
We can thus conclude that the cascade benefits from the decision strategy and from the complementarity of the networks: the recognition rate increases with the number of stages, while the error rate increases in a smaller proportion.

This first conclusion is supported by the good results of the lexicon-driven decoding applied to the final rejects. Table IV shows the performance evolution of a lexicon-driven decoding strategy when applied at every stage of the cascade. The more networks in the cascade, the better the performance: each network brings additional information even under the lexicon-driven decoding constraint. We observe a 19% relative decrease of the word error rate between a one-network cascade and a three-network cascade (from 8.82% to 7.15%).

We now examine whether this first conclusion generalizes to the WR3 task, and analyze the effect of the lexicon size on the cascade performance.

Network    Recognition (%)   Error (%)   Rejection (%)   Misclassification (%)
Stage 1         66.37           2.86         30.78               4.1
Stage 2         73.84           3.38         22.78               6.5
Stage 3         77.76           3.76         18.48               8.8

Table V. Cascade results for the WR3 task.

Network    Recognition (%)   Error (%)   Rejection (%)
Stage 1         88.72          11.28             0
Stage 2         89.99          10.01             0
Stage 3         90.63           9.37             0

Table VI. Results for WR3 with lexical decoding applied at each stage.

Tables V and VI show the results of the same experiment for the WR3 task. The first remarkable element is that the WR2 and WR3 recognition rates are equal at the output of the first stage; only the error rate is slightly increased. This error increase is due to the growth of the lexicon from 1692 to 5744 words, which introduces more possible confusions between words. We find highly similar results afterwards, with 26% and 18.9% of the previous rejects classified at the end of stage 2 and stage 3 respectively. We also observe an increase of the misclassification errors along the stages of the cascade, as for the WR2 task; here the errors are slightly higher due to the larger vocabulary.

We draw the same conclusions for WR3 as for WR2: the recognition rate increases with a slight increase of errors and misclassifications. Likewise, the lexicon-driven decoding error decreases over the cascade levels, namely a 17% relative decrease of the error between the first and the last stage (from 11.28% to 9.37%). For further studies with more networks, one must pay attention to the increase of the misclassification rate.

Comparing the performance on the WR2 and WR3 tasks, we can notice that the recognition rates are nearly identical (78.07% vs. 77.76%) thanks to the strength of the lexicon verification strategy. WR2's lexicon being included in WR3's, a word correctly classified in WR2 will also be correctly classified in WR3; the slight difference comes from the few additional errors due to the larger lexicon. Note, however, that while the decrease of the recognition rate is small, the error rate evolves a bit more significantly (0.94 points).

We now compare these results to those of the TUM system, the winner of the 2009 Rimes competition (since no results are available for the WR2 task in 2011). This system had the lowest sensitivity to the lexicon size in that competition. TUM shows a 2.36% normalized recognition rate difference between WR2 (93.2%, 1612 words) and WR3 (91%, 4943 words), that is (93.2 − 91)/93.2, whereas our system shows a much lower sensitivity of 0.39% between our WR2 and WR3 tasks.
Thanks to this very low recognition difference between WR2 and WR3, we can conclude that our classification scheme is nearly insensitive to the lexicon size. Our system also has the specificity of a low error rate (3.76%), the best of the 2011 Rimes competition [27] with respect to the error. However, our word recognition rate is far lower because of the rejects, ranking only fourth on the recognition metric. To allow a fairer comparison, we apply a lexical decoding on the final rejects of the cascade and obtain a very encouraging error rate of 9.37%, the second best performance of the 2011 Rimes competition, behind a system that uses a combination of 7 classifiers including MLP/HMM, GMM/HMM and MDLSTM.

V. CONCLUSION

In this article we proposed a BLSTM cascade architecture with a lexical verification stage. This new approach shows several interesting properties: low sensitivity to the lexicon size, a low error rate, fast computation for large lexicons, and a rejection ability. By processing the remaining rejects with a lexicon-driven decoding, we obtain promising results that bring us close to state-of-the-art methods.

One weak point is the impossibility of correcting mistakes made by the early networks of the cascade. Another weak point is the inability of the system to recognize out-of-vocabulary words: a lexicon is still required for the verification. The perspectives of this work will focus on controlling the error, analyzing the cascade when used with more networks, and studying other networks for better complementarity. The method should also be applied to other databases and larger lexicons to validate it.
REFERENCES

[1] A. Graves and J. Schmidhuber, "Offline handwriting recognition with multidimensional recurrent neural networks," in NIPS, 2009, pp. 545–552.
[2] A. Graves and J. Schmidhuber, "Framewise phoneme classification with bidirectional LSTM and other neural network architectures," Neural Networks, vol. 18, no. 5, pp. 602–610, 2005.
[3] E. Grosicki and H. El Abed, "ICDAR 2009 handwriting recognition competition," in ICDAR. IEEE, 2009, pp. 1398–1402.
[4] H. El Abed, V. Margner, M. Kherallah, and A. M. Alimi, "ICDAR 2009 online Arabic handwriting recognition competition," in ICDAR. IEEE, 2009, pp. 1388–1392.
[5] R. Plamondon and S. N. Srihari, "Online and off-line handwriting recognition: a comprehensive survey," PAMI, vol. 22, no. 1, pp. 63–84, 2000.
[6] A. Senior and T. Robinson, "Forward-backward retraining of recurrent neural networks," NIPS, pp. 743–749, 1996.
[7] A. El-Yacoubi, M. Gilloux, R. Sabourin, and C. Y. Suen, "An HMM-based approach for off-line unconstrained handwritten word modeling and recognition," PAMI, vol. 21, no. 8, pp. 752–760, 1999.
[8] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[9] T. Plötz and G. A. Fink, "Markov models for offline handwriting recognition: a survey," IJDAR, vol. 12, no. 4, pp. 269–298, 2009.
[10] A. L. Koerich, R. Sabourin, and C. Y. Suen, "Large vocabulary off-line handwriting recognition: A survey," PAA, vol. 6, no. 2, pp. 97–121, 2003.
[11] A. Brakensiek, J. Rottland, and G. Rigoll, "Handwritten address recognition with open vocabulary using character n-grams," in IWFHR. IEEE, 2002, pp. 357–362.
[12] C. Chatelain, L. Heutte, and T. Paquet, "A two-stage outlier rejection strategy for numerical field extraction in handwritten documents," in ICPR, Hong Kong, China, vol. 3, 2006, pp. 224–227.
[13] M. Hamdani, A. E.-D. Mousa, and H. Ney, "Open vocabulary Arabic handwriting recognition using morphological decomposition," in ICDAR. IEEE, 2013, pp. 280–284.
[14] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[15] M. Schuster and K. K. Paliwal, "Bidirectional recurrent neural networks," SP, vol. 45, no. 11, pp. 2673–2681, 1997.
[16] A. Graves, S. Fernández, F. J. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," in ICML, Pittsburgh, Pennsylvania, USA, 2006, pp. 369–376.
[17] A. Graves, Supervised sequence labelling with recurrent neural networks. Springer, 2012, vol. 385.
[18] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in CVPR, vol. 1. IEEE, 2001, pp. I–511.
[19] B. Zhang, "Reliable classification of vehicle types based on cascade classifier ensembles," ITS, vol. 14, no. 1, pp. 322–332, 2013.
[20] P. Zhang, T. D. Bui, and C. Y. Suen, "A novel cascade ensemble classifier system with a high recognition performance on handwritten digits," PR, vol. 40, no. 12, pp. 3415–3429, 2007.
[21] F. Menasri, J. Louradour, A.-L. Bianne-Bernard, and C. Kermorvant, "The A2iA French handwriting recognition system at the Rimes-ICDAR2011 competition," in IS&T/SPIE Electronic Imaging, 2012, pp. 82970Y–82970Y.
[22] L. Mioulet, G. Bideault, C. Chatelain, T. Paquet, and S. Brunessaux, "Exploring multiple feature combination strategies with a recurrent neural network architecture for off-line handwriting recognition," in DRR, San Francisco, USA, 2015, pp. 94020F–94020F.
[23] N. Dalal and B. Triggs, "Histograms of oriented gradients for human detection," in CVPR, vol. 1. IEEE, 2005, pp. 886–893.
[24] G. Bideault, L. Mioulet, C. Chatelain, and T. Paquet, "Spotting handwritten words and regex using a two stage BLSTM-HMM architecture," in DRR, San Francisco, USA, 2015.
[25] A. Graves, "RNNLIB: A recurrent neural network library for sequence learning problems," https://sourceforge.net/projects/rnnl.
[26] A. J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimum decoding algorithm," IT, vol. 13, no. 2, pp. 260–269, 1967.
[27] E. Grosicki and H. El-Abed, "ICDAR 2011 - French handwriting recognition competition," in ICDAR. IEEE, 2011, pp. 1459–1463.