Training an End-to-End System for Handwritten Mathematical Expressions by Generated Patterns
Abstract—Motivated by recent successes in neural machine translation and image caption generation, we present an end-to-end system to recognize online handwritten mathematical expressions (OHMEs). Our system has three parts: a convolutional neural network for feature extraction, a bidirectional LSTM for encoding the extracted features, and an LSTM with an attention model for generating the target LaTeX. To recognize complex structures, the system needs a large amount of training data. We therefore propose local and global distortion models for generating OHMEs from the CROHME database. We evaluate the end-to-end system on the CROHME database and the generated databases. The experimental results show that the end-to-end system achieves 28.09% and 35.19% expression recognition rates on CROHME without and with the generated data, respectively.

Keywords—Online Handwritten Mathematical Expression Recognition, End-to-End Model, Encoder-Decoder Model, Pattern Generation

I. INTRODUCTION

Recognition of online handwritten mathematical expressions (OHMEs) is one of the current challenges in handwriting recognition. It can be divided into three main processes. First, a sequence of input strokes is segmented into hypothetical symbols (symbol segmentation). Then, the hypothetical symbols are recognized by a symbol classifier (symbol recognition). Finally, structural relations among the recognized symbols are determined and the structure of the expression is analyzed by a parsing algorithm in order to provide the most likely interpretation of the input OHME (structural analysis). The recognition problem requires not only segmentation and recognition of symbols but also analysis of two-dimensional (2D) structures and interpretation of the structural relations. Ambiguities arise at all stages of the process.

Many approaches have been proposed for recognizing OHMEs, especially during the last two decades. They are summarized in the survey papers [1, 2] and the recent competition papers [3]. Most of them follow the three interdependent processes mentioned above, which can be handled independently [2] or jointly [4, 5, 6, 7]. In the following, we review a few recent approaches that participated in the Competition on Recognition of Online Handwritten Mathematical Expressions (CROHME).

A system for recognizing OHMEs by a top-down parsing algorithm was proposed by MacLean et al. [4]. The incremental parsing process constructs a shared parse forest that represents all recognizable parses of the input. Then, an extraction process finds the most highly ranked trees in the forest. By using horizontal and vertical order, this method prunes infeasible partitions and makes the method independent of stroke order. However, the worst-case number of sub-partitions that must be considered during parsing and the complexity of the parsing algorithm are still quite large, O(n⁴) and O(n⁴|P|), respectively. The system incorporates a correction mechanism that helps users edit recognition errors.

A global approach that allows mathematical symbols and structural relations to be learned directly from expressions was proposed by Awal et al. [5]. During the training phase, symbol hypotheses are generated without using a language model. A dynamic programming algorithm finds the best segmentation and recognition of the input, and the classifier learns both correct and incorrect segmentations. The training process is repeated to update the classifier until it recognizes the training set of OHMEs correctly. Furthermore, contextual modeling based on structural analysis of the expression is employed, where the models are learned directly from expressions using the global learning scheme.

A formal model for OHME recognition based on a 2D Stochastic Context-Free Grammar (SCFG) and Hidden Markov Models (HMMs) was proposed by Alvaro et al. [6]. The HMMs use both online and offline features to recognize mathematical symbols. The Cocke-Younger-Kasami (CYK) algorithm is modified to parse an input OHME in two dimensions. Range search is used to improve the time complexity from O(n⁴|P|) to O(n³ log n |P|). To determine structural relations among symbols and sub-expressions, a Support Vector Machine (SVM) learns geometric features between bounding boxes.

Le et al. presented a recognition method based on SCFG [7]. Stroke order is exploited to reduce the search space, and the CYK algorithm is employed to parse the sequence of input strokes, so the complexity of the parsing algorithm remains O(n³|P|), like that of the original CYK algorithm. They extended the grammar rules to cope with multiple symbol variations and proposed the concept of a body box, with two SVM models for classifying structural relations. The experiments showed a good recognition rate and practical processing time.

A modified version of Minimum Spanning Tree (MST) based parsing was presented by Hu et al. [8].
III. PATTERNS GENERATION

We propose two types of distortions. The local distortion is applied to each symbol in an OHME, while the global distortion is applied to the whole OHME. The local distortions comprise shear, shrink, perspective, shrink plus rotation, and perspective plus rotation; the global distortions comprise scaling and rotation. The process of distortion is shown in Figure 2: first, all symbols in an OHME are distorted by the same distortion model; then, the OHME is distorted by the scaling and rotation models sequentially. The distortion models are described in the following subsections.

[Figure 2: the distortion pipeline. An original ME is transformed by one local distortion model (shear, shrink, perspective, shrink + rotation, or perspective + rotation) and then by global scaling and rotation to yield the distorted ME.]

A. Local Distortion

The two shear models are given by Eqs. (1) and (2):

x' = x + y·tan α, y' = y    (1)

x' = x, y' = y + x·tan α    (2)

[Eqs. (3) and (4) define the shrink models and Eqs. (5) and (6) the perspective models; both warp the x and y coordinates with sinusoidal functions of the angle α.]

The shrink plus rotation model applies the shrink and rotation models sequentially; the perspective plus rotation model is defined similarly. The rotation model is shown in Eq. (7):

x' = x·cos β − y·sin β, y' = x·sin β + y·cos β    (7)

where (x', y') are the new coordinates produced by a distortion model, α is the angle of the shear, shrink, and perspective distortion models, and β is the angle of the rotation distortion model. A local distortion model and its parameters are represented by (id, α, β), where id identifies the distortion model (from 1 to 5) and α and β range from −10° to 10°.

Figure 3 shows examples of the local distortion models with α = 10° and β = 10°.

[Figure 3: an original OHME and its local distortions: vertical shear, horizontal shear, shrink + rotation, perspective, and perspective + rotation.]

B. Global Distortion

The global distortion alters the baseline and size of an OHME. We employ rotation and scaling models. The rotation model is the same as in the local distortion; the scaling model is shown in Eq. (8):

x' = k·x, y' = k·y    (8)

where k is the scaling factor. The global distortion model and its parameters are represented by (k, γ), where γ is the angle of the global rotation distortion model, k ranges from 0.7 to 1.3, and γ ranges from −10° to 10°.

An example of the global distortion is shown in Figure 4.

[Fig. 4. Examples of global distortion by the scaling and rotation models: the original OHME and a version with scaling factor 0.7 and rotation angle 7°.]
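The recoverable transforms above (shear, rotation, and scaling) map directly to code. The following is a minimal NumPy sketch, assuming each symbol or OHME is represented as an N×2 array of (x, y) points; the shrink and perspective warps of Eqs. (3)–(6) are omitted:

```python
import numpy as np

def shear_h(pts, alpha):
    # Eq. (1): x' = x + y*tan(alpha), y' = y
    x, y = pts[:, 0], pts[:, 1]
    return np.stack([x + y * np.tan(alpha), y], axis=1)

def shear_v(pts, alpha):
    # Eq. (2): x' = x, y' = y + x*tan(alpha)
    x, y = pts[:, 0], pts[:, 1]
    return np.stack([x, y + x * np.tan(alpha)], axis=1)

def rotate(pts, beta):
    # Eq. (7): standard 2D rotation about the origin, row-vector convention:
    # [x, y] @ [[c, s], [-s, c]] = [x*c - y*s, x*s + y*c]
    c, s = np.cos(beta), np.sin(beta)
    return pts @ np.array([[c, s], [-s, c]])

def scale(pts, k):
    # Eq. (8): x' = k*x, y' = k*y
    return k * pts
```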
C. Pattern Generation

To generate an OHME, we first randomize the five variables (id, α, β, k, γ). Then, all symbols in the OHME are distorted by the local distortion model with (id, α, β). Finally, the OHME is distorted by the global distortion model with (k, γ). Figure 5 shows some OHMEs generated from the original OHME of Figure 3.

[Figure 5: examples of generated OHMEs.]
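Putting Sections III-A to III-C together, the generation procedure is short enough to sketch. In the code below, apply_local is an assumed helper implementing the five local models, and rotate and scale are as sketched earlier; the names and the point-array representation are assumptions, not the paper's implementation:

```python
import math
import random

def generate_ohme(symbols, apply_local, rotate, scale):
    """Generate one distorted OHME from a list of symbols (N x 2 point arrays).

    apply_local(points, model_id, alpha, beta) is assumed to implement the
    five local models (1: shear, 2: shrink, 3: perspective,
    4: shrink + rotation, 5: perspective + rotation).
    """
    # Randomize the five variables (id, alpha, beta, k, gamma).
    model_id = random.randint(1, 5)                # local distortion identifier
    alpha = math.radians(random.uniform(-10, 10))  # local distortion angle
    beta = math.radians(random.uniform(-10, 10))   # local rotation angle
    k = random.uniform(0.7, 1.3)                   # global scaling factor
    gamma = math.radians(random.uniform(-10, 10))  # global rotation angle

    # Apply the same local distortion to every symbol of the OHME ...
    distorted = [apply_local(s, model_id, alpha, beta) for s in symbols]
    # ... then the global distortion (scaling, then rotation) to the whole OHME.
    return [rotate(scale(s, k), gamma) for s in distorted]
```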
IV. EVALUATION

First, we trained the end-to-end system on the CROHME training set. We employ the training set repeatedly; training terminates when no increase in the recognition rate is observed for 10 epochs. The resulting system is referred to as the baseline system. Then, we created two new training datasets, G_CROHME1 and G_CROHME2, by the pattern generation detailed in the following subsection. For each generated dataset, we trained the end-to-end system with global distortions of different parameter values applied at every epoch, as shown in Figure 6(a). We also trained the system without global distortion, as shown in Figure 6(b). Namely, training with global distortions uses images from the training set with fresh global distortions applied at the beginning of every epoch, while training without distortion employs the same images from the training set at every epoch. Then, we evaluated all the systems on the CROHME 2014 test set. Next, we compared the performance of the best end-to-end system with the other systems that participated in CROHME 2014.

[Fig. 6. The training process of the end-to-end model: (a) with global distortions; (b) without global distortions.]
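The epoch-wise scheme of Figure 6(a) can be sketched as follows. Here render and global_distort are hypothetical helpers that rasterize an OHME and apply a freshly sampled (k, γ) global distortion, and the model's fit and expression_rate methods are likewise assumed; the 10-epoch patience mirrors the stopping rule above:

```python
def train(model, train_ohmes, val_set, render, global_distort,
          max_epochs=200, patience=10):
    """Figure 6(a) training scheme: re-distort the training data each epoch."""
    best, wait = 0.0, 0
    for epoch in range(max_epochs):
        # Fresh global distortion parameters at the beginning of every epoch;
        # the Figure 6(b) baseline would skip global_distort() here.
        images = [render(global_distort(e)) for e in train_ohmes]
        model.fit(images)                      # one pass over the training set
        rate = model.expression_rate(val_set)  # validation recognition rate
        if rate > best:
            best, wait = rate, 0
        else:
            wait += 1
            if wait >= patience:               # stop after 10 idle epochs
                break
    return model
```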
A. Databases

We use the CROHME 2014 database [3]. Organized at ICFHR 2014, CROHME 2014 was a contest in which OHME recognition algorithms competed; it allows the performance of the proposed system to be compared with others under the same conditions. There were seven participants. The CROHME 2014 database contains 8,835 OHMEs for training and 986 OHMEs for testing. The number of symbol classes is 101.

We generated more patterns by using the above-mentioned distortion models. We prepared two new training sets, named G_CROHME1 and G_CROHME2, created by generating three and five new OHMEs, respectively, from every OHME in the CROHME training set. Both also include the original OHMEs from the CROHME training set. The numbers of original and generated OHMEs in each training set are shown in Table I.

We employ the CROHME 2013 test set for validation and the CROHME 2014 test set for evaluation.

[Fig. 7. Structure of the CNN feature extraction: Input → Conv (3×3): 64 → Batch Norm → ReLU → Max pooling (2×2) → Conv (3×3): 128 → Batch Norm → ReLU → Max pooling (2×2) → Conv (3×3): 256 → Batch Norm → ReLU → Max pooling (1×2). The parameters of the convolution and max pooling layers are denoted as "Conv (filter size): number of filters" and "Max pooling (filter size)", respectively.]
B. End-to-End System Configuration

A CNN with convolution, batch norm, ReLU, and max-pooling layers was employed for feature extraction, as shown in Figure 7. A single-layer bidirectional LSTM and a single-layer LSTM are used for the encoder and decoder, respectively; the sizes of the hidden states of the encoder and decoder are 256 and 512, respectively. We used mini-batch stochastic gradient descent to learn the parameters, with the initial learning rate set to 0.1. The training process was stopped when the recognition rate on the validation set stopped improving for 10 epochs. The system was implemented using Torch and the Seq2seq-attn NMT system [14]. All the experiments were performed on a 4 GB Nvidia Tesla K20.
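As a structural illustration only, the sketch below renders this configuration in PyTorch-style code (the published system used Torch and Seq2seq-attn [14]). The attention step and the collapsing of the CNN feature map into a 256-dimensional sequence are assumptions, noted in the comments rather than implemented:

```python
import torch.nn as nn

def conv_block(c_in, c_out, pool):
    # One Figure 7 stage: Conv (3x3) -> Batch Norm -> ReLU -> Max pooling.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(),
        nn.MaxPool2d(pool),
    )

class EndToEndRecognizer(nn.Module):
    """Structural sketch: CNN (Figure 7) + BLSTM encoder + LSTM decoder."""

    def __init__(self, vocab_size):
        super().__init__()
        # Feature extraction: 64 -> 128 -> 256 filters; pooling 2x2, 2x2, 1x2.
        self.cnn = nn.Sequential(
            conv_block(1, 64, (2, 2)),
            conv_block(64, 128, (2, 2)),
            conv_block(128, 256, (1, 2)),
        )
        # Single-layer bidirectional LSTM encoder, hidden size 256; the
        # 256-channel feature map is assumed to be collapsed over its height
        # into a sequence of 256-dimensional column features.
        self.encoder = nn.LSTM(256, 256, bidirectional=True, batch_first=True)
        # Single-layer LSTM decoder, hidden size 512, emitting LaTeX tokens;
        # each step would consume a token embedding concatenated with a
        # 512-dim attention context over encoder states (attention not shown).
        self.embed = nn.Embedding(vocab_size, 512)
        self.decoder = nn.LSTM(512 + 512, 512, batch_first=True)
        self.out = nn.Linear(512, vocab_size)
```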
C. Results

The first experiment evaluated the performance of the end-to-end systems trained on the CROHME training set, G_CROHME1, and G_CROHME2. An OHME is recognized correctly at the expression level if all of its symbols, relations, and its structure are recognized correctly. For measurement, we use the expression recognition rate: the proportion of OHMEs recognized correctly at the expression level among all testing OHMEs. The training process is that of Figure 6(b). Table II shows the recognition rates on the validation and testing sets for the different training sets. The recognition rates on both the validation and testing sets increase as the number of training patterns increases.

TABLE II. PERFORMANCE OF THE END-TO-END SYSTEM ON DIFFERENT TRAINING SETS (%)

Rec. rate        CROHME training set   G_CROHME1   G_CROHME2
Validation       17.16                 19.55       21.64
Testing          18.97                 21.10       26.27

Next, Table IV compares the best end-to-end system with the systems that participated in CROHME 2014. The end-to-end system outputs only LaTeX format, so that we obtain only the expression recognition rate. The best end-to-end system is ranked third, after systems I and III.

TABLE IV. COMPARISON OF THE END-TO-END MODEL AND THE RECOGNITION SYSTEMS ON CROHME 2014 (%)

Method        Sym Seg   Sym Seg + Rec   Rel Tree   Exp Rec
I             93.31     86.59           84.23      37.22
II            76.63     66.97           60.31      15.01
III           98.42     93.91           94.26      62.68
IV            85.52     76.64           70.78      18.97
V             88.23     78.45           61.38      18.97
VI            83.05     69.72           66.83      25.66
VII           89.43     76.53           71.77      26.06
End-to-end    N/A       N/A             N/A        35.19

Finally, we evaluate the end-to-end system by the structure recognition rate: the percentage of OHMEs whose structure is recognized correctly irrespective of the symbol labels. For example, the two OHMEs x² + 1 and x³ + 7 share the same structure. Table V shows the structure recognition rates of the end-to-end systems trained on the CROHME training set, G_CROHME1, and G_CROHME2. It shows that the end-to-end systems learn the structures of OHMEs well. To improve the expression recognition rates of the end-to-end systems, the remaining problem is to improve the symbol recognition inside them.

TABLE V. STRUCTURE RECOGNITION RATE OF END-TO-END SYSTEMS WITH DISTORTION ON THE TRAINING SET
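Both measures can be made concrete with a small sketch, assuming predictions and ground truths are given as whitespace-separated LaTeX token strings; the set of layout tokens shown is illustrative, not the system's actual vocabulary:

```python
# Illustrative layout/relation tokens; the real tokenization of the system's
# LaTeX output is not specified in the paper.
RELATION_TOKENS = {"^", "_", "{", "}", "\\frac", "\\sqrt"}

def expression_rate(predictions, references):
    # Expression recognition rate: exact match of the full LaTeX token string.
    correct = sum(p.split() == r.split() for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)

def structure_only(latex):
    # Mask symbol identities, keep layout: "x ^ { 2 } + 1" and "x ^ { 3 } + 7"
    # both reduce to ['*', '^', '{', '*', '}', '*', '*'].
    return [t if t in RELATION_TOKENS else "*" for t in latex.split()]

def structure_rate(predictions, references):
    # Structure recognition rate: match irrespective of symbol labels.
    correct = sum(structure_only(p) == structure_only(r)
                  for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)
```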
V. CONCLUSION

In this paper, we have presented an end-to-end system for recognizing OHMEs. We proposed a combination of local and global distortion models for pattern generation, and the experiments demonstrate the effectiveness of the proposed models. The recognition rate improves as the number of training patterns increases, reaching 28.09%, 34.99%, and 35.19% when training with distortions on the CROHME training set, G_CROHME1, and G_CROHME2, respectively. This shows that the end-to-end system is competitive with existing OHME recognition systems.

There remain several ways to improve the expression recognition rate of the end-to-end system. First, we should generate more OHMEs with more varied structures and employ a GPU with larger memory for training. Second, we should improve the symbol recognition inside the end-to-end system by employing a tree-structured LSTM [15] or a decomposable attention model [16].

ACKNOWLEDGMENT

This research has been supported by a JSPS fellowship under grant number 15J08654.

REFERENCES

[1] K. Chan and D. Yeung, "Mathematical expression recognition: a survey," International Journal of Document Analysis and Recognition, pp. 3-15, 2000.
[2] R. Zanibbi and D. Blostein, "Recognition and retrieval of mathematical expressions," International Journal of Document Analysis and Recognition, pp. 331-357, 2012.
[3] H. Mouchere, C. Viard-Gaudin, R. Zanibbi, and U. Garain, "ICFHR 2014 competition on recognition of on-line handwritten mathematical expressions (CROHME 2014)," Proc. Int'l Conf. Frontiers in Handwriting Recognition, pp. 791-796, 2014.
[4] S. MacLean and G. Labahn, "A new approach for recognizing handwritten mathematics using relational grammars and fuzzy sets," International Journal of Document Analysis and Recognition, vol. 16, pp. 139-163, 2013.
[5] A. M. Awal, H. Mouchère, and C. Viard-Gaudin, "A global learning approach for an online handwritten mathematical expression recognition system," Pattern Recognition Letters, pp. 68-77, 2014.
[6] F. Alvaro, J. Sánchez, and J. Benedí, "Recognition of on-line handwritten mathematical expressions using 2D stochastic context-free grammars and hidden Markov models," Pattern Recognition Letters, pp. 58-67, 2014.
[7] A. D. Le and M. Nakagawa, "A system for recognizing online handwritten mathematical expressions by using improved structural analysis," International Journal of Document Analysis and Recognition, vol. 19, pp. 305-319, 2016.
[8] L. Hu and R. Zanibbi, "MST-based visual parsing of online handwritten mathematical expressions," Proc. 15th International Conference on Frontiers in Handwriting Recognition, pp. 337-342, 2016.
[9] T. Zhang, H. Mouchere, and C. Viard-Gaudin, "Using BLSTM for interpretation of 2-D languages: case of handwritten mathematical expressions," Document Numérique, vol. 19, no. 2-3, pp. 135-157, 2016.
[10] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[11] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, "Show, attend and tell: neural image caption generation with visual attention," Proc. 32nd International Conference on Machine Learning, pp. 2048-2057, 2015.
[12] Y. Deng, A. Kanervisto, and A. M. Rush, "What you get is what you see: a visual markup decompiler," arXiv preprint arXiv:1609.04938, 2016.
[13] B. Chen, B. Zhu, and M. Nakagawa, "Training of an on-line handwritten Japanese character recognizer by artificial patterns," Pattern Recognition Letters, vol. 35, no. 1, pp. 178-185, 2014.
[14] G. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush, "OpenNMT: open-source toolkit for neural machine translation," arXiv preprint arXiv:1701.02810, 2017.
[15] K. S. Tai, R. Socher, and C. Manning, "Improved semantic representations from tree-structured long short-term memory networks," arXiv preprint arXiv:1503.00075, 2015.
[16] A. P. Parikh, O. Tackstrom, D. Das, and J. Uszkoreit, "A decomposable attention model for natural language inference," arXiv preprint arXiv:1606.01933, 2016.