
Preprints (www.preprints.org) | NOT PEER-REVIEWED | Posted: 15 October 2020

Learning by Injection: Attention Embedded Recurrent Neural Network for Amharic Text-image Recognition
Birhanu Belay∗+ , Tewodros Habtegebrial∗ , Gebeyehu Belay † , Million Meshesha‡ ,
Marcus Liwicki§ and Didier Stricker ¶
∗ Dept. of Computer Science, University of Kaiserslautern, Kaiserslautern, Germany
Email:{ b [email protected]}
† + Faculty of Computing, Bahir Dar Institute of Technology, Bahir Dar, Ethiopia
‡ Faculty of Informatics, Addis Ababa University, Addis Ababa, Ethiopia
§ Department of Computer Science, Lulea University of Technology, Lulea, Sweden
¶ DFKI-German Research Center for Artificial Intelligence, Kaiserslautern, Germany

Abstract—In the present, the growth of digitization and worldwide communications makes OCR systems for exotic languages a very important task. In this paper, we attempt to develop an OCR system for one of these exotic languages with a unique script, Amharic. Motivated by the recent success of the attention mechanism in Neural Machine Translation (NMT), we extend the attention mechanism to Amharic text-image recognition. The proposed model consists of CNNs and attention-embedded recurrent encoder-decoder networks that are integrated following the configuration of the seq2seq framework. The attention network parameters are trained in an end-to-end fashion, and the context vector is injected, together with the previously predicted output, at each time step of decoding. Unlike the existing OCR models that minimize the CTC objective function, the new model minimizes the categorical cross-entropy loss. The performance of the proposed attention-based model is evaluated against the test datasets from the ADOCR database, which consists of both printed and synthetically generated Amharic text-line images, and achieves promising results with CERs of 1.54% and 1.17% respectively.

Index Terms—Amharic script, Attention mechanism, OCR, Encoder-decoder, Text-image recognition

I. INTRODUCTION

Amharic is an official working language of the Federal Democratic Republic of Ethiopia. As many as 100 million people around the world speak Amharic, making it the second most spoken Semitic language next to Arabic, and it has a rich collection of documents ranging from historical to modern, from vellum manuscripts to printed paper, and from simple to complex layouts. Amharic is also widely spoken in countries such as Eritrea, the USA, Israel, Somalia, and Djibouti [1], [2], [3], [4].

Amharic has been the working language of the courts, of trade and everyday communications, and of the military since the late 12th century, and it remains the official language of Ethiopia today [5]. Since then, multiple documents containing religious and academic contents have been written in the Amharic script [6]. These documents are stored in different places, such as Ethiopian Orthodox Tewahdo Churches and public and academic libraries, in the form of hardcover books, and preserved in manual catalogs [7].

In the Amharic script, there are about 317 different symbols, including 238 core characters, 50 labialized characters, 9 punctuation marks, and 20 numerals, which are written and read, like English, from left to right [1], [8], [9]. All vowel and labialized characters in the Amharic script are derived, with a small change, from the 34 consonant characters.

As shown in Table I, the 1st column holds the base symbols with no explicit vowel indicator, usually called consonants. In the 2nd and 3rd columns, the corresponding consonants are modified by a projection half-way down the right leg and at the base of the right leg respectively. The 4th column has a short left leg, while the 5th column has a loop on the right leg. The 6th and 7th columns are less systematic, but some regularity appears on the left and right legs of the base characters. Due to these small modifications of the consonants, Amharic characters have similar shapes, which may make the task of recognition hard for machines as well as humans [3], [10].

Even though there is considerable regularity in the shape modification of characters across some columns, there are also unpredictable modification patterns in others. Some of them, such as the 3rd and 5th columns, are more consistent than others, such as the 2nd and 4th columns, while the 6th and 7th columns are completely inconsistent. These features are particularly interesting in research on character recognition, because a small change in the basic physical features may affect the orthographic identities of letters.

Numerous works in the area of Optical Character Recognition (OCR) and Document Image Analysis (DIA) have been done and widely used for decades to digitize various historical and modern documents [11], [12], [13]. Many of the well-known scripts have OCR systems with sufficiently high performance to be applied in industrial/commercial settings. However, OCR systems yield very good results only in a narrow domain and for very specific use cases. Thus, OCR is still considered a challenging task, and there are other indigenous scripts for which no well-developed OCR systems exist [1]. In the present, the growth of digitization and worldwide communications makes OCR systems for exotic languages, like Amharic, a very important task.

© 2020 by the author(s). Distributed under a Creative Commons CC BY license.



Table I. Shape formation of sample basic Amharic characters [3]: orders of consonant-vowel variants (34 × 7). Characters in the first column are consonants and the others are derived variants. Vowels are derived by adding diacritics to and/or removing part of the consonants, and the orthographic identity of each character varies across the row, as marked with the violet color. Sample row: መ mâ, ሙ mu, ሚ mi, ማ ma, ሜ mé, ም mi, ሞ mo.

In the literature, attempts made for Amharic script recognition so far are based on classical machine learning techniques, and they are very limited in addressing the issues of Amharic OCR. Moreover, these attempts have neither shown results on large datasets nor considered all possible characters used in the Amharic writing system. A recently published work [1] introduced an Amharic OCR database called ADOCR. We took a sample text-line image from the ADOCR database, whose word formation and character arrangement in a sample word are illustrated in Figure 1.

Figure 1. Sample Amharic text-line image. A word marked by violet color is composed of six individual Amharic characters, and the corresponding sound of each character is described with English letters.

RNNs trained with the CTC objective function have been employed for text-image recognition of multiple scripts. This is also a state-of-the-art approach and widely applied to sequence-to-sequence learning tasks to date. Recently, another sequence-to-sequence learning technique has emerged from the field of NMT, and it has often been shown to improve performance over the existing approaches. Therefore, this paper presents an attention-based encoder-decoder model for Amharic OCR.

Text-line image recognition has recently been widely treated as a sequence-to-sequence learning task, while traditional segmentation-based character recognition had been applied for a long time before. Later, LSTM-CTC based techniques were used for the recognition of multiple scripts, including the Amharic script, which impair the valuable spatial and structural information of text-line images. This work also aims to push the limits of such techniques using attention-based text-line image recognition based on neural networks. Unlike CTC, the attention model explicitly uses the history of the target sequence without any conditional independence assumptions.

In addition, attention enables the networks to potentially model language structures, rather than simply mapping an input to an output [14]. Moreover, encouraged by the recent attention given by researchers to Amharic document digitization, and inspired by the success of attention in neural machine translation [15], [16] and speech recognition [17], [18], we investigate models that can attend to a salient part of a text-image, in the context of the Amharic script, while generating a character.

As a continuation of our previous work [1], [3], which employed CTC as a cost function to train and tune the parameters of an LSTM network, this paper presents the first result using the concept of the attention mechanism, with the following additional contributions:
1) We propose an attention-based OCR framework for Amharic text-image recognition for the first time.
2) The proposed method is trained by an injection learning strategy, which allows the model to learn from its errors and reduces exposure bias, unlike the existing attention mechanisms that are usually trained by teacher forcing techniques.
3) Different from the existing attention-based encoder-decoder models that use the last hidden state of the encoder as the initial hidden state of the decoder, the proposed model uses independent and randomly initialized hidden states for both encoder and decoder.
4) The proposed model is designed by leveraging the architecture of the Seq2Seq framework, used in NMT, and then stacking CNN layers before the encoder network as a feature extractor. During training, the overall model components are treated as a unified framework.
5) We validate the proposed attention mechanism on the Amharic text-image recognition task and show the advantage of the attention mechanism against the methods employed for the recognition of Amharic text-images so far, through empirical analysis.

The rest of the paper is organized as follows: Section II reviews the relevant methods and related works. The proposed method and training strategies are described in Section III. Section IV presents all experiments, empirical analysis, and results obtained from the experiments. Finally, conclusions and future works are described in the last section.

II. RELATED WORK

The existing OCR models utilize either traditional or holistic techniques. Methods belonging to the first category were mainly applied before the introduction of deep learning and follow step-wise routines. By contrast, the holistic approach integrates the feature extraction and sequential translation steps in a unified framework that is trained end-to-end. Therefore, in this section, we review the research trends of Amharic OCR development and the existing state-of-the-art techniques that are applied to sequence-to-sequence learning tasks.

Even though OCR research for the Amharic script started in 1997 [19], it is still in its infancy and remains an open area of research. Since then, attempts have been made to develop Amharic OCR [8], [19], [20], [21] using different statistical machine learning techniques.

Recently, following the success of deep learning, other attempts have been made to develop models for Amharic OCR that achieved relatively promising results. Belay et al. [22] proposed a CNN for Amharic character image recognition, and a year later a factored convolutional neural network model [9] was also proposed for Amharic character image recognition. This new model was designed by leveraging the arrangement of Amharic characters in Fidel-Gebeta. It consists of two classifiers that share layers at the lower stage and have task-specific layers at their last stage (one as a column detector and the other as a row component detector), where both classifiers are trained jointly. A parallel work by Gondere et al. [23] was proposed for handwritten Amharic character image recognition. The work of Gondere is based on the architecture proposed by Belay, but it is mainly focused on handwritten Amharic characters.

The standard OCR tasks have been investigated based on RNNs [1], [3] and the CTC [24] objective function. For example, an LSTM network together with the CTC objective function has been proposed for Amharic text-line image recognition [1]. In this work, a benchmark dataset called ADOCR was introduced. A model for Amharic text-line image recognition was also proposed by stacking CNNs before LSTM networks [3] as a feature extractor, where training is done in an end-to-end fashion. However, CTC-based architectures are subject to inherent limitations, like strict monotonic input-output alignments and an output sequence length that is bound by the subsampled input length, while the attention-based sequence-to-sequence model is more flexible, suits the temporal nature of text, and can focus on the most relevant features of the input by incorporating attention mechanisms [15].

In the literature, most CTC-based architectures for text-image recognition were LSTM networks, in many cases bidirectional: a Bidirectional-LSTM architecture for online handwriting recognition [25], an LSTM-based online handwriting recognizer for 102 languages in 26 scripts [26], Connectionist Temporal Classification (CTC) combined with Bidirectional LSTM (BLSTM) for unconstrained online handwriting recognition [24], and a Multidimensional-LSTM for Chinese handwriting recognition [27].

Others, like [3], [28], [29], integrate CNNs for improved low-level feature extraction prior to the recurrent layers. This approach has been mainly applied to printed text-image recognition [30], handwriting recognition [31], and license plate recognition [32]. A convolutional recurrent neural network has also been employed for Japanese handwriting recognition [29] and Chinese handwritten text recognition [33], fully convolutional neural networks for the detection of watermarking regions [34] and handwritten text-line segmentation [35], and scattering feature maps integrated with convolutional neural networks for Malayalam handwritten character recognition [36].

The main challenge in sequence-to-sequence learning tasks is to find an appropriate alignment between input and output sequences of variable length. For text-image recognition, one needs to identify the correct characters at each time step without any prior knowledge about the alignment between the image pixels and the target characters. The two major current methods that overcome this problem are attention-based and CTC-based sequence-to-sequence learning approaches.

The CTC-based models compute a probability distribution over all possible output sequences, given an input sequence. They do so by dividing the input sequence into frames and emitting, for each frame, the likelihood of each character of the target alphabet. The probability distribution can be used to infer the actual output greedily, by taking the most likely character at each time step.
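To make the greedy CTC decoding just described concrete, the following is a minimal sketch (an illustration, not code from any of the cited works): take the most likely symbol per frame, collapse consecutive repeats, and drop blanks. The toy probabilities and the convention that index 0 is the blank are assumptions.

```python
import numpy as np

def ctc_greedy_decode(frame_probs, blank_id=0):
    """Greedy CTC decoding: most likely symbol per frame,
    collapse repeats, then remove blanks."""
    best = np.argmax(frame_probs, axis=-1)          # most likely symbol per frame
    collapsed = [s for i, s in enumerate(best) if i == 0 or s != best[i - 1]]
    return [s for s in collapsed if s != blank_id]  # drop CTC blanks

# Toy example: 5 frames over a 4-symbol alphabet (0 is the blank).
probs = np.array([[0.1, 0.8, 0.05, 0.05],
                  [0.1, 0.8, 0.05, 0.05],
                  [0.9, 0.03, 0.03, 0.04],
                  [0.1, 0.1, 0.7, 0.1],
                  [0.1, 0.1, 0.7, 0.1]])
print(ctc_greedy_decode(probs))  # -> [1, 2]
```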

Attention-based models are an alternative, more recent family of sequence-to-sequence architectures; the idea of the encoder-decoder framework they follow is to decouple the decoding from the feature extraction. The model consists of an encoder module, at the bottom layer, that reads and builds a feature representation of the input sequence, and a decoder module, at the top layer, which generates the output sequence one token at a time. The decoder uses an attention mechanism to gather context information and search for relevant parts of the encoded features.

Even though the attention mechanism was introduced to address the problem of long sequences in Neural Machine Translation (NMT), it has since become one of the most influential ideas in the deep learning community and is widely applied in many sequence-to-sequence learning tasks. Bahdanau et al. [15] and Luong et al. [16] proposed attention-based encoder-decoder models for machine translation.

Most works with an attention mechanism have so far focused on neural machine translation. However, researchers have recently applied attention to different research areas, and it has become popular and a choice of many researchers in the area of OCR. For example, Doetsch et al. [37] proposed an attention neural network on the output of a Bi-LSTM network operating on frames extracted from a text-line image with a sliding window, and Bluche [38] proposed a similar technique for end-to-end handwritten paragraph recognition. Attention has also been applied to recognizing handwritten texts [39], [40], characters in the wild [41], and handwritten mathematical expressions [42], and Chen et al. [43] proposed an adaptive embedding gate controlled attention module for scene text recognition.

Finally, our system is quite similar to the attention network architecture of Bahdanau et al. [15], the one proposed for neural machine translation. The main difference is that we apply attention to text-line image recognition, where the decoder network outputs character by character given the decoder history and the expected input from the attention mechanism. Further, CNN layers are integrated with the encoder network as a feature descriptor, and the decoder LSTM is initialized with a Keras default weight initializer, the Xavier uniform initializer [44], instead of the final state of the encoder LSTM network. The following section gives a detailed overview of the proposed attention-based encoder-decoder Amharic OCR model.

III. THE PROPOSED APPROACH

In this section, we elaborate on the proposed attention-based network for Amharic OCR. In the text-image recognition task, the raw data are 32 by 128 text-line images, which should be processed and encoded into a sequence of high-level features. To do so, as illustrated in Figure 2, our Amharic OCR model follows the standard encoder-decoder framework with an attention mechanism [15]. The model consists of three basic modules: an encoder module that combines a CNN as a generic feature extractor with recurrent neural network layers to introduce temporal context into the feature representation, a decoder module that utilizes a recurrent network layer to interpret those features, and a third module, an attention mechanism, that enables the decoder to focus on the most relevant encoded features at each decoding time step.

A. Encoder

Our encoder module is equipped with CNNs; thus, the segmented text-line images are converted into a sequence of visual feature vectors. The CNN layers integrated into the encoder network use pre-trained model weights adopted from the Amharic text-image recognizer proposed by Belay et al. [3]. Then two Bidirectional-LSTM layers, which read the sequence of convolutional features to encode the temporal context between them, are employed.

The Bidirectional-LSTM processes the sequence in opposite directions to encode both forward and backward dependencies and capture the natural relationship of texts. All the configurations and corresponding network parameters of the proposed encoder module are presented in Tables II and III, where the input text-line image size is 32 × 128 × 1.

Table II. Convolutional layers of the proposed model, adopted from our previous paper [3], and their corresponding parameter values. K-Size and F-Maps denote the kernel size and the number of feature maps respectively.

Network Layer        K-Size  Stride  F-Maps
Convolution          3×3     1×1     64
Max-Pooling          2×2     1×1     -
Convolution          3×3     1×1     128
Max-Pooling          2×1     1×1     -
Convolution          3×3     1×1     256
Convolution          3×3     1×1     256
Max-Pooling          2×1     1×1     -
Convolution          3×3     1×1     256
BatchNormalization   -       -       -
Convolution          3×3     1×1     256
BatchNormalization   -       -       -
Max-Pooling          2×1     1×1     -
Convolution          3×3     1×1     512

Table III. The recurrent network layers of the proposed encoder module with their corresponding parameter values.

Network Layer (Type)   Hidden Layer Size
BLSTM                  128
BLSTM                  128

As shown in Figure 2, the encoder network takes an Amharic text-line image as input and encapsulates the information as internal state vectors, which are later used by the decoder. The hidden state and cell state (h0 and c0) of the encoder are initialized randomly. The dimensions of both states are the same as the number of units in the LSTM cell, which is 128 in our case. The final states of the encoder (hTx and cTx in this case) hold the relevant information of the whole input Amharic text-line image.

Suppose the input text-line image I yields a sequence of length Tx; the encoder CNN processes the input image I and transforms it into an intermediate-level feature map X, which can be thought of as a sequence of column vectors X = (x1, x2, ..., xTx). This sequence is then processed by the two Bidirectional-LSTM layers, and we obtain the combined state sequence of the final encoded feature map, H = (h1, h2, ..., hTx), from the forward and backward hidden states.

B. Decoder

The decoder module generates the target character sequence present in the image, given the current image summary and state vector. The proposed model uses a unidirectional LSTM network whose initial hidden state is randomly initialized, as done for our encoder. The network parameters and their corresponding values are shown in Table IV. At each time step t, the decoder computes a probability distribution over the possible characters and predicts the most probable character yt, conditioned on its own previous predictions (y0, y1, ..., yt−1) and a time-dependent context vector ct, which contains information from the encoded features. Formally, for an output sequence of length Ty, it defines a probability over the output sequence Y = (y1, y2, ..., yTy) by modeling each conditional as

p(y_t \mid y_0, y_1, \ldots, y_{t-1}, c_t) = \mathrm{softmax}(f(y_{t-1}, s_{t-1}, c_t))

where f denotes the decoder network and s_{t-1} is the previous LSTM hidden state.
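For illustration, the following is a minimal Keras sketch of an encoder along the lines of Tables II and III, plus the decoder building blocks of Table IV. It is a sketch under stated assumptions, not the authors' released code: activations and padding are not given in the tables and are assumed here, and the pooling strides are taken equal to the pool sizes (a common CRNN choice) rather than the 1×1 strides listed in Table II.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_encoder(img_height=32, img_width=128):
    """CNN stack roughly following Table II, then the two 128-unit
    Bidirectional-LSTM layers of Table III."""
    inp = layers.Input(shape=(img_height, img_width, 1))
    x = inp
    # (filters, pool size or None, batch-norm?) per Table II, top to bottom.
    for filters, pool, bn in [(64, (2, 2), False), (128, (2, 1), False),
                              (256, None, False), (256, (2, 1), False),
                              (256, None, True), (256, (2, 1), True),
                              (512, None, False)]:
        x = layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        if bn:
            x = layers.BatchNormalization()(x)
        if pool:
            x = layers.MaxPooling2D(pool)(x)
    # Use the width axis as the time axis Tx and fold the remaining
    # height into the feature dimension.
    x = layers.Permute((2, 1, 3))(x)             # (batch, W=64, H=2, C=512)
    x = layers.Reshape((img_width // 2, -1))(x)  # (batch, Tx=64, 1024)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    h = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    return tf.keras.Model(inp, h, name="encoder")

# Decoder building blocks per Table IV: a 128-unit unidirectional LSTM
# applied step by step, and a softmax over the 281 character classes.
decoder_lstm = layers.LSTM(128, return_state=True)
classifier = layers.Dense(281, activation="softmax")

encoder = build_encoder()
print(encoder.output_shape)  # (None, 64, 256)
```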

Figure 2. Attention-based encoder-decoder model for Amharic OCR. The encoder converts an input Amharic text-line image I into a sequence of feature vectors h. The decoder generates the output sequence y = (y0, y1, ..., yTy) one character at a time, where y0 is a dummy start symbol generated before the actual target character sequence. At each time step t, the decoder uses an attention mechanism to produce a context vector cj based on the encoded feature vectors and the time-dependent decoder hidden state sj−1. A concatenation of the context vector at time step t and the output y at time step t−1 serves as the decoder's next input. During concatenation, to obtain matching dimensions, feature vectors are repeated using RepeatVector, a function in the Keras API.

Table IV. The decoder recurrent network layers and their parameter values. The input of the decoder LSTM, at each time step t, is a concatenation of the context vector at time t and the previous predicted output at time step t−1.

Network Layer (Type)   Hidden Layer Size
LSTM                   128
FC + Soft-Max          No. of classes = 281

Figure 3. A typical attention model architecture. This network module computes the output c from the initial state s0 and the given sequence h = (h1, h2, ..., hn). The tanh layer can be replaced by any other network that is capable of producing an output from the given s0 and hi, where the input length is equal to the output length, which is n in this case.

C. Attention Mechanism
mechanism has the power to modify the context vector at

At each step of decoding, the attention layer is introduced to focus on the most relevant part of the encoded feature representation. A typical attention model architecture is shown in Figure 3. For each particular input sequence, the attention mechanism can modify the context vector at each time step, based on the similarity of the decoder hidden state s_{j-1} with the encoded features h. The context vector c_j is computed for each target character y_j as given in Equation (1),

c_j = \sum_{i=1}^{T_x} w_{j,i} h_i \quad (1)

where w_{j,i} is an attention weight, computed by soft-maxing its corresponding alignment score a_{j,i} using Equation (2),

w_{j,i} = \frac{\exp(a_{j,i})}{\sum_{k=1}^{T_x} \exp(a_{j,k})} \quad (2)

where a_{j,i} is the alignment score of the concatenated annotation of s_{j-1} and h_i, and it can be computed using Equation (3),

a_{j,i} = f(g(h_i, s_{j-1})), \quad \text{for } i = 1, \ldots, T_x \quad (3)

The functions g and f in Equation (3) are feed-forward neural networks (with relu and tanh activations respectively, as detailed in step 3 below) that are stacked consecutively. The intuition of the feed-forward networks is to let the model learn the alignment weights together with the transcription while training all the model layers.

The overall steps followed in our attention-based network can be presented as follows (a minimal code sketch of steps 3 to 6 is given after this list):
1) Initialize the encoder hidden states: the initial encoder hidden states are small random numbers; the encoder then produces hidden states for each element in the input sequence.
2) Initialize the decoder hidden states: similar to the encoder hidden states, the decoder's initial hidden states are initialized randomly.
3) Compute the alignment scores: an alignment score is calculated between the previous decoder hidden state, at time step t−1, and each of the encoder's hidden states. The alignment score function used in this paper is similar to the additive/concatenation alignment score function in [15]. The encoder hidden states and the previous decoder hidden state are first concatenated and then passed through two consecutive feed-forward neural networks with relu and tanh activation functions respectively; the alignment scores are computed using Equation (3).
4) Softmax the alignment scores: each computed alignment score runs through a softmax layer, as in Equation (2).
5) Compute the context vector: the encoder hidden states and their respective soft-maxed alignment scores are multiplied and then summed up to form the context vector, using Equation (1).
6) Decode the output: the context vector is concatenated with the previous decoder output and fed into the decoder, together with the previous decoder hidden state, to emit a character at that time step. Steps 3 to 6 are repeated for each of the decoder's Ty time steps.
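Below is a minimal sketch of steps 3 to 6, assuming Tx = 64 encoder time steps, 256-dimensional encoder features, and a 128-dimensional decoder state; FC1 and FC2 play the roles of g and f in Equation (3), with relu and tanh activations as in step 3. This is an illustrative reimplementation, not the authors' code.

```python
import tensorflow as tf
from tensorflow.keras import layers

class ConcatAttention(layers.Layer):
    """Additive/concatenation attention: FC1 (relu) and FC2 (tanh) score
    each encoder state against the previous decoder state (Eq. (3)), the
    scores are softmaxed over the Tx positions (Eq. (2)), and the context
    vector is the weighted sum of the encoder states (Eq. (1))."""
    def __init__(self, units=64):
        super().__init__()
        self.fc1 = layers.Dense(units, activation="relu")  # g in Eq. (3)
        self.fc2 = layers.Dense(1, activation="tanh")      # f in Eq. (3)

    def call(self, h, s_prev):
        tx = tf.shape(h)[1]
        # Step 3: repeat s_{j-1} along Tx (RepeatVector), concatenate, score.
        s_rep = tf.repeat(s_prev[:, tf.newaxis, :], tx, axis=1)
        scores = self.fc2(self.fc1(tf.concat([h, s_rep], axis=-1)))  # (B, Tx, 1)
        # Step 4: softmax the alignment scores over the Tx positions.
        weights = tf.nn.softmax(scores, axis=1)
        # Step 5: weighted sum of encoder states gives the context vector.
        context = tf.reduce_sum(weights * h, axis=1)  # (B, 256)
        return context, weights

# Toy usage with the assumed sizes.
attn = ConcatAttention()
h = tf.random.normal([2, 64, 256])  # encoded feature sequence from the encoder
s0 = tf.random.normal([2, 128])     # randomly initialized decoder state
context, w = attn(h, s0)            # context: (2, 256); w sums to 1 over Tx
```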

Figure 5. Training and validation losses of model training with different network settings: (a) CTC loss with an LSTM-CTC network setting [1]; (b) CTC loss with a CNN-LSTM-CTC setting [3]; (c) CE loss of the proposed model.

IV. EXPERIMENTAL SETUP

In this section, the database used for training and evaluation, the training procedure, and the results obtained from our experiments are presented.

A. Database

There are few databases used in the various works on Amharic OCR reported in the literature. As reported in [45], the authors considered 5,172 images of the most frequently used Amharic characters. A later work by Million et al. [8] uses 76,800 character images with different font types and sizes, belonging to 231 classes. Other researchers' works on Amharic OCR [46], [47], [19] report that they used private databases, but none of them made their datasets publicly available. Therefore, the shortage of datasets has continued to be the main challenge and one of the limiting factors in developing reliable OCR systems for the Amharic script to date. To train and evaluate the performance of our Amharic OCR model, we use the same OCR database employed in the work of Belay et al. [1], [3], which is freely available at http://www.dfki.uni-kl.de/~belay/.

The ADOCR database consists of both character-level and text-line-level images. In this paper, we only use the text-line image database, which consists of a total of 337,337 Amharic text-line images with a size of 48 × 128 pixels, where 40,929 are printed text-line images written with the Power Geez font, and 197,484 and 98,924 images are synthetically generated text-lines with the Power Geez and Visual Geez fonts respectively. Sample text-line images from the test set and the details of the text-line images in the ADOCR database are presented in Figure 4 and Table V respectively.

Figure 4. Sample text-line images from the test sets of the ADOCR database. The text-line images have a size of 48 by 128 pixels: (a) printed text-line images written with the Power Geez font type; (b) synthetically generated text-line images with the Power Geez font type; (c) synthetically generated text-line images with the Visual Geez font type.

Table V. The details of the ADOCR dataset. PG and VG denote the Power Ge'ez and Visual Ge'ez font types respectively.

                         Printed    Synthetic
Font type                PG         PG         VG
Number of samples        40,929     197,484    98,924
No. of test samples      2,907      9,245      6,479
No. of training samples  38,022     188,239    92,445
No. of unique chars.     280        261        210

B. Training procedure

The encoder and decoder RNNs have 128 hidden units each. The encoder RNN module consists of forward and backward RNNs, each having 128 hidden units, that are stacked on top of the CNN layers. The decoder also has 128 hidden units. In both cases, we use multilayer perceptron networks, two in the encoder part and one in the decoder part, with a single soft-max layer to compute the conditional probability of each target character. The whole architecture, depicted in Figure 2, computes a fully differentiable function whose parameters can be trained end-to-end in a supervised manner.

The decoder is trained to generate the output based on the information gathered by the encoder, taking a prediction from randomly initialized hidden units as the first input to generate the first Amharic character. The input at each time step is the previously predicted output, which we call the learning-by-injection technique. The loss is calculated on the predicted outputs from each time step, and the errors are backpropagated through time to update the parameters of the model. The optimized cost is the negative log-likelihood of the correct transcription, given as

\mathrm{Loss}(I, y) = -\sum_{t=1}^{T_y} \log p(y_t \mid I) \quad (4)

where I is the image, y = (y1, y2, ..., yTy) is the target character sequence, and p(yt | I) is the probability of the decoder network output yt given the image I.

An LSTM network is employed for both the decoder and the encoder module, where the encoder module uses two bidirectional LSTMs and the decoder module uses a unidirectional LSTM. The attention mechanism in our model computes the attention weights and outputs the context vector once it has taken all hidden states of the encoder LSTM and the previous hidden state of the decoder LSTM. We use a general attention strategy that attends to the entire input state space with all the encoder hidden states. It also manages and quantifies the interdependence between the input and output elements. Our attention module consists of two consecutively stacked feed-forward neural networks with relu and tanh activation functions respectively.

In the entire OCR model, the attention module is called until it reaches the maximum target sequence length, Ty times, to give back the computed context vector that will be used by the decoder LSTM. All yt copies have the same weights, implemented as layers with shared weights. The networks are trained with the RMSProp optimizer [48], [49] and a batch size of 128 for 25 epochs. Since our model is trained to minimize the categorical cross-entropy loss, each character in the ground truth is changed to a one-hot encoding, and the decoder module generates the one-hot encoded equivalent of a character. In addition, following research works done in the area of OCR, all the images from the ADOCR database are resized to 32 × 128 pixels.

The proposed model is implemented with the Keras Application Program Interface (API) [50] on a TensorFlow backend [51].
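The learning-by-injection decoding loop described above can be sketched as follows, reusing the ConcatAttention layer from the earlier sketch (the decoder modules are redefined here for self-containment). The maximum length TY = 32 is an assumption; the states are randomly initialized, as described in the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

NUM_CLASSES = 281  # character classes, as in Table IV
TY = 32            # assumed maximum target sequence length

decoder_lstm = layers.LSTM(128, return_state=True)
classifier = layers.Dense(NUM_CLASSES, activation="softmax")
cce = tf.keras.losses.CategoricalCrossentropy()

def decode_with_injection(h, y_true_onehot, attn):
    """One pass over a batch: at each step the decoder input is the
    concatenation of the attention context and the decoder's own
    previous prediction (injection), not the ground truth."""
    batch = tf.shape(h)[0]
    s = tf.random.normal([batch, 128], stddev=0.01)  # random initial states
    c = tf.random.normal([batch, 128], stddev=0.01)
    y_prev = tf.zeros([batch, NUM_CLASSES])          # dummy start symbol y0
    loss = 0.0
    for t in range(TY):
        context, _ = attn(h, s)                      # steps 3-5 of Sec. III-C
        x_t = tf.concat([context, y_prev], axis=-1)[:, tf.newaxis, :]
        out, s, c = decoder_lstm(x_t, initial_state=[s, c])
        y_t = classifier(out)                        # distribution over classes
        loss += cce(y_true_onehot[:, t], y_t)        # Eq. (4), cross-entropy form
        # Injection: feed back the model's own (hard) prediction.
        y_prev = tf.one_hot(tf.argmax(y_t, axis=-1), NUM_CLASSES)
    return loss / TY
```

Training would then minimize this loss with an RMSProp optimizer, as in the paper; with teacher forcing, one would instead feed y_true_onehot[:, t] back at each step.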

The learning losses of the proposed model and of other Amharic OCR models trained with different network settings and loss functions are depicted in Figure 5. Compared to [1], the proposed model converges within a smaller number of epochs, and even though it takes 15 more epochs to converge than [3], it takes less time to train. The proposed model incorporates diverse network layers and parameter settings in a single unified framework trained in an end-to-end fashion, which is not always optimal during training [52]. Therefore, the proposed model could be further assessed, and the training time may be further improved. The next section presents the experimental results recorded during the OCR model evaluation and sample text-line images with their corresponding predicted texts.

C. Results

The performance of the model is measured using the Character Error Rate (CER) [1], where CER is computed using Equation (5); we achieve CERs of 1.17% and 5.21% on the ADOCR test datasets that are synthetically generated with the Visual Geez and Power Geez fonts respectively. We also carried out experiments on a printed Amharic text-line image dataset from the ADOCR test sets and achieved a promising result with a CER of 1.54%.

\mathrm{CER}(P, T) = \frac{1}{c} \Big( \sum_{n \in P,\, m \in T} D(n, m) \Big) \times 100 \quad (5)

where c is the total number of target character labels in the ground truth, P and T are the predicted and ground-truth labels, and D(n, m) is the edit distance [53] between the sequences n and m.

Figure 6. Sample predicted text-line images from the test sets of the ADOCR database [1], with their corresponding ground-truth (GT) texts and model predictions. Examples (GT → prediction): በኩል የተለመደ ትብብር → በኩል የተለመደ ትብር; ዳይሬክተር ለኮምፒዩቲንግ ፋኩልቲ → ዳይሬክት ለኮምፒቲንንግ ፋኩልቲ; በአሁኑ ጊዜ ህንፃውንም → በአሁኑ ጊዜ ህንዳውን; ዕውን ለማድርግ ሌት ተቀን → ዕውን ለማድርግ ሌት ተተቀን; እንዲያገኝ ምቹ ሁኔታን → እንዲያገኝ ምቹ ሁኔታን.

Sample text-line images that are wrongly recognized during the evaluation of our Amharic OCR model are depicted in Figure 6. Characters marked by colored boxes are wrong predictions (deletion, substitution, or insertion errors). Characters marked by blue boxes, such as ብ (b) and ር (r) in the first and second text-line images respectively, are sample deleted characters. Other characters, such as ን (n) and ተ (te) in the second and fourth predicted texts, marked with green boxes, are insertion errors, while the character ዳ in the third predicted text, marked by a red box, is one of the substitution errors recorded during experimentation.

The other type of error that affects the recognition performance of our model is characters missing either from the ground-truth or from the text-line image itself. For example, the character ም (m), marked by a violet box in the third text-line image's ground-truth, is one of the characters missed during text-line image generation. Even though all visible characters in the text-line image are predicted correctly, since the CER is computed between the ground-truth texts and the predicted texts, such errors are still counted as deletion errors. Besides the wrongly predicted text-line images, the fifth text-line image in Figure 6 is a sample correctly predicted text-line image from the ADOCR test sets.

The recognition performance of our model is compared against the previous models' recognition performance on the test set of the ADOCR database. The comparisons between the proposed approach and other attempts made on the ADOCR database [1] are listed in Table VI. The proposed model shows better results on the printed dataset, while the results on the synthetic datasets are comparable. Most of the characters are missed at the beginning and/or end of the text-line images in the synthetic data during text-line image generation, while there are a few misalignments in the ground-truth of the printed text-line images that reduce the recognition performance of our model. Using better annotation tools or careful manual annotation, specifically if the characters missed during text-line image segmentation or synthetic data generation are annotated/aligned, the recognition performance of the OCR model could be enhanced.
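For reference, Equation (5) can be implemented directly. The sketch below uses a plain dynamic-programming edit distance for D(n, m) (any standard implementation would do; this one is our illustration, not the paper's evaluation script).

```python
def edit_distance(a, b):
    """Levenshtein edit distance between sequences a and b (single-row DP)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (ca != cb))  # substitution
    return dp[-1]

def cer(predictions, ground_truths):
    """Eq. (5): summed edit distance over the total number of
    ground-truth characters, as a percentage."""
    total_chars = sum(len(t) for t in ground_truths)
    total_dist = sum(edit_distance(p, t)
                     for p, t in zip(predictions, ground_truths))
    return 100.0 * total_dist / total_chars

# First pair from Figure 6: one deletion in four GT characters -> 25.0
print(cer(["ትብር"], ["ትብብር"]))
```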

Table VI. Comparison of test results (CER in %).

Methods              #train-set  #test-set  Image type  Font          CER
BLSTM-CTC [54]*      96,000      12 pages   Printed     -             2.12%
BLSTM-CTC [1]        38,022      2,907      Printed     Power Ge'ez   8.54%
BLSTM-CTC [1]        188,239     9,245      Synthetic   Power Ge'ez   4.24%
BLSTM-CTC [1]        92,445      6,479      Synthetic   Visual Ge'ez  2.28%
CNN-BLSTM-CTC [3]    38,022      2,907      Printed     Power Ge'ez   1.56%
CNN-BLSTM-CTC [3]    188,239     9,245      Synthetic   Power Ge'ez   3.73%
CNN-BLSTM-CTC [3]    92,445      6,479      Synthetic   Visual Ge'ez  1.05%
Ours                 38,022      2,907      Printed     Power Ge'ez   1.54%
Ours                 188,239     9,245      Synthetic   Power Ge'ez   5.21%
Ours                 92,445      6,479      Synthetic   Visual Ge'ez  1.17%
* Denotes a method tested on a different dataset.

V. CONCLUSIONS

This paper successfully approaches Amharic script recognition with an attention-based sequence-to-sequence architecture. The proposed model integrates a convolutional feature extractor with a recurrent neural network to encode both the visual information and the temporal context in the input image, while a separate recurrent neural network is employed to decode the actual Amharic character sequence. Overall, we obtain results that are competitive with the state-of-the-art recognition performance on the ADOCR datasets. As we observed in the empirical results, the attention-based encoder-decoder model becomes poor when the sequence length increases. In most cases, the first characters are always correctly predicted, while the remaining errors follow no pattern; thus, longer input sequences are hard to learn in the initial training stage. Such character errors are not observed in the LSTM-CTC based networks. In addition, we have observed a coverage problem, which leads to over-translation or under-translation. Hence, to minimize errors on longer input sequences and handle the overall alignment information, the attention-based encoder-decoder model should be further enhanced.

ACKNOWLEDGEMENTS

The first author was partially supported by a DAAD scholarship (funding program no. 57375975), the Department of Computer Science, Technical University of Kaiserslautern, Germany, and the Bahir Dar Institute of Technology, Ethiopia. This research was carried out at the Augmented Vision lab at DFKI, Kaiserslautern, Germany.

REFERENCES

[1] B. Belay, T. Habtegebirial, M. Liwicki, G. Belay, and D. Stricker, "Amharic text image recognition: Database, algorithm, and analysis," in 2019 International Conference on Document Analysis and Recognition (ICDAR), pp. 1268-1273, IEEE, 2019.
[2] G. T. Mekuria and G. T. Mekuria, "Amharic text document summarization using parser," International Journal of Pure and Applied Mathematics, vol. 118, no. 24, 2018.
[3] B. Belay, T. Habtegebrial, M. Meshesha, M. Liwicki, G. Belay, and D. Stricker, "Amharic OCR: An end-to-end learning," Applied Sciences, vol. 10, no. 3, p. 1117, 2020.
[4] Y. Jemaneh, "Will Amharic be AU's lingua franca." https://www.press.et/english/?p=2654#l, 2019.
[5] A. Teferra, "Amharic: Political and social effects on English loan words," Multilingual Matters, vol. 140, p. 164, 2008.
[6] R. Meyer, "Amharic as lingua franca in Ethiopia," Lissan: Journal of African Languages and Linguistics, vol. 20, no. 1/2, pp. 117-132, 2006.
[7] A. Wion, "The national archives and library of Ethiopia," in Six Years of Ethio-French Cooperation (2001-2006), 2007.
[8] M. Meshesha and C. Jawahar, "Optical character recognition of Amharic documents," African Journal of Information & Communication Technology, vol. 3, no. 2, 2007.
[9] B. Belay, T. Habtegebrial, M. Liwicki, G. Belay, and D. Stricker, "Factored convolutional neural network for Amharic character image recognition," in 2019 IEEE International Conference on Image Processing (ICIP), pp. 2906-2910, IEEE, 2019.
[10] T. Bloor, "The Ethiopic writing system: a profile," Journal of the Simplified Spelling Society, vol. 19, no. 2, pp. 30-36, 1995.
[11] T. M. Breuel, A. Ul-Hasan, M. A. Al-Azawi, and F. Shafait, "High-performance OCR for printed English and Fraktur using LSTM networks," in Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pp. 683-687, IEEE, 2013.
[12] D. S. Maitra, U. Bhattacharya, and S. K. Parui, "CNN based common approach to handwritten character recognition of multiple scripts," in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on, pp. 1021-1025, IEEE, 2015.
[13] M. Mondal, P. Mondal, N. Saha, and P. Chattopadhyay, "Automatic number plate recognition using CNN based self synthesized feature learning," in Calcutta Conference (CALCON), 2017 IEEE, pp. 378-381, IEEE, 2017.
[14] K. Cho, A. Courville, and Y. Bengio, "Describing multimedia content using attention-based encoder-decoder networks," IEEE Transactions on Multimedia, vol. 17, no. 11, pp. 1875-1886, 2015.
[15] D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473, 2014.
[16] M.-T. Luong, H. Pham, and C. D. Manning, "Effective approaches to attention-based neural machine translation," arXiv preprint arXiv:1508.04025, 2015.
[17] A. Das, J. Li, G. Ye, R. Zhao, and Y. Gong, "Advancing acoustic-to-word CTC model with attention and mixed-units," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 12, pp. 1880-1892, 2019.
[18] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240-1253, 2017.
[19] W. Alemu, "The application of OCR techniques to the Amharic script," MSc thesis, Addis Ababa University Faculty of Informatics, 1997.
[20] J. Cowell and F. Hussain, "Amharic character recognition using a fast signature based algorithm," in Information Visualization, 2003. IV 2003. Proceedings. Seventh International Conference on, pp. 384-389, IEEE, 2003.
[21] Y. Assabie and J. Bigun, "HMM-based handwritten Amharic word recognition with feature concatenation," in 2009 10th International Conference on Document Analysis and Recognition, pp. 961-965, IEEE, 2009.
[22] B. Belay, T. Habtegebrial, and D. Stricker, "Amharic character image recognition," in 2018 IEEE 18th International Conference on Communication Technology (ICCT), pp. 1179-1182, IEEE, 2018.
[23] M. S. Gondere, L. Schmidt-Thieme, A. S. Boltena, and H. S. Jomaa, "Handwritten Amharic character recognition using a convolutional neural network," arXiv preprint arXiv:1909.12943, 2019.
[24] A. Graves, M. Liwicki, H. Bunke, J. Schmidhuber, and S. Fernández, "Unconstrained on-line handwriting recognition with recurrent neural networks," in Advances in Neural Information Processing Systems, pp. 577-584, 2008.
[25] M. Liwicki, A. Graves, S. Fernàndez, H. Bunke, and J. Schmidhuber, "A novel approach to on-line handwriting recognition based on bidirectional long short-term memory networks," in Proceedings of the 9th International Conference on Document Analysis and Recognition, ICDAR 2007, 2007.
[26] V. Carbune, P. Gonnet, T. Deselaers, H. A. Rowley, A. Daryin, M. Calvo, L.-L. Wang, D. Keysers, S. Feuz, and P. Gervais, "Fast multi-language LSTM-based online handwriting recognition," International Journal on Document Analysis and Recognition (IJDAR), pp. 1-14, 2020.
[27] Y.-C. Wu, F. Yin, Z. Chen, and C.-L. Liu, "Handwritten Chinese text recognition using separable multi-dimensional recurrent neural network," in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 79-84, IEEE, 2017.
[28] T. M. Breuel, "High performance text recognition using a hybrid convolutional-LSTM implementation," in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 1, pp. 11-16, IEEE, 2017.
[29] N.-T. Ly, C.-T. Nguyen, K.-C. Nguyen, and M. Nakagawa, "Deep convolutional recurrent network for segmentation-free offline handwritten Japanese text recognition," in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), vol. 7, pp. 5-9, IEEE, 2017.
[30] R. Ghosh, C. Vamshi, and P. Kumar, "RNN based online handwritten word recognition in Devanagari and Bengali scripts using horizontal zoning," Pattern Recognition, vol. 92, pp. 203-218, 2019.
[31] A. Yuan, G. Bai, L. Jiao, and Y. Liu, "Offline handwritten English character recognition based on convolutional neural network," in 2012 10th IAPR International Workshop on Document Analysis Systems, pp. 125-129, IEEE, 2012.
[32] P. Shivakumara, D. Tang, M. Asadzadehkaljahi, T. Lu, U. Pal, and M. H. Anisi, "CNN-RNN based method for license plate recognition," CAAI Transactions on Intelligence Technology, vol. 3, no. 3, pp. 169-175, 2018.
[33] R. Messina and J. Louradour, "Segmentation-free handwritten Chinese text recognition with LSTM-RNN," in 2015 13th International Conference on Document Analysis and Recognition (ICDAR), pp. 171-175, IEEE, 2015.
[34] V. L. Cu, T. Nguyen, J.-C. Burie, and J.-M. Ogier, "A robust watermarking approach for security issue of binary documents using fully convolutional networks," International Journal on Document Analysis and Recognition (IJDAR), 2020.
[35] G. Renton, Y. Soullard, C. Chatelain, S. Adam, C. Kermorvant, and T. Paquet, "Fully convolutional network with dilated convolutions for handwritten text line segmentation," International Journal on Document Analysis and Recognition (IJDAR), vol. 21, no. 3, pp. 177-186, 2018.
[36] K. Manjusha, M. A. Kumar, and K. Soman, "Integrating scattering feature maps with convolutional neural networks for Malayalam handwritten character recognition," International Journal on Document Analysis and Recognition (IJDAR), vol. 21, no. 3, pp. 187-198, 2018.
[37] P. Doetsch, A. Zeyer, and H. Ney, "Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition," in 2016 15th International Conference on Frontiers in Handwriting Recognition (ICFHR), pp. 361-366, IEEE, 2016.
[38] T. Bluche, "Joint line segmentation and transcription for end-to-end handwritten paragraph recognition," in Advances in Neural Information Processing Systems, pp. 838-846, 2016.
[39] J. Poulos and R. Valle, "Character-based handwritten text transcription with attention networks," arXiv preprint arXiv:1712.04046, 2017.
[40] A. Chowdhury and L. Vig, "An efficient end-to-end neural model for handwritten text recognition," arXiv preprint arXiv:1807.07965, 2018.
[41] C.-Y. Lee and S. Osindero, "Recursive recurrent nets with attention modeling for OCR in the wild," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2231-2239, 2016.
[42] J. Zhang, J. Du, and L. Dai, "Track, attend, and parse (TAP): An end-to-end framework for online handwritten mathematical expression recognition," IEEE Transactions on Multimedia, vol. 21, no. 1, pp. 221-233, 2018.
[43] X. Chen, T. Wang, Y. Zhu, L. Jin, and C. Luo, "Adaptive embedding gate for attention-based scene text recognition," Neurocomputing, vol. 381, pp. 261-271, 2020.
[44] X. Glorot and Y. Bengio, "Understanding the difficulty of training deep feedforward neural networks," in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249-256, 2010.
[45] D. Teferi, "Optical character recognition of typewritten Amharic text," Master's thesis, School of Information Studies for Africa, Addis Ababa, 1999.
[46] Y. Assabie, "Optical character recognition of Amharic text: an integrated approach," Master's thesis, Addis Ababa University, Addis Ababa, 2002.
[47] B. Y. Reta, D. Rana, and G. V. Bhalerao, "Amharic handwritten character recognition using combined features and support vector machine," in 2018 2nd International Conference on Trends in Electronics and Informatics (ICOEI), pp. 265-270, IEEE, 2018.
[48] Y. Bengio and M. CA, "RMSProp and equilibrated adaptive learning rates for nonconvex optimization," CoRR abs/1502.04390, 2015.
[49] T. Tieleman and G. Hinton, "Lecture 6.5-RMSProp: Divide the gradient by a running average of its recent magnitude," COURSERA: Neural Networks for Machine Learning, vol. 4, no. 2, pp. 26-31, 2012.
[50] F. Chollet, "Introduction to Keras," March 9th, 2018.
[51] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al., "TensorFlow: A system for large-scale machine learning," in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp. 265-283, 2016.
[52] T. Glasmachers, "Limits of end-to-end learning," arXiv preprint arXiv:1704.08305, 2017.
[53] G. Navarro, "A guided tour to approximate string matching," ACM Computing Surveys (CSUR), vol. 33, no. 1, pp. 31-88, 2001.
[54] D. Addis, C.-M. Liu, and V.-D. Ta, "Printed Ethiopic script recognition by using LSTM networks," in 2018 International Conference on System Science and Engineering (ICSSE), pp. 1-6, IEEE, 2018.
