Extraction of Information From Handwriting Using Optical Character Recognition and Neural Networks
Abstract—Image Processing is a vital tool when one is dealing with several images and wishes to perform several complex actions on the same. With advances in technology, one can now compress, manipulate, and extract required information from any image one wants to. One such application of image processing is detecting handwritten text and converting it to a digital text format. The main objective is to bridge the gap between the actual bit of paper and the digital world; in doing so, one can operate on the digital data much faster than on the actual data. Hence, in this paper, we aim to implement the detection of handwritten text via Optical Character Recognition (OCR). The entire paper is implemented on TensorFlow. This research work has also analyzed various results and taken an appropriate dataset to train the model. Further, the importance of this paper lies in the fact that it can facilitate and open various unexplored avenues. The key novelty of the paper lies in the fact that the data-set used is comprehensive, which helps us to produce better results. In addition, the paper successfully analyzes handwritten scripts and extracts them in digital form. Analyzing the text can help combat forgery, understand certain temperaments of the person writing the text, and so on. Coupled with this, this paper has successfully implemented an improved version compared to the pre-existing solutions by using the convergence of convolutional neural networks (CNN) and recurrent neural networks (RNN).

Index Terms—Image processing, handwriting detection, optical character recognition (OCR), TensorFlow, convolutional neural networks (CNN), recurrent neural network (RNN)

I. INTRODUCTION

Humans have constantly been evolving and working towards making their lives better. Technology forms one of those aspects, wherein humans continuously make innovations and advancements to improve the user experience and perform complex tasks in a very short span of time. Coupled with this, internet penetration has increased by leaps and bounds. Since the inception of the World Wide Web (WWW), the number of users of the internet has been increasing at a striking rate. Commensurate with this increase, a lot of data has been digitized. This digitization has enabled a seamless transmission of data in various forms. This further enables us to extract a ton of information both efficiently and in a very short span of time. When one has digital data one can manipulate it according to one's requirements and can arrive at results, which earlier used to take a lot of time, in a matter of seconds. Converting handwriting to digital data falls into this category. This conversion opens a myriad of avenues for us and can have its own wide range of applications.

It is important for the converted data, i.e. in most cases digital text, to be in a palpable and understandable format for the user to be able to make full use of it. Hence, this paper converts it into a digital text which is fairly easy to comprehend. Using optical character recognition (OCR), it aims to achieve this task. A Neural Network (NN) model is devised to be trained on the dataset. This neural network model consists of various layers, as discussed in detail afterward. The image of the word acts as the input to the entire model and passes through the several layers, eventually to come out as digital text data. Since the data-set chosen is a fairly exhaustive one, the training will also be fairly sufficient to keep the accuracy of the model satisfactory. Although this is an assumption, we have strengthened the proposition by bolstering the work in this paper with some performance metrics. This helps us to deduce the exact accuracy of the model and hence indicates certain areas for further research. The speed of computing the same is also analyzed and kept in mind for comparison.

II. LITERATURE SURVEY

Optical character recognition (OCR) is a broad domain of research in soft computing, artificial intelligence (AI), pattern recognition (PR), and computer vision. OCR is a general technique of digitizing pictures of handwritten texts or printed documents, so that they can be electronically amended, stored, and searched more efficiently and correctly. According to [1], there are two main types of OCR, one of them being offline and the other being online OCR. Both of these types differ mainly in the input of images. In the offline mode, the input is basically static information (via images), whereas in the online mode the information is obtained via real-time writing objects. Online methods gain the position
978-1-7281-6387-1/20/$31.00 ©2020 IEEE 1328
Authorized licensed use limited to: Carleton University. Downloaded on May 30,2021 at 05:16:37 UTC from IEEE Xplore. Restrictions apply.
Fourth International Conference on Electronics, Communication and Aerospace Technology (ICECA-2020)
IEEE Xplore Part Number: CFP20J88-ART; ISBN: 978-1-7281-6387-1
of the pen as a function of time straight from the interface. This is typically done through pen-based interfaces, where the author writes with an individual pen on an electronic tablet. Therefore, the online method usually has a higher complexity due to the dynamics involved.

However, the offline mode does not seem simple either, due to the variation of handwriting and several unprecedented hand writings. Due to such wide applications, OCR has been a great research topic and has been advancing by leaps and bounds. [2] shows us the historical development of OCR and how OCR has turned out in its modern form. Such an overview provided us with a firm understanding of OCR.

Now for the implementation of OCR one actually has a myriad of options. As observed in [3], one can implement OCR without segmentation. This process has both its merits and demerits. One of the biggest cons of this process is that it is not very comprehensive. Similarly, in [4] we can see that support vector machines (SVM) can also be used for OCR. Further, K-nearest neighbor (KNN) has also been used for the same implementation. KNN is a smart way to perform OCR but has its limitations because it is, after all, only a classification algorithm and hence fails to provide us adequate insights [5].

Given the advancements of technologies and the patent shortcomings of the above discussed ideas, a holistic solution is provided by Convolutional Neural Networks. This paper explores the ways in which the limitations of [3], [4], [5] are tackled and how the paper has been able to do the same with a greater level of accuracy. As discussed before, the main aspect of solving the limitations is the usage of NN. Text is an arbitrary sequence of characters, and for that reason one requires a higher accuracy. This problem is efficiently solved by using a Recurrent Neural Network (RNN).

III. METHODOLOGY

As discussed above, OCR has been in research for a long time, and the accuracy of every model is increasing and implementation is improving day by day. Earlier, for handwritten text recognition, the widely acclaimed hidden Markov models (HMM) were used [6]. HMMs are basically a set of probabilistic tools for dealing with sequences. But after the development of Deep Learning, the HMM has become obsolete. Deep Learning has made up for the shortcomings of HMM. There are several alternatives to the HMM [7], which have hence increased the accuracy of the model. In this paper, CNN and RNN, i.e. Convolutional Neural Networks and Recurrent Neural Networks, have been used, as depicted in Fig.2.

Fig. 2. Basic Model Overview.

As we can clearly see in Fig.2, the model consists of the following layers:

1) CNN (Convolutional Neural Networks): CNN is a type of neural network which consists of an input and an output layer with several hidden layers between them. These layers are colloquially known as convolutions, but technically they are sliding dot products or cross-correlations. The activation function used is the Rectified Linear Unit (RELU) function, which basically serves the purpose of a rectifier. This is subsequently followed by various other layers such as pooling layers, fully connected layers, and normalization layers. The usage of RELU is opted for because of its patent advantages [8].

A convolution is designed such that the kernel is defined by its width and height (hyper-parameters) and the number of input and output channels (hyper-parameters); the depth of the convolution filter (the input channels) must be equal to the number of channels (depth) of the input feature map.

Fig. 3. Overview of a CNN with the RELU activation model.

The aforementioned hyper-parameters then control the size of the output of the convolution layers. Firstly, the number
of neurons in the input region is controlled by the depth. After training, they learn to activate for specific parts of the input. Secondly, how the depth columns along the spatial dimensions are organized is controlled by the stride. As the stride length is increased, the spatial distancing is also increased, and the data passed and received by the receptive layer is also increased. Lastly, the output and the input size must always be matched; hence padding plays a vital role in the CNN. These are the three hyper-parameters.

After all the hyper-parameters have been devised and specified, the number of hidden layers is then decided. In case of a linear relationship between input and output, a single hidden layer is present. This forms a simple neural network. With more than one hidden layer, it forms a deep neural network. In this paper the number of CNN layers implemented is 5. Each input can be taken as an individual neuron, and each input has a relation to a neuron in a hidden layer. Each mapping has a different weight or bias. These weights are then summed, and this sum is passed through the RELU layer. Further, the RELU layer is responsible for removing the negative values from an activation map by setting them to zero. By doing so, the decision making of the non-linear properties of the decision function becomes a lot simpler.

writers. This collection helps to classify the writers based on their respective handwriting. The data set is fairly big and has complex handwriting, with a lot of punctuation. The corpus of the IAM dataset has been formulated by assigning to each word a unique word-id, a gray level to binarize the line containing the word, a bounding box around the word in the form of Cartesian coordinates along with width and height (x, y, w, h), a grammatical tag, and the actual label for the word. The features used are the word-id, the gray level, the bounding box dimensions, and the actual label for the word.

The partition methodology has been used in order to implement the model. Firstly, the data-set is split for training, in which 70 percent of the corpus is used and the model sees and learns from the data. In this phase the foundations of the weights and biases of the NN are determined. Secondly, the test data-set, comprising the remaining 30 percent of the data-set, is taken for the final evaluation of the model.
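The 70/30 train-test partition described above can be sketched in plain Python. This is an illustrative sketch, not the paper's code; the helper name, the seed, and the stand-in word-ids are our assumptions:

```python
import random

def partition(samples, train_frac=0.7, seed=0):
    """Shuffle and split the corpus: 70 percent for training (where the model
    learns its weights and biases), 30 percent held out for final evaluation."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(samples)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_frac)
    return shuffled[:cut], shuffled[cut:]

word_ids = ["word-%04d" % i for i in range(100)]  # hypothetical stand-ins for IAM word-ids
train, test = partition(word_ids)                 # 70 training ids, 30 test ids
```

The shuffle before the cut matters: without it, a corpus sorted by writer would put some writers only in the test set.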
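The convolution ("sliding dot product") and RELU operations described in the methodology can be made concrete with a minimal plain-Python sketch. This is an illustration only, not the paper's TensorFlow implementation, and the function names are ours; the output-size rule in the docstring is the standard (W - K + 2P) / S + 1 relation governed by the kernel size, stride, and padding hyper-parameters discussed above:

```python
def conv2d(image, kernel, stride=1, pad=0):
    """2-D cross-correlation (the "sliding dot product" of a CNN layer).
    Each spatial output dimension is (W - K + 2P) // S + 1."""
    if pad:  # zero-pad all four borders so input and output sizes can match
        w = len(image[0]) + 2 * pad
        padded = [[0.0] * w for _ in range(pad)]
        padded += [[0.0] * pad + list(row) + [0.0] * pad for row in image]
        padded += [[0.0] * w for _ in range(pad)]
        image = padded
    kh, kw = len(kernel), len(kernel[0])
    oh = (len(image) - kh) // stride + 1
    ow = (len(image[0]) - kw) // stride + 1
    out = [[0.0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            r, c = i * stride, j * stride
            out[i][j] = sum(image[r + u][c + v] * kernel[u][v]
                            for u in range(kh) for v in range(kw))
    return out

def relu(x):
    """RELU: negative activations are set to zero."""
    return [[max(v, 0.0) for v in row] for row in x]

img = [[1.0, -2.0, 3.0],
       [0.0,  1.0, -1.0],
       [2.0,  0.0,  1.0]]
k = [[1.0, 0.0],
     [0.0, 1.0]]
feat = relu(conv2d(img, k))  # [[2.0, 0.0], [0.0, 2.0]]
```

With pad=1 the same 3x3 input yields a 4x4 map, showing how padding keeps the output from shrinking.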
Fig. 6. Image Padding: An arbitrary sized image padded with white space to fit the target image size of 128x32.

Fig. 7. Top is the Input image. Bottom is the output given by the code, which includes the image detected and its probability.

3) Recurrent Neural Network (2 Layers): Recurrent Neural Networks (RNN) are used for word prediction. An RNN is required because a memory element is needed to predict the word based on the current feature and the previous feature. Thus a BLSTM (Bidirectional Long Short-Term Memory) network, a variant of the RNN, is used in implementing the model [10]. Fig.8 shows the output from the trained RNN network, which correctly determines the handwritten text "mihir" with a probability of 0.4865. The Recurrent Neural Network implements the Bi-directional LSTM. The ADAM optimization algorithm is used for updating the weights iteratively. The letters are, most of the time, predicted at the locations where they appear; the labels, including the CTC blank label, with the highest confidence score, i.e. above the threshold, when concatenated together give us the required letters "m-i-h-i-r". This is done dynamically, and hence the procured scores are the scores starting from the first letter 'm' until the CTC blank character is obtained. Also, Fig.9 showcases the performance of the model when it encounters bad handwriting.

Fig. 9. Top is the Input image. Bottom is the output given by the code, which includes the image detected and its probability.

VI. EXPERIMENTAL ANALYSIS AND RESULTS

The CNN and RNN model was trained for 1000 epochs with a batch size of 16, was saved separately in the .h5py file format, and was later used for model evaluation and text prediction. The evaluation results are summarized below (all figures are percent error):

             Default          With Punctuation    Without Punctuation
Params       CER     WER      CER      WER        CER     WER
820k (CTC)   11.29   37.34    10.23    29.099     7.52    19.11
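The concatenation of best-scoring labels up to the CTC blank character, described above for "m-i-h-i-r", corresponds to best-path CTC decoding: merge repeated labels, then drop the blanks that separate genuine repetitions. A minimal sketch (the per-frame label sequence below is hypothetical, not the model's actual output):

```python
BLANK = "-"  # stand-in symbol for the CTC blank label

def ctc_best_path_decode(frame_labels):
    """Collapse a per-frame best-path labelling into text: a label is emitted
    only when it differs from the previous frame and is not the blank."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# Hypothetical per-frame argmax labels for the word "mihir":
frames = ["m", "m", "-", "i", "-", "h", "h", "-", "i", "i", "-", "r", "-"]
word = ctc_best_path_decode(frames)  # "mihir"
```

Note why the blank exists: without it, a genuine double letter could not be distinguished from one letter held over two frames.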
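The CER and WER figures reported in the table above are conventionally computed from the Levenshtein edit distance between the predicted and reference text, normalized by the reference length. An illustrative sketch, not the paper's evaluation code:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance by dynamic programming over two rows."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def error_rate(ref, hyp):
    """CER when given character sequences, WER when given word lists,
    as a percentage of the reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)

cer = error_rate("mihir", "m1hir")                        # one substitution in 5 chars -> 20.0
wer = error_rate("the quick fox".split(), "the quik fox".split())  # 1 wrong word of 3
```

The same function serves both metrics; only the granularity of the tokens changes.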
The model was evaluated on the IAM test dataset, which had 860 lines consisting of 6960 words and a total of 38429 characters. The decoding of the text was done using the CTC (Connectionist Temporal Classification) network [11], which is used to complement the RNN. As a BLSTM network is used, the timing of the incoming non-blank feature is not constant; hence we need this complementary network to detect the stream of characters that can be collectively identified as a word. Hence, to identify a stream of characters as a word, we wait for the CTC blank character. The character error rate (CER) and the word error rate (WER) were evaluated for the test data-set. The model consisted of 820k parameters, and it was observed that the average character error rate (CER) came out to be 11.29 percent and the average word error rate (WER) came out to be 37.34 percent. Sentences with punctuation marks had an average CER of 10.23 percent and a WER of 29.099 percent. Sentences without punctuation marks had an average CER of 7.52 percent and a WER of 19.11 percent; it was observed that about 20 percent of the overall error rate was contributed by punctuation marks. Thus, if the punctuation marks are disregarded, the rate of recognition improves by about 5 percent. Fig.10, Fig.11 and Fig.12 show the text predictions that were obtained from test data images. The first image is the handwritten text, which is the test data image given to the input layer of the HTR model. TEL is the label associated with the image sample, i.e. the actual handwritten text. TEP is the predicted text by the HTR model.

Fig. 10. Handwritten text prediction from test input image.

VII. CONCLUSION

The Handwritten Text Recognition model was implemented using a 5-layer CNN and 2 layers of RNN with a BLSTM network. CNN is an efficient way of extracting features from a raw image. The image has to be properly padded for greater efficiency. The gray color normalization in the pre-processing stage ensures efficiency in feature extraction. If we increase the number of epochs of training with a lower batch size, we can increase the accuracy of feature extraction from the input image, but this would increase the training time. RNN was used for text prediction because CNN is used only for feature extraction; there was a requirement for a neural network with memory, i.e. BLSTM. The CTC (Connectionist Temporal Classification) decoding is taken into account for the final text prediction as the timing of the input sequence is variable. Hence, we wait for the CTC blank character to identify a set of characters as a word. The CER and WER were computed for inline sentences in the image, which, in this case, act as performance metrics to judge the model accuracy. Looking at ways to improve the paper, the proposed research work can focus on the following concepts. Firstly, with a larger data-set containing even more comprehensive handwriting data, the accuracy of the model can be easily increased. Secondly, given the capability of mobile devices, a mobile application can be developed to increase portability and accessibility. Lastly, the applications of such devices, i.e. the handwriting detector, can be customized to match the specific demands and needs of the user. Owing to these needs, the software or algorithm can be customized, and the users can benefit from these highly specialized features. For example, if the handwriting detector is put to use for detecting cheque forgery [12] [13], then the data set needs to be modified with respect to signatures and other aspects. Further, for the implementation of text to speech and added vocalization, as seen in [14], the model implemented in this paper can be enhanced. Another application for a similar cause can be seen in [15], where a text recognition model has been combined with face detection for blind people.

VIII. REFERENCES