Abstract— Many researchers have contributed to automating optical character recognition, but handwritten character recognition is still an unsolved task. In this paper we propose two techniques for recognizing handwritten Kannada script that yield high accuracy compared to previous work. There are many challenges in recognizing handwritten Kannada script. A few of them: every person has their own handwriting, and there is no uniform spacing between alphabets, words and lines. Another major problem for Kannada is that no large dataset is available for training a recognition system, and it is challenging to write out every combination of each alphabet in the Kannada script. In the proposed work, we gathered a handwritten training set from the Web and from the students of our campus and segmented each letter. We propose two methods for recognizing handwritten Kannada characters: the first uses the Tesseract tool, and the second uses a Convolutional Neural Network (CNN). With the Tesseract tool we achieved 86% accuracy, and with the CNN we achieved 87% accuracy, which may improve with the dataset chosen and further enhanced image processing. The main idea behind this work is to extract text from scanned images, identify the Kannada letters in them accurately, and display or store the result for further use.

Keywords— Handwritten character recognition, OCR, Kannada handwritten script recognition, character classifier, CNN.

I. INTRODUCTION

Kannada is one of the main Dravidian languages, spoken mainly by the people of the Karnataka region in India. The language has roughly 43.7 million native speakers and is written using the Kannada script [11]. Each person has a different handwriting, though there are similarities. The task of handwritten script recognition is to identify these handwritten lines and store them as a text file. Processes such as OCR (Optical Character Recognition) already help with new-language recognition, but these methods work well only on printed words.

Several minor languages of Karnataka, such as Tulu, Konkani, Kodava, Sanketi and Beary, also use scripts based on the Kannada script. The Kannada script consists of forty-nine letters. It is written from left to right and consists of vowels and consonants. Letters representing consonants are combined to form digraphs when there is no intervening vowel; otherwise, each letter corresponds to a syllable. Each letter has its own form and sound, providing a visual and an audible representation respectively.

In the proposed work, we focus mainly on handwritten Kannada script. The first step towards handwritten character recognition is collecting the dataset. The next step is to prepare the training dataset by applying filters and edge-detection techniques so that unwanted information can be discarded from the images. The proposed method uses two techniques. The first uses the Tesseract tool, an optical character recognition engine. The second classifies the alphabets with a Convolutional Neural Network [12]. Both models are trained on the dataset and evaluated on test data. The results of the two methods were compared in terms of efficiency, ease of use and flexibility. The proposed CNN technique gives better results than the Tesseract tool for the dataset used in the experiment.

II. RELATED WORK

Here we discuss previous work related to handwritten script recognition for Kannada. A literature study was carried out to understand the drawbacks of existing techniques for Kannada handwritten character recognition. The survey covers previous work on English character recognition and the current tools available for text recognition, together with a detailed study of printed Kannada script recognition and of how to segment handwritten Kannada script. In paper [1], the authors discussed training the Tesseract tool for handwritten Odia script; they used a very small dataset for their experiment.

Segmenting and preparing a Kannada handwritten dataset is discussed in paper [2]. The authors focused mainly on creating a standard dataset to provide a framework for other related research and to help overcome the lack of a dataset for the Kannada character set. Recognition of printed and handwritten Kannada numerals using the SVM (Support Vector Machine) technique is discussed in [3]. The authors identified handwritten numerals by mixing printed and handwritten datasets and classifying them with an SVM; the experiment concentrated only on numerals, not on how to train for the whole language. A review of the tools available in the market and their efficiency is given in paper [5]. Its authors used the k-nearest-neighbour algorithm for classification, and the paper also reviews different feature-extraction techniques for script recognition. Papers [6] and [7] discuss deep learning techniques for handwritten English alphabets: the authors of [6] used neural networks, while the authors of [7] used a zoning technique for feature extraction combined with a PNN classifier. Similar approaches can be applied to Kannada as well, but one should keep in mind that the number of character classes in Kannada is very large compared to English; on the other hand, Kannada does not have joined letters the way English handwriting does, which works in our favour for segmentation. In article [9], the authors worked on online handwritten Kannada script and trained a model to recognize it efficiently using the Statistical Dynamic Time Warping (SDTW) technique. Real handwritten data, however, has more randomness: there is no uniformity between words, and a person may write the same letter in several shapes within a single page. In the proposed system we worked with the Tesseract tool, which has high accuracy and is best suited for script recognition.
A. Image Collection

This is the first stage in both models, wherein we collected different handwritten image datasets from various sources. Kannada handwritten documents are available on the Internet and can be converted into sets of images. We also collected Kannada handwritten documents from the students of our campus. These documents are written manually by the students, which is not the preferred source, since one cannot sit and write out all possible alphabets in Kannada; this data is used only for testing the model and is not a standard training dataset. Hence there is a lot of opportunity for improvement when the work is repeated with a standard dataset.

B. Preprocessing

This is the stage in which the images are prepared for training. The Tesseract tool works only on gray-scale images, so the collected documents are first converted to grey scale and the text area is detected (Fig. 3.4). Once the text is detected, it is time to break it into single-character images. This is achieved by cropping the image section covered by each set of coordinate values obtained from the contours. Fig. 3.5 shows the output images after cropping.

Fig. 3.4 Text detection

Fig. 3.5 Cropped letters from the image as a new segmented set of images

C. Training using the Tesseract tool

Images with multiple copies of the same character are preferred, since the system will then learn the different forms of the same character. Name each input image as per the Tesseract convention, i.e.,

<language_name>.<font_name>.exp<number>.<file_extension>

In our case it is kan.sample.exp0.png. Name multiple images of the same font by changing the number field. The next step is to generate a box file for each of these image files:

for each image in the training data set
    tesseract image_file base_name batch.nochop makebox
endfor
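As a concrete illustration, the loop above can be scripted in Python roughly as follows (a minimal sketch; the folder name and glob pattern are assumptions, not from the paper):

    import subprocess
    from pathlib import Path

    # Generate a .box file for every training image, e.g. kan.sample.exp0.png.
    # "batch.nochop makebox" is the standard Tesseract config for box generation.
    for image in sorted(Path("train_images").glob("kan.*.exp*.png")):
        base = image.with_suffix("")  # tesseract appends .box itself
        subprocess.run(["tesseract", str(image), str(base), "batch.nochop", "makebox"],
                       check=True)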
Now comes the manual part: copy each image to the location of its generated box file and open it with qt-box-editor to verify and correct the character boxes. Next, describe the training font in a font_properties file, whose entries have the syntax:

font_name italic bold monospace serif fraktur

Now the actual training:

mftraining -F font_properties -U unicharset -O kan.unicharset kan.sample.exp0.tr
cntraining kan.sample.exp0.tr

Do this for all the .tr files, merge the outputs into one inttemp, normproto, pffmtable and shapetable file respectively, and delete the rest. Rename the generated inttemp, normproto, pffmtable and shapetable files with kan. as a prefix, as shown in Table I.

TABLE I. RENAMING FILES

    file names      modified file names
    unicharset      kan.unicharset
    inttemp         kan.inttemp
    pffmtable       kan.pffmtable
    normproto       kan.normproto
    shapetable      kan.shapetable

Combine all the files carrying the kan. prefix into kan.traineddata:

combine_tessdata kan.

Now the trained file is ready. Copy it to

$TESSERACT_INSTALLATION_DIR/tessdata/

and it can be used by any Kannada recognition application just by setting the language argument to kan. We can embed all of these commands in Python and run the whole pipeline in one go, pausing only after box-file creation for the manual correction step.
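A sketch of that end-to-end automation, assuming the classic Tesseract 3.x training pipeline (the unicharset_extractor step and the exact flags are standard Tesseract usage rather than commands spelled out in the paper):

    import subprocess

    def run(*cmd):
        print("$", " ".join(cmd))
        subprocess.run(cmd, check=True)

    base = "kan.sample.exp0"  # one page of the training set

    # Box files are assumed to exist already and to have been corrected
    # manually in qt-box-editor (the one step that cannot be automated).
    run("tesseract", f"{base}.png", base, "box.train")   # emits the .tr feature file
    run("unicharset_extractor", f"{base}.box")           # emits unicharset

    # Cluster features; font_properties must list the training font.
    run("mftraining", "-F", "font_properties", "-U", "unicharset",
        "-O", "kan.unicharset", f"{base}.tr")
    run("cntraining", f"{base}.tr")

    # After renaming the outputs with the kan. prefix (Table I), pack them.
    run("combine_tessdata", "kan.")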
D. Training using CNN techniques
This method builds an image classifier using a Convolutional Neural Network, implemented in Python with TensorFlow. The basic idea is to classify an image of each character of the Kannada script during training, assigning it a label; when a new image is given, the classifier should tell precisely which character it is. The same idea is then applied to identify whole words and lines of Kannada script.
Divide the dataset in the following ratio: if 100 images are available for each alphabet in the Kannada script, split them 80:20, with 80 images as training data and 20 as test data. Store these images in separate testdata and traindata folders without changing the image names, because we identify the different images after comparison by their labels. The first step in the training phase is to label each character. We followed the one-hot-label encoding technique: we create a 1*n-dimensional array, where n denotes the number of labels. Nearly 800 combinations are generated for Kannada, but here we have taken only 5 letters for demonstration and hence use a 5*1 array. The array is ordered, and each index denotes a unique letter; if an input image's label is matched, the corresponding index value is changed from 0 to 1.

encodeList = [0] * number of alphabets in Kannada

one_hot_label(image):
    set label from the image filename
    if label:
        set encodeList[label] = 1
    endif
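In Python this amounts to a few lines; the filename convention (label as filename prefix) and the letter names are assumptions for illustration:

    from pathlib import Path
    import numpy as np

    LETTERS = ["a", "aa", "i", "ii", "u"]  # placeholders for the 5 demo letters

    def one_hot_label(image_path: Path) -> np.ndarray:
        # Derive the label from the file name, e.g. "a_017.png" -> "a".
        label = image_path.stem.split("_")[0]
        encode_list = np.zeros(len(LETTERS), dtype=np.float32)
        encode_list[LETTERS.index(label)] = 1.0
        return encode_list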
The function test_data_with_label converts the image data into a numpy array of size 36*36. The original image can be of any size, but the function resizes it to 36*36. In general the size can be any n*n, depending on the accuracy obtained; for example, one could use 16*16 or 128*128. During the training phase of our experiment we got the best results with a 36*36 image size. We convert the image into a pixel array and encode it with its one-hot label. train_data is the path where all the images for training are stored.

train_data_with_label():
    initialize list
    shuffle images in dataset
    for each image in the path
        convert image to greyscale
        resize image (usually 28*28; based on accuracy one can do trial and error)
        encode image with one-hot label and convert image to np array
        append to list
    endfor

In the above algorithm, the first step is to shuffle the image dataset; the reason for shuffling is to improve the randomness of the data. The next step is to build the training model. For model creation we used Keras, a library available for CNN model creation. We need to import the sequential model, the layers and the optimizers from Keras, and then load the data into a variable by calling the train_data_with_label function.
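A minimal runnable sketch of this loader, assuming OpenCV for the image handling, the one_hot_label helper sketched above, and a placeholder folder name; the 36*36 size follows the paper's best result:

    import random
    from pathlib import Path

    import cv2
    import numpy as np

    IMG_SIZE = 36  # 36*36 gave the best results in the experiment

    def train_data_with_label(train_dir="train_data"):
        images = list(Path(train_dir).glob("*.png"))
        random.shuffle(images)  # improve the randomness of the data
        data = []
        for image_path in images:
            img = cv2.imread(str(image_path), cv2.IMREAD_GRAYSCALE)  # greyscale
            img = cv2.resize(img, (IMG_SIZE, IMG_SIZE))
            data.append([np.array(img, dtype=np.float32) / 255.0,   # pixel array
                         one_hot_label(image_path)])                # one-hot label
        return data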
In our experiment we used 3 convolution layers with single stride, zero padding and relu activation. The filter count is increased in each layer, so that the early layers capture the broad features of the image while the deeper layers capture finer details, for which more filters are needed.

Training Model():
    Model: Sequential
    layer one:   filters=32, kernel size=5, strides=1, padding=same, activation='relu'
    layer two:   filters=50, kernel size=5, strides=1, padding=same, activation='relu'
    layer three: filters=80, kernel size=5, strides=1, padding=same, activation='relu'
    Maxpool: pool size=5
    Dropout = 0.25
    flatten and dense layers
    softmax activation
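Expressed with Keras, the outline above corresponds roughly to the following sketch; the layer sizes are the paper's, while the input shape assumes the 36*36 grey-scale images described earlier:

    from keras.models import Sequential
    from keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

    NUM_CLASSES = 5  # letters used in the demonstration

    model = Sequential([
        Conv2D(32, kernel_size=5, strides=1, padding="same", activation="relu",
               input_shape=(36, 36, 1)),
        Conv2D(50, kernel_size=5, strides=1, padding="same", activation="relu"),
        Conv2D(80, kernel_size=5, strides=1, padding="same", activation="relu"),
        MaxPooling2D(pool_size=5),
        Dropout(0.25),                   # guards against over-fitting
        Flatten(),
        Dense(512, activation="relu"),   # first dense layer
        Dense(NUM_CLASSES, activation="softmax"),  # one neuron per letter
    ])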
After the three convolution layers we use one dropout layer to avoid the over-fitting problem. Once an image has passed through the convolution layers, it has to be flattened and fed into fully connected dense layers. In the proposed work, 2 dense layers are used. The first dense layer has 512 neurons and relu activation; the neuron count for this layer can be arbitrary, and for our experiment we identified 512 neurons as the best fit. The second dense layer has as many neurons as there are letters to recognize in the Kannada script; in our experiment we used 5 letters.

This layer uses softmax activation, which calculates the probability of each target alphabet over all possible alphabets in the Kannada script; based on the highest probability value, the model decides which letter it is.
Since any word in an image is made of a combination of letters, one has first to extract the script in the image, break it into images of the characters in it, order them the same way as they appear in the actual image, and then feed them to the deep learning model, which recognizes each letter. This text extraction can be done with the help of thresholding and contours, available through the Python Image Library (PIL).

We followed Otsu thresholding to enhance the text area in the image, followed by contour detection. Below is the pseudo code for obtaining the image region covered by each contour. In the proposed experiment we ignored regions with width or height less than 5 and with height more than 200, on the assumption that no one writes Kannada script in such big or small handwriting; these limits may be changed as per the user's need.
initialize a list
for each contour in contours:
    get rectangle bounding the contour
    store the bounding box (x, y, w, h) in a 2D list
endfor
sort the list based on the values of x and y
for each box in the list:
    if height is greater than 200 then
        continue
    endif
    if width and height are greater than 5 then
        crop the contour region from the image
        save the image
    endif
endfor
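A working version of this segmentation step, sketched with OpenCV (cv2) rather than PIL, since Otsu thresholding and contour extraction map directly onto cv2.threshold and cv2.findContours; the size bounds are the ones stated above:

    import cv2

    def segment_characters(page_path):
        grey = cv2.imread(page_path, cv2.IMREAD_GRAYSCALE)
        # Otsu thresholding; invert so that the ink becomes foreground.
        _, binary = cv2.threshold(grey, 0, 255,
                                  cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)
        contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4 API

        boxes = [cv2.boundingRect(c) for c in contours]
        boxes.sort(key=lambda b: (b[1], b[0]))  # rough reading order: top first
        crops = []
        for i, (x, y, w, h) in enumerate(boxes):
            if h > 200 or w < 5 or h < 5:  # implausibly large or small handwriting
                continue
            crop = grey[y:y + h, x:x + w]
            crops.append(crop)
            cv2.imwrite(f"letter_{i:03d}.png", crop)
        return crops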
IV. RESULTS

One can run some sample image inputs and inspect the results manually to check the accuracy:

print(image_to_string(Image.open(path), lang='kan'))

Below are a sample input (Fig. 4.1a) and the corresponding output (Fig. 4.1b) for the Tesseract model trained on handwritten Kannada script. Only the last letter is mismatched (it should have been the Kannada vowel 'ǀ' instead of 'o'), and this can be eliminated by training with more varied input data.

Fig. 4.1a) Input image b) Output from the Python IDE for the Tesseract model

In the deep learning technique we used the Adam optimizer with categorical_crossentropy as the loss function and a learning rate of 0.001. We train our model over a number of epochs, that is, over a different image set of each character; with every epoch the model adjusts its parameter values to minimize the loss. Fig. 4.2 is a snapshot from the starting stage of training, when the classes are few; as the training process continues, accuracy drops slowly.
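For completeness, the training configuration just described corresponds roughly to the following Keras calls, reusing the model and loader sketched in section D (the epoch count, batch size and validation split are placeholders, not values reported in the paper):

    import numpy as np
    from keras.optimizers import Adam

    data = train_data_with_label()
    x = np.array([img for img, _ in data]).reshape(-1, 36, 36, 1)
    y = np.array([label for _, label in data])

    model.compile(optimizer=Adam(learning_rate=0.001),  # learning rate from the paper
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(x, y, epochs=10, batch_size=32, validation_split=0.2)  # placeholders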