
Kannada Handwritten Script Recognition using Machine Learning Techniques


Roshan Fernandes Anisha P Rodrigues
Department of CS&E Department of CS&E
NMAM Institute of Technology NMAM Institute of Technology
Nitte, India Nitte, India
[email protected] [email protected]

Abstract— Many researchers have contributed to automating optical character recognition, but handwritten character recognition is still an unsolved task. In this paper we propose two techniques to recognize handwritten Kannada script, which yield high accuracy compared to previous works. There are many challenges in recognizing handwritten Kannada script: each person has their own handwriting, and there is no uniform spacing between letters, words and lines. Another main problem for the Kannada language is that no large dataset is available to train the recognition system, and it is challenging to write all combinations of each letter in Kannada script. In the proposed work, we gathered the handwritten training set from the Web and from the students of our campus and segmented each letter. We propose two methods to recognize handwritten Kannada characters. The first technique uses the Tesseract tool, and the second uses a Convolution Neural Network (CNN). With the Tesseract tool we achieved 86% accuracy and with the Convolution Neural Network we achieved 87% accuracy, although this might improve with the data set chosen and with further enhanced image processing. The main idea behind this work is to extract text from scanned images, identify the Kannada letters in it accurately, and display or store it for further usage.

Keywords— Handwritten character recognition, OCR, Kannada handwritten script recognition, character classifier, CNN.

I. INTRODUCTION

Kannada is one of the main Dravidian languages, spoken mainly by people of the Karnataka region in India. The language has roughly 43.7 million native speakers and is written using the Kannada script [11]. Each person has a different handwriting, but there will be similarities. The task of handwritten script recognition is to identify the handwritten lines and store them as a text file. There are already processes like OCR (Optical Character Recognition) to help with the recognition of any new language, but these methods work very well only on printed words.

Several minor languages in Karnataka, like Tulu, Konkani, Kodava, Sanketi and Beary, also use scripts based on the Kannada script. The Kannada script consists of forty-nine letters. It is written from left to right and consists of vowels and consonants. Letters representing consonants are combined to form digraphs when there is no intervening vowel. Otherwise, each letter corresponds to a syllable. Each letter has its own form and sound, which provide its visual and audible representation respectively.

In the proposed work, we mainly focus on handwritten scripts of the Kannada language. The first step towards handwritten character recognition is collecting the data set. The next step is to prepare the training dataset by applying filters and edge detection techniques so that unwanted information in the image can be discarded. The proposed method uses two techniques. The first technique uses the Tesseract tool, which is an optical character recognition engine. The second technique classifies letters through a Convolution Neural Network [12]. Both models are trained on the dataset and evaluated on test data. Results were compared for both methods in terms of efficiency, ease of use, and flexibility. The proposed CNN technique gives better results than the Tesseract tool for the data set used in the experiment.

II. RELATED WORK

Here we discuss previous work related to handwritten script recognition for Kannada. The literature study was carried out to understand the drawbacks of existing techniques for Kannada handwritten character recognition. The literature survey includes previous work on English character recognition and the tools currently available for text recognition. A detailed study of printed Kannada script recognition and of how to segment handwritten Kannada scripts was carried out. In paper [1], the authors discussed training the Tesseract tool on handwritten Odia script. The authors used very little data to conduct the experiment.

Segmenting and preparing a Kannada handwritten data set is discussed in paper [2]. The authors mainly focused on creating a standard data set that provides a framework for other related research and helps to overcome the lack of data sets for the Kannada character set. Recognition of printed and handwritten Kannada numerals using the SVM (Support Vector Machine) technique is discussed in [3]. The authors worked on identifying handwritten numerals by mixing printed and handwritten datasets and classified them using SVM. The experiment concentrated only on numerals and not on how to train for the whole language. A review of the tools available in the market and their efficiency is given in paper [5]. The authors used the K-nearest neighbor algorithm for classification, and the paper also reviews different feature extraction techniques for script recognition. In papers [6] and [7], deep learning techniques for handwritten English letters are discussed. In paper [6], the authors used neural networks, and in paper [7], the authors used a zoning technique for feature extraction combined with a PNN classifier. Similar approaches can be applied to the Kannada language too, but one should keep in mind that the number of classes in the Kannada character set is very large compared to English. Unlike English handwriting, Kannada does not have joined letters, which works in our favor for easy segmentation. In article [9], the authors worked on online handwritten Kannada script and trained a model to recognize it in order to improve efficiency. The authors used the Statistical Dynamic Time Warping (SDTW) technique. But real handwritten data has more randomness, and there is no uniformity between words. A person can write the same letter in multiple shapes even within a single page. In the proposed system we have worked on the Tesseract tool, which has high accuracy for script recognition and is best suited

978-1-7281-3735-3/19/$31.00 ©2019 IEEE


for Optical Character Recognition (OCR), and on a second approach that classifies data through a Convolution Neural Network.

III. METHODOLOGY

The proposed methodology uses two approaches to train on the handwritten Kannada text images. Fig 3.1 shows the flow diagram of the proposed method. Initially the text images are collected mainly from online sources as handwritten PDF documents, which are then converted to images. The contour technique is applied to detect the text in the image document. Through image processing one can easily divide paragraphs into images of each letter. These images are separated and stored as per their pronunciation in Kannada. Text images are also collected from 250 different students from a classroom. For the Tesseract tool we need different combinations of the same letter in the same image, whereas for the CNN we must break the words into individual letters. The images are then supplied to the preprocessing stage, where each image is converted to a gray scale image and the text area is enhanced.

Fig. 3.1 Flow diagram of the Proposed System

A. Image Collection

This is the first stage in both models, wherein we collected handwritten image data sets from various sources. Kannada handwritten documents are available on the Internet and can be converted to sets of images. We also collected Kannada handwritten documents from our campus students. These documents are manually written by students. But this is not preferred, since one cannot sit and write all possible letters in Kannada. This data is used only for testing the model and is not a standard data set used for training. Hence there is a lot of opportunity for improvement when the work is done with a standard dataset.

B. Preprocessing

This is the stage wherein images are prepared for training. The Tesseract tool works only on gray scale images, and hence our first step in preprocessing is converting the image to a gray scale image. Since the data is handwritten, the images will have unwanted small dots (Fig 3.2a) and other noise which might affect the training process. So we remove it by applying a denoising technique. The output image after denoising is shown in Fig 3.2b. This adds a smoothing effect to the images. Depending on the quality of the images, image sharpening may also be needed.

Fig. 3.2 a) Input image with noise b) Denoised image

The next step in image preprocessing is erosion. Erosion is applied to decrease the thickness of each object in the image so that there is enough space between objects and they are easy to isolate. The output of this step (Fig 3.3) can then be passed to the Tesseract training process. But before that one must convert the image to TIFF from any other type such as PNG or JPEG.

Fig. 3.3 Image after erosion. One can see clear separation between each letter.

For the CNN method we have to break the images into individual characters and arrange the images alphabetically according to the Kannada script. For this we must first detect the text area in the image, for which we used the contours technique. It detects intensity variation, shapes and objects in the image and returns the x and y co-ordinates, width and height of each detected letter (Fig 3.4).

Fig. 3.4 Text detection

Once the text is detected, it is broken into single-character images. This is achieved by cropping the image section covered by each set of co-ordinate values obtained from the contours. Fig 3.5 shows the output images after cropping.
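The erosion and cropping steps described in this section can be sketched in plain Python as follows. This is only an illustration under stated assumptions, not the code used in the experiment: the image is assumed to be already binarised into a nested list with 1 for ink and 0 for background, erosion uses a 3x3 square neighbourhood, and the helper names `erode` and `crop_letters` are hypothetical. In practice an image processing library would supply both operations.

```python
def erode(mask):
    """Thin the strokes of a binary image (1 = ink, 0 = background).

    A pixel stays 1 only if it and its eight neighbours are all 1;
    border pixels are cleared because part of their neighbourhood is missing.
    """
    height, width = len(mask), len(mask[0])
    out = [[0] * width for _ in range(height)]
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            if all(mask[r + dr][c + dc]
                   for dr in (-1, 0, 1) for dc in (-1, 0, 1)):
                out[r][c] = 1
    return out


def crop_letters(image, boxes):
    """Cut one sub-image per (x, y, w, h) box, as returned by contour detection."""
    return [[row[x:x + w] for row in image[y:y + h]] for (x, y, w, h) in boxes]


# A 5x5 blob of ink shrinks to its 3x3 core after one erosion pass.
blob = [[1] * 5 for _ in range(5)]
thin = erode(blob)

# Two hypothetical boxes on a tiny 3x4 image whose values mark pixel positions.
page = [[0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11]]
letters = crop_letters(page, [(0, 0, 2, 2), (2, 1, 2, 2)])
```

Repeating the erosion pass thins the strokes further, so the number of passes would have to be tuned to the stroke width of the scanned handwriting.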
Fig. 3.5 Cropped letters from the image as new segmented set of images

Now store each letter in its respective folder. The folders can be named as per the pronunciation of the Kannada letters. An example of this step is shown below:
ಅ -> a, ಆ -> aa, ಉ -> u, ಈ -> ee, ಯ -> ya

Each letter had about 80 images with different handwriting samples of different size and texture. Even though we could not get the same number of images for each combination of consonant letter with diacritic, the same methodology can be applied to all Kannada words.

After preprocessing, in the Tesseract model we have to create a box file for each image. Training then uses a series of bash commands, which can be run one after the other through Python code. At the end one has to copy the trained data folder to the Tesseract workspace and can test new images with the pytesseract library.

In the machine learning approach, before passing images into the CNN layers, we have to encode the training data as one-hot labels, since the machine understands neither the images nor any text. Then we train the model. We need a large data set for training the model. At the end, when we want to test the model on Kannada handwritten characters in a random image, one has to preprocess this image as discussed earlier. Further, the proposed technique involves arranging the letters and sorting them based on image position. Each letter is then supplied to the model for prediction, and one has to differentiate between word, paragraph and newline.

C. Training using Tesseract Method

Tesseract is an optical character recognition engine supported by various operating systems. One has to install tesseract-ocr and add the installation directory to the system path. We also need the following applications installed on the system:

• Cygwin: for Windows users (to run bash commands)
• Qt-box-editor: to edit the box file generated by Tesseract and adjust the box around each character.

Images with multiple copies of the same character are preferred, since the system will learn the different forms of the same character. Name the input image as per the Tesseract convention, i.e.,
<language_name>.<font_name>.exp<number>.<file_extension>
In our case it is kan.sample.exp0.png. Name multiple images of the same font by changing the number field.

The next step is to generate box files for all these image files.
for each image in the training data set
    tesseract Image_file Image_Name batch.nochop makebox
endfor

Now comes the manual work. Copy the image to the location of the generated box file and open the image with qt-box-editor.

Fig. 3.6 a) Input handwritten tiff image b) Sample few lines of content in box file

Adjust the co-ordinate values around each character in the image as shown in Fig 3.6 b) and give the respective label by trial and error. Each line of the box file has the form:
ಅ 21 25 30 5 (letter left bottom right top)
Now combine the box and image files into training files (.tr).

for each image in the training data set
    tesseract Image_file Image_Name box.train
endfor

Then we extract the character set from the box files:
unicharset_extractor kan.sample.exp0.box

Fig. 3.7 Sample lines inside unicharset file

Do this for all box files, copy the contents (Fig 3.7) into any one unicharset file and delete the rest. Next create the font properties file and store "kan 0 0 1 0 0" in font_properties. The font properties file tells Tesseract about the nature of the font. Always train images with the same font together for better accuracy.

Syntax: font_name italic bold monospace serif fraktur

Now the actual training:
mftraining -F font_properties -U unicharset -O kan.unicharset training_filename
cntraining training_filename

Do this for all the .tr files, copy the content into any one of the inttemp, normproto, pffmtable and shapetable files respectively, and delete the rest. Rename the generated inttemp, normproto, pffmtable and shapetable files with kan. as prefix, as shown in Table I. Move the content of each file to its kan.-prefixed file.
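The command sequence of this section can be collected in a small Python helper, in line with the remark that the bash commands can be run from Python. This is a sketch under assumptions, not the authors' script: it only builds the argument lists and does not run them, it assumes the legacy Tesseract 3.x training tools with the `-F`, `-U` and `-O` options of `mftraining`, it passes all .box and .tr files to a single call instead of merging the outputs by hand, and the function name `training_commands` is hypothetical.

```python
def training_commands(lang="kan", font="sample", pages=2, ext="tif"):
    """Build the legacy Tesseract training commands as argument lists."""
    bases = [f"{lang}.{font}.exp{i}" for i in range(pages)]
    # Step 1: one box file per image (to be corrected by hand afterwards).
    makebox = [["tesseract", f"{b}.{ext}", b, "batch.nochop", "makebox"]
               for b in bases]
    # Step 2: one .tr training file per image/box pair.
    boxtrain = [["tesseract", f"{b}.{ext}", b, "box.train"] for b in bases]
    boxes = [f"{b}.box" for b in bases]
    trs = [f"{b}.tr" for b in bases]
    # Step 3: character set extraction and the two training passes.
    finish = [
        ["unicharset_extractor", *boxes],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", f"{lang}.unicharset", *trs],
        ["cntraining", *trs],
    ]
    return makebox, boxtrain, finish


makebox, boxtrain, finish = training_commands()
for cmd in makebox + boxtrain + finish:
    print(" ".join(cmd))
```

Each argument list could be handed to `subprocess.run` once the box files have been corrected in the box editor; keeping the commands as data makes it easy to review them before anything is executed.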
TABLE I. RENAMING FILES
File names      Modified filenames
unicharset      kan.unicharset
inttemp         kan.inttemp
pffmtable       kan.pffmtable
normproto       kan.normproto
shapetable      kan.shapetable

Combine all the files with kan. as prefix into kan.traineddata:

combine_tessdata kan.

Now the trained file is ready. Copy the output trained file to

$TESSERACT_INSTALLATION_DIR/tessdata/

It can be used with any Kannada recognition application just by setting the language argument to kan. We can embed all these commands in Python to run the above bash commands together, but only after the box files have been created.

D. Training using CNN techniques

This method follows an image classifier using a Convolution Neural Network. We implemented this using Python and Tensorflow. The basic idea is to classify each image of each character of Kannada script while training and to assign it a label. When a new image is given, the model should tell precisely what character it is; the idea is then applied to identify whole words and lines of Kannada script.

Divide the data set in the following ratio. If 100 images are available for each letter of the Kannada script, split them in an 80:20 ratio: 80 images as training data and 20 images as test data. Store these images in separate folders testdata and traindata without changing the image names, because after comparison we identify the different images by their labels. The first step in the training phase is to label each character. We followed the one-hot label encoding technique to accomplish this. In this method we create a 1*n dimensional array, where n denotes the number of labels. There are nearly 800 combinations generated. But here we have taken only 5 letters for demonstration and hence we have used a 1*5 array. The array is ordered: each index denotes a unique letter. If the input image label matches, the corresponding index value is changed from 0 to 1.

encodeList = [0] * number of letters in Kannada
One-hot_label(image):
    set label by image filename
    if label:
        set encodeList[label] = 1
    endif

The function test_data_with_label converts the image data into a numpy array of size 36*36. The original image can be of any size, but this function resizes it to 36*36. In general, the image can be of any n*n size, depending on the accuracy; for example, one can set it to 16*16 or 128*128. During the training phase, for our experiment we got the best results with a 36*36 image size. We convert the image into a pixel array and encode it with the one-hot label. Train-data is the path where all images are stored for training.

Train data with label():
    Initialize list
    shuffle images in dataset
    for each image in the path
        Convert image to greyscale
        Resize image (usually 28*28; based on accuracy one can do trial and error)
        Encode image with one-hot label and convert image to np array
        Append to list
    endfor

In the above algorithm, the first step is to shuffle the image data set. The reason for shuffling is to improve the randomness of the data. The next step is to build the training model. For model creation we used Keras, a library available for CNN model creation. We need to import the sequential model, layers and optimizers from Keras. Then load the data into a variable by calling the train-data-with-label function.

In our experiment, we used 3 convolution layers with single stride, zero padding and relu activation. In each layer, the filter count is increased to ensure that the initial layer represents the high level features of the image and the inner ones represent the more detailed features, and so they have a greater number of features.

Training Model():
    Model: Sequential
    layer one:
        filters=32, kernelsize=5, strides=1, padding=same, activation='relu'
    layer two:
        filters=50, kernelsize=5, strides=1, padding=same, activation='relu'
    layer three:
        filters=80, kernelsize=5, strides=1, padding=same, activation='relu'
    Maxpool: poolsize=5
    Dropout=0.25
    flatten and dense layers
    softmax Activation

After three convolution layers, we used one dropout layer to avoid the overfitting problem. Once the image has passed through the convolution layers, it has to be flattened and fed into a fully connected dense layer. In the proposed work, 2 dense layers are used. The first dense layer has 512 neurons and relu activation. The neuron count can be arbitrary for this layer; for our experiment, we identified 512 neurons as the best fit. The second dense layer has a number of neurons depending on the number of letters in Kannada script. In our experiment we used 5 letters.

This layer uses softmax activation, which calculates the probability of each target letter over all possible letters in Kannada script. Based on the highest probability value, the softmax model decides which letter it is.
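The labelling and decision steps of this section can be illustrated without any deep learning library. The sketch below is an illustration under stated assumptions rather than the experiment's code: the five class names are the transliterated folder names used for demonstration, the score vector stands in for the output of the last dense layer, and the helper names `one_hot`, `softmax` and `predict_letter` are hypothetical.

```python
import math

# Five demonstration classes, named by pronunciation as in the folder layout.
LETTERS = ["a", "aa", "u", "ee", "ya"]


def one_hot(label, classes=LETTERS):
    """Return a list of zeros with a single 1 at the index of the label."""
    vec = [0] * len(classes)
    vec[classes.index(label)] = 1
    return vec


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(scores)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_letter(scores, classes=LETTERS):
    """Pick the class with the highest softmax probability."""
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return classes[best], probs[best]


label_vector = one_hot("u")
letter, confidence = predict_letter([0.1, 0.2, 3.0, 0.0, -1.0])
```

With the full character set only `LETTERS` would grow; the length of the label array and of the final dense layer both follow the number of classes.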
Since any word in an image is made by a combination of letters, one has to first extract the script in the image, then break it into images of the characters in it, order them as they appear in the actual image, and then feed them to the deep learning model, which recognizes each letter. This text extraction can be done with the help of thresholding and contours, which are available in Python Image Library (PIL).

We followed Otsu thresholding to enhance the text area in the image, followed by contours. Below is the pseudocode for obtaining the image covered by each contour. In the proposed experiment, we ignored images having width and height less than 5 and images having height more than 200, on the assumption that no one writes Kannada script with such big or small handwriting; these limits may be changed as per the user's need.

Initialize a list
for each contour in contours:
    get rectangle bounding contour
    store bounding (x, y, w, h) box in 2D list
sort the list based on values of x and y
for each box in list:
    if height is greater than 200 then
        continue
    endif
    if width and height are greater than 5 then
        crop contour from image
        save image
    endif
endfor

Instead of directly cropping the image, one has to store the x, y, w, h values in an array and sort it based on the values of x and y so that the order of the letters is maintained. Store the cropped images with ascending numbers as image names, for example 1.png, 2.png and so on. Now feed these images to the recognition function of the deep learning model. Testing images are passed to the deep learning model in the same way as training images. Since these images have numbers as labels, they are encoded as all zeros by the one-hot label function.

for image in testing_images:
    reshape image
    model_out = model.predict([image])
    if max index of model_out then
        display respective letter indicated by index
    endif
endfor

At each iteration, if the difference between the x values of the current and next image is greater than 5, print a space between the letters, and if the difference between the y values of the current and next image is greater than 25, print a newline. One can store the data as a text file for further usage.

IV. RESULTS AND DISCUSSION

Although both methods give more than 80 percent accuracy, the performance mainly depends on the training data. A good number of data sets covering most handwriting styles is expected to result in 99 percent accuracy. In the first method there is no function available for finding accuracy. One can run some input images and check the result manually.

print(image_to_string(Image.open(path), lang='kan'))

Below are the sample input (Fig 4.1a) and output (Fig 4.1b) for the Tesseract model trained on handwritten Kannada script. Only the last letter is mismatched (it should have been Kannada vowel ‘ǀ’ instead of ‘o’). But we can eliminate it by training with more varied input data.

Fig. 4.1 a) Input image b) Output from python IDE for Tesseract model

In the deep learning technique we used the Adam optimizer with categorical_crossentropy as the loss function and a learning rate of 0.001. We train the model in epochs over the image sets of each character. In every epoch the model adjusts its parameter values to minimize the loss. Fig 4.2 is a snapshot from the starting stage of training, when the classes are few; as the training process continues, accuracy drops slowly.

Fig. 4.2 Output for last three epochs

The sample output received from this model is shown in Fig 4.3. It was 100 percent accurate since the input is one small word, but the result might change a little when taking a whole paragraph. The accuracy and performance of the model increase with a good training data set.

Fig. 4.3 a) Input image b) Output from python IDE for Deep learning model

The results and comparison between the two proposed methods are shown in Fig 4.4. As one can see from the graph, the accuracy is higher when we train the model only for vowels and consonants. The accuracy drops when we train the model for the whole language, since the model has to identify the letter from a greater number of classes and classify them
accordingly. With mixed-font handwriting, the Tesseract tool has difficulty identifying the letters, whereas the CNN method can classify them more accurately.

[Bar chart: accuracy (y-axis, 72% to 88%) of the Tesseract tool and the CNN for vowels, consonants, whole language and mixed fonts]
Fig. 4.4 Comparison between accuracy of two proposed methods

One more analysis is based on the number of images for each letter and the number of classes used for training the CNN model. Accuracy increases as we add more training images for each letter, since the model can more easily differentiate among the letters. But as we add more classes (letters) to identify, it drops slightly. Fig 4.5 shows this analysis.

[Chart: accuracy (y-axis, 74% to 88%) against the number of images for each letter (x-axis, 50 to 600) and the set size (number of classes)]
Fig. 4.5 Analysis as the number of images for each letter and classes grow in size

Both methods have their own drawbacks. In the Tesseract method, the manual work of drawing the box around each character and labelling it accordingly is a difficult task. On top of that, Tesseract needs a black and white image of a particular resolution (300 dpi). It fails to recognize text if the constraints on the image are not met. There are about 800 different forms of letters, and even if we consider 1000 images for each letter, no one can sit and edit each letter. But accuracy is high here, since Tesseract is designed for Optical Character Recognition.

The CNN approach, in contrast, makes the training easy. The problem here is that as we train for all possible letters in Kannada, the training phase becomes slow. But it is still better compared to previous techniques [6] and [7]. There is no restriction on images. Images can be of any color; the model will convert them into the form it needs. As for the time consumed in training, it is the machine that does the work, not a human. Once the training is done, it becomes simple, and one can easily modify the details of the model as per their need.

V. FUTURE WORK

Kannada handwritten script recognition can be used in many applications, including the task of reading Kannada literature. It can be embedded in other projects, for example to help blind people read Kannada boards and signs while crossing the road, or in any type of navigation which involves Kannada script. It makes the job easy if existing handwritten Kannada documents are automatically recognized and stored in a digital file. This might be a basic step for any Artificial Intelligence related research work on the Kannada language, because once Kannada character recognition is carried out, it is easy to move forward.

Although we have discussed the Kannada language here, these handwriting recognition methods can be used for any language, not only for Kannada script. The challenges will be greater when the language scripts are complex.

REFERENCES

[1] Mamata Nayak, Ajit Kumar Nayak, "Characters Recognition by Training Tesseract OCR Engine", International Journal of Computer Applications (0975 – 8887), Bhubaneswar, India, 2014
[2] Alireza Alaei, P. Nagabhushan, Umapada Pal, "A Benchmark Kannada Handwritten Document Dataset and its Segmentation", International Conference on Document Analysis and Recognition, 2011
[3] G. G. Rajput, Rajeswari Horakeri, Sidramappa Chandrakant, "Printed and Handwritten Mixed Kannada Numerals Recognition Using SVM", International Journal on Computer Science and Engineering, Vol. 02, No. 05, 2010, 1622-1626
[4] K. Indira, S. Sethu Selvi, Assistant Professor, "Kannada Character Recognition System: A Review", M. S. Ramaiah Institute of Technology, Bangalore
[5] Yusuf Perwej, Ashish Chaturvedi, "Neural Networks for Handwritten English Alphabet Recognition", International Journal of Computer Applications (0975 – 8887), Volume 20, No. 7, April 2011
[6] Batuhan Balci, Dan Saadati, Dan Shiferaw, "Handwritten Text Recognition using Deep Learning", Stanford.
[7] Netravati Belagali, Shanmukhappa A. Angadi, "OCR for Handwritten Kannada Language Script", International Journal of Recent Trends in Engineering & Research (IJRTER), Volume 02, Issue 08, August 2016 [ISSN: 2455-1457]
[8] Rituraj Kunwar, Mohan P., Shashikiran K, A. G. Ramakrishnan, "Mining Online Product Reviews and Extracting Product features using Unsupervised method", IISc, Bangalore.
[9] http://pretius.com/how-to-prepare-training-files-for-tesseract-ocr-and-improve-characters-recognition/
[10] https://blog.francium.tech/build-your-own-image-classifier-with-tensorflow-and-keras-dc147a15e38e
[11] https://en.wikipedia.org/wiki/Kannada
[12] https://www.coursera.org/learn/convolutional-neural-networks
