
Kunte 2007

This paper presents a bilingual Optical Character Recognition (OCR) system designed for printed Kannada and English text, achieving a character recognition rate of 90.5%. The system utilizes Gabor filters and wavelet features for script identification and character classification, employing multilayer feedforward neural networks. The proposed methodology includes a two-stage classification process to effectively recognize a large set of Kannada characters alongside English text.


10th International Conference on Information Technology

A Bilingual Machine-Interface OCR for Printed Kannada and English Text Employing Wavelet Features

R Sanjeev Kunte
J S S Research Foundation, S J C E Campus, Mysore - 570006, Karnataka, India
[email protected]

R D Sudhaker Samuel
Department of Electronics, S J College of Engineering, Mysore - 570006, Karnataka, India
[email protected]

0-7695-3068-0/07 $25.00 © 2007 IEEE
DOI 10.1109/ICIT.2007.12

Abstract

Optical Character Recognition (OCR) is one of the important research areas in the field of human-machine interfaces. This paper presents a bilingual OCR system for printed Kannada and English text. Gabor filter based features are used for separating the Kannada and English words in a bilingual document. Wavelets, which have been progressively used in pattern recognition, are used in the system to extract the features for classifying both the Kannada and English characters. Multilayer feedforward neural classifiers, known for their good generalization and approximation properties, are used in the system for classification. An overall recognition rate of 90.5% is obtained at character level.

1. Introduction

Optical Character Recognition (OCR) is one of the oldest subfields of pattern recognition, with a rich contribution to the recognition of printed documents. In a country like India, where many languages and scripts exist, it is common for a single document page, such as a railway reservation form or a bank challan, to contain words from more than one script. In this context a multi-script computer-interface OCR is a present-day requirement in India.

Two types of approaches are followed in the development of a multi-script OCR. In the first approach, script identification is done at word level and this information is used to invoke the corresponding OCR developed for that particular script. This method reduces the search space in the database, but it incurs the cost of the script recognition task. In the second, combined-database approach, characters from all the participating scripts are treated identically irrespective of their scripts. In this method, however, the search space in the database increases, as it contains alphabets from all the participating scripts.

In the Indian OCR context, most of the work on OCR has been carried out for Devanagari, Bangla and Telugu scripts [1, 2, 3], while not many works are reported for the Kannada language. A modified region decomposition method and optimal depth decision tree for the recognition of printed Kannada characters is reported in [4]. A subspace projection approach using Radial Basis Function networks for the recognition of basic Kannada characters is described in [5]. In the Kannada OCR system reported in [6], structural features extracted from the zones of the character image are used for classification with a Support Vector Machine.

Gabor filters are widely used in computer vision, pattern recognition and image processing applications due to their optimal localization properties in both the spatial and frequency domains [7]. The impulse response of a Gabor filter is created by multiplying a Gaussian envelope function with a complex oscillation. Gabor showed that these elementary functions minimize the space (time)-uncertainty product. By extending these functions to two dimensions, it is possible to create filters which are selective for orientation.

In recent years, wavelet analysis has been successfully applied in the field of pattern recognition. Wavelet descriptors of a character contour, as a set, can be used to replace the combination of many conventional features used in character recognition systems. They serve as excellent features for character recognition, are reasonably insensitive to intra-class variation and exhibit good inter-class separation [8].

The present work proposes a bilingual OCR system to recognize the complete set of printed Kannada and English characters. Gabor features are used for identifying the script at word level using a pre-trained neural classifier. Wavelet descriptors are used as features for recognition of both Kannada and English characters, employing neural networks for classification. The specific contributions of the work reported in this paper towards the advancement of Kannada OCR technology are:

• Formulation of a minimum set of individual classes of Kannada characters representing the entire Kannada alphabet in the most appropriate manner.

• Development of a two-stage classification methodology to classify the large set of 312 Kannada characters, containing 278 main characters and 34 subscripts, using neural classifiers of fairly simple architecture with a good recognition rate.

Current literature reveals that the above techniques are being reported for the first time in this paper for OCR of the Kannada language. Further, to the best of our knowledge, no work has been done towards simultaneous recognition of Kannada and English script.

The rest of the paper is organized as follows. Section 2 gives a brief description of the Kannada language. A summary of the proposed recognition scheme is given in Section 3. Sections 4 and 5 provide a brief overview of Gabor filters and wavelet descriptors. The procedure developed for reducing the Kannada character set and the details of the classification schemes are described in Sections 6 and 7. Results and conclusions are discussed in Sections 8 and 9.

2. Kannada script

The Kannada language has 16 vowels and 34 consonants as the basic alphabet of the language. Each vowel has a vowel sign (modifier) and each consonant has a basic form (primitive). The basic form of a consonant can combine with a vowel sign to form another set of 16 Consonant-Vowel (CV) composite characters. Such an example is shown in Figure 1 for the first consonant. In Kannada, each consonant has a short form, which is written as a subscript below other characters.

Figure 1. Consonant-vowel composite characters of first consonant.

To summarize, the Kannada alphabet has (i) 16 vowels, (ii) 34 consonants, each with a basic form and 16 consonant-vowel composite letters, giving rise to 578 (34 x 17) characters, and (iii) 34 subscripts, for a total of 628 (16 + 578 + 34) characters. The full set of the Kannada alphabet is given in [9].

3. System architecture

The block schematic of the bilingual OCR system developed is shown in Figure 2 and Figure 3. Paper-printed bi-script text documents are scanned with a flatbed scanner (300 dpi) to capture the digital image of the document. The image is first subjected to pre-processing to remove the background noise and then to slant correction to remove any skew, if present. After pre-processing, it is subjected to segmentation. First, the individual lines in the document are segmented using the horizontal projection profile, and then the words in each line are segmented using the vertical projection profile of the line. Each segmented word is normalized to a fixed block size. From each block of word image, Gabor features are extracted and applied to the pre-trained neural classifier to recognize the script of the word. Once the script of the word is identified, it is input to either the Kannada OCR block or the English OCR block, depending on the script.

Character segmentation from English words is done using the vertical projection profile. For a Kannada word, characters are extracted using a two-stage segmentation technique [10], in which each segmented word is first examined for the presence of subscript characters. If subscript characters are not present (or once all the subscripts have been extracted), the main characters of the word are segmented using the vertical projection profile in the second stage.

The character contour is extracted from the segmented character by classifying the pixels of the character into interior, noise and contour points [11], and is then normalized to make the character representation invariant to transformations such as size, shift and number of contour points [9]. For a given character (Kannada or English) contour, wavelet features are extracted by applying the Discrete Wavelet Transform to the contour points. The resulting wavelet descriptors serve as features, which the pre-trained neural classifiers use to recognize the character and generate the class information. The character class is mapped to the Kannada/English font and stored in a document file.

Figure 2. Word script recognition (Kannada / English) in bi-script document.

Figure 3. Kannada/English OCR system block schematic.

4. Gabor filters

A Gabor filter can be viewed as a sinusoidal plane wave of particular frequency and orientation, modulated by a Gaussian envelope. It is defined as,

\[ h(x, y) = s(x, y)\, g(x, y) \tag{1} \]

where s(x, y) is a complex sinusoid, known as the carrier, and g(x, y) is a 2-D Gaussian-shaped function, known as the envelope. The complex sinusoid is defined as,

\[ s(x, y) = e^{-j 2\pi (u_0 x + v_0 y)} \tag{2} \]


The 2-D Gaussian function is defined as,

\[ \frac{1}{2\pi \sigma_u \sigma_v} \exp\left\{ -\frac{1}{2} \left[ \frac{(u - u_0)^2}{\sigma_u^2} + \frac{(v - v_0)^2}{\sigma_v^2} \right] \right\} \tag{3} \]

Thus the 2-D Gabor filter can be written as,

\[ h(x, y) = \exp\left\{ -\frac{1}{2} \left[ \frac{x^2}{\sigma_x^2} + \frac{y^2}{\sigma_y^2} \right] \right\} e^{-j 2\pi (u_0 x + v_0 y)} \tag{4} \]

In the above equations (2, 3, 4), $(u_0, v_0)$ is referred to as the spatial central frequency of the Gabor filter. The parameters $(\sigma_x, \sigma_y)$ are the standard deviations of the Gaussian envelope along the X and Y directions, which determine the filter bandwidth.

4.1. Gabor filter design and feature extraction

In the proposed system, a multi-bank Gabor filter is used. Four different values of spatial frequency (f = 0.05, 0.1, 0.15, 0.20) and five different values of filter orientation (θ = 0°, 45°, 90°, 120°, 135°) are selected to give a total of 20 Gabor filters. Each block of word image is filtered using the 20 Gabor filters. From the output of each Gabor filter, the mean and standard deviation are computed, which serve as Gabor features. Thus, for each block of word image we get a feature vector F of 40 values given by,

\[ F = [\mu_1, \sigma_1, \mu_2, \sigma_2, \ldots, \mu_{20}, \sigma_{20}] \]

5. Wavelet descriptors

To carry out the wavelet analysis, a functional representation of the input pattern is needed. This is formed by the contour representation of the character, as it contains all the information necessary for character classification. A character contour can be represented as a closed parametric curve c in the complex plane C [12],

\[ c(t) = x(t) + j\, y(t), \qquad 0 \le t \le T \tag{5} \]

where (x(t), y(t)) are points in the character contour sequence. Since a closed curve can be retraced infinitely, the curve c can be assumed to be a periodic function with period T. Then, the wavelet transform (WT) of the curve c can be taken independently for each of the x and y components.

\[ WT[c(u)] = WT[x(u)] + j\, WT[y(u)] \tag{6} \]

Applying the wavelet transform to the character contour sequence $(x_m, y_m)$ (where m = 0, 1, ..., n-1, and n is the total number of points in the character contour) results in wavelet coefficients (approximation and detail coefficients), called wavelet descriptors, which serve as features for representing a Kannada character.

6. Representation of entire Kannada alphabet with a reduced set

The following procedure is adopted for reducing the Kannada character set for classification. Let us consider the set of 16 consonant-vowel composite characters of the first consonant, which is given in Figure 1.

• The last two characters are the combinations of the consonant and the symbols

• The six characters are the combinations of characters and the symbol

• The characters are the combinations of the subscripts with the main consonant characters

Hence, the 16 composite characters of the consonant can be identified by the set of 8 characters and the 6 symbols/subscripts.

All the 34 consonants in Kannada have a set of 16 consonant-vowel composite characters similar to that shown in Figure 1. For each consonant group of characters, the combinations remain the same, represented by the 8 individual classes of characters and 6 individual symbols/subscripts, totalling 312 [(34 x 8 + 6 = 278) + 34 subscripts] individual classes for recognition. This results in a large reduction in the number of classes to be identified by the classifiers.

7. Classification

Multilayer perceptrons are used as classifiers in the proposed work because of their universal approximation property and good generalization ability [13]. They are adequate for classification problems involving a small number of classes. However, in large-set character recognition problems, the conventional one-stage neural network classifier either does not converge or, even if it converges, takes a large amount of time [14]. Separate neural classifiers are developed (i) for separation of Kannada and English words in the bilingual document, (ii) for recognition of English characters and (iii) for recognition of Kannada characters.

7.1. Classifier for script recognition

The network architecture includes 40 nodes in the input layer (corresponding to the 40 Gabor features), 40 nodes in the hidden layer and 2 nodes in the output layer (corresponding to the two classes of scripts, i.e. Kannada and English). This network is referred to as the Script classifier.

7.2. Classifier for English character recognition

The network architecture includes 120/160/200 nodes in the input layer (corresponding to 120/160/200 wavelet features), 60/80/100 nodes in the hidden layer and 52 nodes in the output layer (corresponding to the 52 classes of English characters).

7.3. Two-stage classifier for Kannada character recognition

The proposed two-stage classification methodology to classify the 312 individual classes of Kannada characters is as follows:

The consonant-vowel composite characters of a consonant are all grouped together as a single class of character in the first instance of classification. Later, a character belonging to such a group is further classified within its group by a separate small network meant for classifying that consonant group of characters, in the second stage of classification.

Hence, in this classification methodology, a character is classified in either a single stage or two stages. The consonant-vowel composite characters require two stages of classification. All other main characters (i.e. vowels and symbols) and subscripts are classified in a single stage only (the first stage).

7.3.1. First stage of classification. At the first level of classification, there are two separate neural networks, one for main character classification and another for subscript classification. The input character is fed to either the main classifier or the subscript classifier depending upon whether it is a main character or a subscript character, which is determined at the character segmentation stage [10]. The subscript classifier is a small neural network designed to classify 34 classes of subscripts. The main classifier is a neural network designed to classify 14 vowels, 4 symbols and 34 consonant groups, for a total of 52 classes. The neural classifier used for classification of main characters in the first stage is referred to as the first stage main character classifier.

7.3.2. Second stage of classification. The second level of main character recognition is a group of 34 small-architecture neural classifiers, one for each consonant group. They are referred to as group classifiers. An appropriate group classifier is selected by the first stage main character classifier to further classify a consonant or its composite character within its group. Since a consonant group is a set of 8 characters, the number of outputs of any group classifier is only 8. The number of individual character classes that the first stage main character classifier should handle thus reduces to just 52 classes, as against 274 main characters. Each group classifier should handle just 8 characters, resulting in a set of simple-architecture neural networks sufficient for handling the complete character set of Kannada for recognition.

8. Experimental results and discussions

The bilingual document is scanned using a flatbed scanner. Lines and words are separated after pre-processing. Gabor features are extracted from the normalized word image and applied to the Script classifier to recognize the script of the word.

8.1. Script identification

To obtain the data set for training, about 250 different samples of Kannada and English words are taken from different Kannada newspapers and magazines. The script identification classifier was tested with about 350 samples of Kannada and English words, which were different from the training set. Kannada and English words were separated with an accuracy of about 99.3% and 98.6% respectively. Kannada words, which are mostly circular, showed nearly equal response to all the Gabor filter orientations, whereas English words, which have more vertical and horizontal strokes, showed more response to the 0° and 90° filter orientations. This discriminating feature is used to classify the two scripts, namely Kannada and English.

8.2. Kannada/English character recognition

The wavelet features of the character image are extracted by applying the Discrete Wavelet Transform (up to one level of decomposition using a Daubechies wavelet) to the pre-processed, normalized and resampled contour of the character. For each class of character, 3 sets of wavelet features are extracted, corresponding to 60, 80 and 100 points in the resampled character contour, to find the optimal number of points required to represent the character.

8.2.1. Kannada character recognition. The data set for training the neural network used for recognizing Kannada characters is formed from about 30 different samples of each class of Kannada character (with different fonts and font sizes varying from 14 to 18 points), totalling about 9000 characters. A test set is created with about 20 different samples (other than the training samples) of each class of Kannada character, totalling about 6000 characters, scanned from different Kannada newspapers and magazines. The architectures of the different classifiers used are as follows:

• The first stage main character classifier, used for group identification, consists of 120/160/200 nodes in the input layer (corresponding to 120/160/200 wavelet features), 60/80/100 nodes in the hidden layer and 52 nodes in the output layer (corresponding to 52 classes).

• The subscript classifier, used for subscript identification, consists of 120/160/200 nodes in the input layer, 60/80/100 nodes in the hidden layer and 34 nodes in the output layer (corresponding to 34 subscripts).

• The 34 group classifiers, used for identifying the character class within a particular consonant group, each consist of 120/160/200 nodes in the input layer, 60/80/100 nodes in the hidden layer and 8 nodes in the output layer (corresponding to 8 characters in a consonant group).

Main character samples are all put together and stored in a separate test data file to test the first stage main character classifier. The test samples corresponding to each consonant group are copied into separate test data files to test the group classifiers independently. All the trained classifiers of the recognition system are tested independently by presenting the test patterns from their individual test data files.

The first stage main character classifier could classify the composite characters grouped together for a consonant group as the same class of character, as well as differentiate between the samples of different consonant groups, vowels and symbols, with an effective maximum rate of about 92%. The recognition rate of the subscript classifier is about 94%. The average error rate of the 34 group classifiers is less than 5% when tested independently from their test data files.

Since the correct recognition of any complete character in Kannada involves correct identification of the group, and then the identification of the character within the group, the overall recognition rate at character level of the system is found to be about 91%, considering the first stage main character classifier, subscript classifier and group classifiers.

A comparison of recognition rates for different numbers of contour points against different rejection thresholds is plotted in Figure 4(a) and Figure 4(b) for the first stage main character classifier and the subscript classifier respectively. It can be seen that the improvement in recognition rate for the first stage main character classifier is very small for 100 contour points. Similarly, there is little improvement in recognition rate for 80 and 100 contour points for the subscript classifier. Hence, the optimal number of contour points for representing the Kannada characters is chosen as 60 and 80 for the cases of subscripts and main (non-subscript) characters respectively.

Figure 4. Recognition rate for (a) first stage main character classifier (b) subscript classifier.

8.2.2. English character recognition. The data set for training the neural network used for recognizing English characters is formed from about 100 different samples of each class of English character (with different fonts and font sizes varying from 10 to 16 points), totalling about 5200 characters. A test set is created with about 200 different samples (other than the training samples) of each class of English character, totalling about 10400 characters. A maximum recognition rate of 98.2% is obtained. The recognition rate is higher than that of Kannada characters, as the number of classes and the number of similarly shaped characters are smaller in the English script.

Considering the Script classifier, the Kannada character classifier and the English classifier, an overall recognition rate of 90.5% is obtained at character level.

9. Conclusion

In this paper, a bilingual OCR system for the complete set of printed Kannada and English characters employing wavelet descriptors is presented. Gabor features are used to recognize the script at word level. A two-stage multi-network scheme for classification of a large set of characters, and a realization of this scheme for printed Kannada characters, is proposed. A procedure to reduce the large set of Kannada characters for classification is also developed. Even though the recognition rates of the Script classifier and the English character classifier are high, combining all the classifiers gives a recognition rate of 90.5% at character level, due to the lower recognition rate of the Kannada character classifier. Further research is under way to improve the recognition rate for Kannada characters by adding structural features. The proposed system was implemented on a PIV, 3.2 GHz computer system using VC++ under the Windows environment.

Acknowledgment

This work was supported in part by research grants from the University Grants Commission (UGC), New Delhi, under the Major Research Project scheme, F. No. 32 - 113/2006(SR).

References

[1] B.B. Chaudhuri and U. Pal, An OCR system to read two Indian language scripts: Bangla and Devanagari, Proc. of Fourth International Conference on Document Analysis and Recognition, 1997, 1011–1015.

[2] Atul Negi, Chakravarthy Bhagavathi and Krishna B, An OCR system for Telugu, Proc. of Sixth International Conference on Document Analysis and Recognition, 2001, 1110–1114.

[3] Arun K. Pujari, C. Dhanunjaya Naidu, M. Sreenivasa Rao, B. C. Jinaga, An intelligent character recognizer for Telugu scripts using multiresolution analysis and associative memory, Image Vision Comput, Vol. 22, No. 14, 2004, 1221–1227.

[4] P. Nagabhushan and Radhika. M. Pai, Modified region decomposition method and optimal depth decision tree in the recognition of non-uniform sized characters – An experimentation with Kannada characters, Pattern Recognition Letters, Vol. 37, 1999, 1467–1475.

[5] B. Vijay Kumar and A.G. Ramakrishnan, Radial basis function and subspace approach for printed Kannada text recognition, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing, Vol. 5, 2004, 321–324.

[6] T. V. Ashwin and P. S. Sastry, A font and size–independent OCR system for printed Kannada documents using support vector machines, Sadhana, Vol. 27, No. 1, 2002, 35–58.

[7] Y. Zhu, T. Tan and Y. Wang, Font Recognition Based on Global Texture Analysis, IEEE Trans. PAMI, Vol. 23, No. 10, 2001, 1192–1200.

[8] I. Daubechies, Ten Lectures on Wavelets, CBMS-Conference Lecture notes, 71, SIAM Philadelphia, 1992.

[9] R. Sanjeev Kunte and Sudhaker Samuel R.D, Wavelet Descriptors for Recognition of Basic Symbols in Printed Kannada Text, International Journal of Wavelets, Multiresolution and Information Processing, Vol. 5, No. 2, 351–367.

[10] R. Sanjeev Kunte and Sudhaker Samuel R.D, A Two–stage Character Segmentation Technique for Printed Kannada Text, International Journal on Graphics, Vision and Image Processing, Vol. 6, 2006, 1–8.

[11] Chungnan Lee and Bohom Wu, A Chinese character stroke extraction algorithm based on contour information, Pattern Recognition, Vol. 31, No. 6, 1998, 651–663.

[12] P. Wunch and F. L. Andrew, Wavelet descriptors for Multiresolution recognition of handprinted characters, Pattern Recognition, Vol. 28, No. 8, 1995, 1237–1249.

[13] H. White, Artificial neural networks: Approximation and learning theory, Blackwell, Oxford, 1992.

[14] Hee-Heon Song and Seong-Whan Lee, A self-organizing neural tree for large-set pattern classification, IEEE Trans. on Neural networks, Vol. 9, No. 7, 1998, 369–380.
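
Section 3 describes segmenting lines with a horizontal projection profile and words with a vertical projection profile of each line. The following is a minimal Python sketch of that idea, not the authors' VC++ implementation. It assumes a binarized page stored as a list of rows of 0/1 values; the function names and the `word_gap` threshold are hypothetical and do not come from the paper.

```python
def projection_profile(image, axis):
    """Count foreground pixels per row (axis='h') or per column (axis='v')."""
    if axis == "h":
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def runs_of_ink(profile, min_gap=1):
    """Return (start, end) pairs, end exclusive, of non-empty stretches of the
    profile that are separated by at least min_gap empty positions."""
    segments = []
    start = None
    gap = 0
    for i, value in enumerate(profile):
        if value > 0:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:
                # Close the segment just before the run of empty positions.
                segments.append((start, i - gap + 1))
                start = None
                gap = 0
    if start is not None:
        segments.append((start, len(profile) - gap))
    return segments


def segment_lines_and_words(image, word_gap=2):
    """Split a page into lines (horizontal profile), then each line into
    words (vertical profile). word_gap is an assumed threshold in pixels."""
    words = []
    for top, bottom in runs_of_ink(projection_profile(image, "h")):
        line = image[top:bottom]
        for left, right in runs_of_ink(projection_profile(line, "v"), word_gap):
            words.append((top, bottom, left, right))
    return words


# Toy page: two text lines, the first containing two words.
page = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 1, 1, 1, 0],
    [1, 0, 1, 0, 0, 1, 0, 1, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0, 0, 0, 0],
]
print(segment_lines_and_words(page))
# [(1, 3, 0, 3), (1, 3, 5, 8), (4, 5, 1, 4)]
```

A gap threshold larger than one column keeps the narrow spaces between characters from splitting a word. A real page would need a threshold tuned to the scan resolution and font size.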

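
Section 4.1 builds a bank of 20 Gabor filters (4 frequencies by 5 orientations) and takes the mean and standard deviation of each filter output to form a 40-value feature vector. The sketch below illustrates that computation in plain Python. The frequencies and orientations are the ones listed in Section 4.1. Everything else is an assumption made for illustration: the mapping of (f, θ) to the centre frequency (u0, v0), the kernel size, an isotropic σ, zero padding at the borders, and the use of the response magnitude before taking statistics.

```python
import cmath
import math

FREQUENCIES = [0.05, 0.1, 0.15, 0.20]      # from Section 4.1
ORIENTATIONS_DEG = [0, 45, 90, 120, 135]   # from Section 4.1


def gabor_kernel(freq, theta_deg, sigma=3.0, half=4):
    """Complex kernel in the form of equation (4), with sigma_x = sigma_y = sigma.
    sigma and half (kernel radius) are assumed values, not from the paper."""
    theta = math.radians(theta_deg)
    # Assumed mapping of frequency and orientation to the centre frequency.
    u0, v0 = freq * math.cos(theta), freq * math.sin(theta)
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            envelope = math.exp(-0.5 * (x * x + y * y) / (sigma * sigma))
            carrier = cmath.exp(-2j * math.pi * (u0 * x + v0 * y))
            row.append(envelope * carrier)
        kernel.append(row)
    return kernel


def filter_magnitude(image, kernel):
    """2-D convolution with zero padding; returns response magnitudes as a flat list."""
    rows, cols = len(image), len(image[0])
    half = len(kernel) // 2
    out = []
    for r in range(rows):
        for c in range(cols):
            acc = 0j
            for dy in range(-half, half + 1):
                rr = r - dy
                if not 0 <= rr < rows:
                    continue
                for dx in range(-half, half + 1):
                    cc = c - dx
                    if 0 <= cc < cols:
                        acc += image[rr][cc] * kernel[dy + half][dx + half]
            out.append(abs(acc))
    return out


def gabor_features(image):
    """Mean and standard deviation of each of the 20 filter outputs (40 values)."""
    features = []
    for freq in FREQUENCIES:
        for theta in ORIENTATIONS_DEG:
            response = filter_magnitude(image, gabor_kernel(freq, theta))
            mean = sum(response) / len(response)
            variance = sum((v - mean) ** 2 for v in response) / len(response)
            features.extend([mean, math.sqrt(variance)])
    return features


# Toy 12x12 "word block" made of three vertical strokes.
block = [[1.0 if c in (3, 4, 8) else 0.0 for c in range(12)] for _ in range(12)]
F = gabor_features(block)
print(len(F))  # 40
```

The resulting vector has the same length as the 40-node input layer of the Script classifier in Section 7.1. In practice a word block would first be normalized to the fixed block size mentioned in Section 3.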

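
Sections 5 and 8.2 describe wavelet descriptors obtained from a one-level Discrete Wavelet Transform of the x and y coordinate sequences of a closed, resampled contour. The sketch below illustrates this with a periodic one-level transform. The paper says only "Daubechies wavelet", so the 4-tap filter used here is an assumption, as are the function names and the order in which the coefficients are concatenated.

```python
import math

# Daubechies 4-tap low-pass filter (assumed order; the paper does not state it).
_S3 = math.sqrt(3.0)
_NORM = 4.0 * math.sqrt(2.0)
LOW = [(1 + _S3) / _NORM, (3 + _S3) / _NORM, (3 - _S3) / _NORM, (1 - _S3) / _NORM]
# Matching high-pass filter: g[k] = (-1)^k * h[3 - k].
HIGH = [((-1) ** k) * LOW[3 - k] for k in range(4)]


def dwt_periodic(signal):
    """One-level DWT of an even-length sequence, treating it as periodic
    (a closed contour can be retraced, as noted in Section 5)."""
    n = len(signal)
    if n % 2 != 0 or n < 4:
        raise ValueError("expected an even number of at least 4 samples")
    approx, detail = [], []
    for i in range(n // 2):
        window = [signal[(2 * i + k) % n] for k in range(4)]
        approx.append(sum(h * s for h, s in zip(LOW, window)))
        detail.append(sum(g * s for g, s in zip(HIGH, window)))
    return approx, detail


def wavelet_descriptors(contour):
    """Transform x and y independently, as in equation (6), and concatenate
    approximation and detail coefficients into one feature vector."""
    xs = [p[0] for p in contour]
    ys = [p[1] for p in contour]
    approx_x, detail_x = dwt_periodic(xs)
    approx_y, detail_y = dwt_periodic(ys)
    return approx_x + detail_x + approx_y + detail_y


# Toy contour: 60 points on a unit circle, standing in for a resampled character.
contour = [(math.cos(2 * math.pi * m / 60), math.sin(2 * math.pi * m / 60))
           for m in range(60)]
features = wavelet_descriptors(contour)
print(len(features))  # 120
```

With this layout a contour of n points yields 2n coefficients, which is consistent with the 120/160/200 input nodes reported for 60/80/100 contour points, although the paper does not spell out how its feature vector is assembled.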