Dhandra 2007
Abstract
This paper addresses script identification in handwritten document images, which
supports applications such as sorting, transcription of multilingual documents and
indexing of large collections of such images, and serves as a precursor to optical
character recognition (OCR). The proposed script identification scheme has two phases.
The first phase identifies the script of text words using global and local features,
extracted by morphological filters and regional descriptors, for three major Indian
languages/scripts: Kannada, Roman and Devnagari. The second phase identifies the script
of Kannada and Roman handwritten numerals. A K nearest neighbour algorithm is used to
classify text words and numerals. The proposed algorithm achieves maximum average
recognition accuracies of 96.05% and 99% for text words and numerals, respectively,
under five-fold cross validation. The data set contains 3000 text words and 400 numerals
collected from 250 writers. The proposed algorithm is robust to noise, writer style,
size, ink, etc.
1. Introduction
For the development of a successful multilingual OCR system, script identification is
essential before running an individual OCR system. Most published work on automatic
script identification of Indian scripts deals with printed documents, and very few
articles deal with the handwritten script identification problem. This has motivated us
to consider handwritten script identification at word level for three major Indian
scripts: English (Roman), Kannada and Devnagari.
In the literature, a number of approaches can be found for determining the
script/language of printed and handwritten documents. They can typically be classified
into four categories: (a) connected components analysis (local features based analysis),
(b) characters, words and text lines analysis, (c) text blocks analysis (global features
based analysis), and (d) hybrid information of connected components, text lines, etc.
Most of the text block and word level script separation work reported in
[1, 2, 6, 9, 10, 11, 13] uses directional energy features of Gabor filters.
Discrimination of scripts at text line and word level based on shape, conventional,
stroke and water reservoir features can be found in [5, 7, 8, 16]. Connected components,
clusters and projection profile features are used in [2, 3, 4, 12, 14, 17] for script
separation. In this paper an attempt is made to demonstrate the potential of hybridized
features (global + local) for script identification at word level.
In Section 2, a brief overview of data collection and pre-processing is presented. In
Section 3, segmentation, feature extraction and feature computation are discussed. The
experimental details and results obtained are presented in Section 4. The conclusion is
given in Section 5.
To extract the strokes in the vertical and horizontal directions, we perform the
opening operation on the input binary text word/numeral image with a line structuring
element. Its length is computed using equation (1) with k = 0.5.
$$\mathrm{Length\_strel} = k \cdot \mathrm{Mean}(\mathrm{Connected\_components\_height}) \tag{1}$$
where, k varies from 0.5 to 0.8.
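The directional opening described above can be sketched as follows. This is a minimal illustration in pure Python, not the authors' implementation: the function names, the use of 8-connectivity for the connected components, and the rounding of the element length to at least one pixel are all assumptions. It relies on the fact that opening a binary image with a vertical line of a given length keeps exactly those on-pixels that lie in a vertical run at least that long.

```python
from collections import deque

def component_heights(img):
    """Heights (in rows) of the 8-connected components of a binary image.
    8-connectivity is an assumption; the text does not specify it."""
    rows, cols = len(img), len(img[0])
    seen = [[False] * cols for _ in range(rows)]
    heights = []
    for r in range(rows):
        for c in range(cols):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue = deque([(r, c)])
                top, bottom = r, r
                while queue:
                    y, x = queue.popleft()
                    top, bottom = min(top, y), max(bottom, y)
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < rows and 0 <= nx < cols
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                heights.append(bottom - top + 1)
    return heights

def strel_length(img, k=0.5):
    """Equation (1): k times the mean connected-component height.
    Rounding to an integer of at least 1 pixel is an assumption."""
    heights = component_heights(img)
    if not heights:
        return 1
    return max(1, round(k * sum(heights) / len(heights)))

def open_vertical(img, length):
    """Opening with a vertical line element: keep only the pixels that
    belong to a vertical run of at least `length` on-pixels."""
    rows, cols = len(img), len(img[0])
    out = [[0] * cols for _ in range(rows)]
    for c in range(cols):
        r = 0
        while r < rows:
            if img[r][c]:
                start = r
                while r < rows and img[r][c]:
                    r += 1
                if r - start >= length:
                    for y in range(start, r):
                        out[y][c] = 1
            else:
                r += 1
    return out

# Toy image: one tall vertical stroke and one short horizontal fragment.
img = [
    [1, 0, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
vertical_strokes = open_vertical(img, strel_length(img, k=0.8))
```

Under the same assumptions, horizontal strokes can be obtained by applying open_vertical to the transposed image and transposing the result back.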
Stroke density: The stroke length is defined as the number of pixels in a stroke [15].
The stroke density is defined as the total length of all the strokes divided by the size
of the image. Throughout the discussion of Section 3.1, N refers to the number of on
pixels in an image. The values of the 10 extracted features are real numbers.
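As a small illustration of the stroke density definition, the sketch below counts the on-pixels of a stroke image and divides by the image size. It assumes that the stroke image is the result of a directional opening and that "size of the image" means its total pixel count (rows times columns); the function name is hypothetical.

```python
def stroke_density(stroke_img):
    """Total stroke length (number of on-pixels in the stroke image)
    divided by the image size, taken here as rows * cols (an assumption)."""
    rows, cols = len(stroke_img), len(stroke_img[0])
    total_length = sum(1 for row in stroke_img for pixel in row if pixel)
    return total_length / (rows * cols)

# A 4 x 4 stroke image containing a single 4-pixel vertical stroke.
strokes = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]
density = stroke_density(strokes)
```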
1. Vertical Stroke Density (VSD):
$$\mathrm{PDH}(\mathrm{pattern}) = \frac{\sum_{i=1}^{N} \mathrm{Onpixel}(\mathrm{pattern}_i)}{\mathrm{Size}(\mathrm{pattern}_i)} \tag{4}$$
The aspect ratio is a very important feature for word-wise script identification [15].
9. Eccentricity: It is defined as the length of major axis divided by the length of the
minor axis of a connected component of an image [19].
$$\text{Average Eccentricity} = \frac{\sum_{i=1}^{N} \mathrm{eccentricity}_i}{N} \tag{6}$$
10. Extent: It is a real-valued function, defined as the proportion of the pixels in the
bounding box that are also in the region.
$$\text{Average Extent} = \frac{\sum_{i=1}^{N} \mathrm{extent}_i}{N} \tag{7}$$
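The averages in equations (6) and (7) can be sketched as below, assuming the index i runs over the connected components of the word image and that each component is given as a list of (row, column) pixel coordinates. The way the axis lengths are obtained is an assumption: they are taken from the second-order central moments of the component, with 1/12 added to each variance so that every pixel is treated as a unit square (this keeps the minor axis non-zero for one-pixel-wide strokes). The function names are hypothetical.

```python
import math

def extent(pixels):
    """Proportion of the bounding-box pixels that belong to the region."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    box_area = (max(ys) - min(ys) + 1) * (max(xs) - min(xs) + 1)
    return len(pixels) / box_area

def eccentricity(pixels):
    """Major-axis length divided by minor-axis length, with the axes
    estimated from second-order central moments (an assumption)."""
    n = len(pixels)
    mean_y = sum(y for y, _ in pixels) / n
    mean_x = sum(x for _, x in pixels) / n
    # Adding 1/12 treats each pixel as a unit square instead of a point.
    var_y = sum((y - mean_y) ** 2 for y, _ in pixels) / n + 1 / 12
    var_x = sum((x - mean_x) ** 2 for _, x in pixels) / n + 1 / 12
    cov = sum((y - mean_y) * (x - mean_x) for y, x in pixels) / n
    spread = math.sqrt((var_x - var_y) ** 2 + 4 * cov ** 2)
    major = var_x + var_y + spread  # proportional to squared major axis
    minor = var_x + var_y - spread  # proportional to squared minor axis
    return math.sqrt(major / minor)

def average(values):
    """Mean over the components, as in equations (6) and (7)."""
    return sum(values) / len(values)

# Two toy components: a 1 x 4 horizontal line and a 2 x 2 square.
line = [(0, 0), (0, 1), (0, 2), (0, 3)]
square = [(0, 0), (0, 1), (1, 0), (1, 1)]
components = [line, square]
avg_eccentricity = average([eccentricity(c) for c in components])
avg_extent = average([extent(c) for c in components])
```

With this moment-based definition a straight line of n pixels has eccentricity n and a square has eccentricity 1, so elongated components dominate the average.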
The KNN classifier is used for decision making with various numbers of neighbours
(K = 3, 5, 7, 9). To test the performance of the classifier, the data set containing
3000 text words and 400 numerals is randomly divided into five groups, and 5-fold cross
validation is performed for 10 iterations to obtain the optimum results reported in
Tables 1, 2 and 3.
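The evaluation protocol can be sketched as follows. This is a minimal pure-Python illustration under stated assumptions, not the authors' code: Euclidean distance and simple majority voting are assumed (the text does not name the distance measure), and the function names, the seed parameter and the toy data are hypothetical.

```python
import random
from collections import Counter

def knn_predict(train, query, k):
    """Majority vote among the k training samples nearest to `query`.
    `train` is a list of (feature_vector, label) pairs; squared
    Euclidean distance is an assumption."""
    nearest = sorted(
        train,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(item[0], query)),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

def cross_validate(samples, k, folds=5, seed=0):
    """Shuffle the samples, split them into `folds` groups, test on each
    group in turn while training on the rest, and return the accuracy."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    groups = [shuffled[i::folds] for i in range(folds)]
    correct = 0
    for i, test_group in enumerate(groups):
        train = [s for j, g in enumerate(groups) if j != i for s in g]
        correct += sum(knn_predict(train, x, k) == y for x, y in test_group)
    return correct / len(samples)

# Toy data: two well-separated clusters standing in for two scripts.
samples = (
    [((0.01 * i, 0.0), "Kannada") for i in range(10)]
    + [((5.0 + 0.01 * i, 5.0), "Roman") for i in range(10)]
)
accuracies = {k: cross_validate(samples, k) for k in (3, 5)}
```

Repeating cross_validate with different seed values corresponds to the repeated iterations mentioned above; the accuracies over the runs can then be averaged or compared for each K.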
Table 1 Recognition rates for pairs of scripts. The figures indicate the average
correct recognition of words in both scripts.
Scripts    Kannada    Hindi      English
Kannada    -----      95.55%     96.05%
Hindi      95.55%     -----      94.10%
English    96.05%     94.10%     -----
Table 2 Recognition rates for triplets
Scripts                         Kannada    Hindi     English
Kannada/Devnagari and Roman     93.1%      --        --
Devnagari/Roman and Kannada     --         92.5%     --
Roman/Devnagari and Kannada     --         --        89.5%
Table 3 Confusion Matrix of Kannada and English Numerals
Actual     Kannada    English
Kannada    199        1
English    3          197
Table 4 Comparative Analysis
Method             Scripts combination            Accuracy    Time complexity
U. Pal             Oriya, Roman                   97.69%      High
Proposed method    Kannada, Roman                 96.05%      Low
                   Hindi, Roman                   94.10%
                   Hindi, Kannada                 95.55%
                   Kannada, Hindi, English        91.70%
                   Roman and Kannada numerals     99.00%
5. Conclusion
This paper describes a simple method for script identification of handwritten text
words and numerals in three major Indian scripts. The aim of the paper is to facilitate
multilingual handwritten OCR and script based retrieval of offline handwritten
documents. By decomposing the word image in two directions at two levels using
morphological transformation, seven global features are obtained, and three further
dominant local features are computed based on connected components. These features
are passed to a KNN classifier for classification of the scripts, achieving high
accuracy. The proposed algorithm is insensitive to writing style, ink, size, noise and
character slant.
Acknowledgment
The authors are grateful to Dr. P. Nagabushan, Dr. G. Hemantha Kumar and Dr. D.
S. Guru, University of Mysore, for their helpful discussion and encouragement during
this work.
References
[1] Santanu Chaudhury, Gaurav Harit, Shekar Madnani, R.B. Shet, “Identification of scripts of Indian languages by
combining trainable classifiers,” Proc. of ICVGIP, Dec. 20-22, Bangalore, India, 2000.
[2] D. Dhanya, A.G. Ramakrishnan and Peeta Basa Pati, “Script identification in printed bilingual documents,”
Sadhana, vol. 27, part 1, pp. 73-82, 2002.
[3] J. Hochberg, P. Kelly, T. Thomas and L. Kerns, “Automatic script identification from document images using
cluster-based templates,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, pp. 176-181, 1997.
[4] Judith Hochberg, Kevin Bowers, Michael Cannon and Patrick Kelly, “Script and language identification for
hand-written document images,” IJDAR, vol. 2, pp. 45-52, 1999.
[5] M.C. Padma and P. Nagabhushan, “Script identification and separation of text words of Kannada, Hindi and
English languages through discriminating features,” Proc. of NCDAR-2003, pp. 252-260, 2003.
[6] G.S. Peake and Tan, “Script and language identification from document images,” Proc. of Eighth British Mach.
Vision Conf., vol. 2, pp. 230-233, Sept. 1997.
[7] U. Pal and B.B. Chaudhuri, “Script line separation from Indian multi-script documents,” 5th ICDAR,
pp. 406-409, 1999.
[8] U. Pal, S. Sinha and B.B. Chaudhuri, “Word-wise script identification from a document containing English,
Devnagari and Telgu text,” Proc. of NCDAR, pp. 213-220, 2003.
[9] S. Basavaraj Patil and N.V. Subbareddy, “Neural network based system for script identification in Indian
documents,” Sadhana, vol. 27, part 1, pp. 83-97, 2002.
[10] Peeta Basa Pati, S. Sabari Raju, Nishikanta Pati and A.G. Ramakrishnan, “Gabor filters for document analysis
in Indian bilingual documents,” Proc. of ICISIP, pp. 123-126, 2004.
[11] P. Nagabhushan, S.A. Angadi and B.S. Anami, “An intelligent pin code script identification methodology
based on texture analysis using modified invariant moments,” Proc. of ICCR, pp. 615-623, 2005.
[12] A.L. Spitz, “Determination of the script and language content of document images,” IEEE Trans. on Pattern
Analysis and Machine Intelligence, vol. 19, pp. 234-245, 1997.
[13] T.N. Tan, “Rotation invariant texture features and their use in automatic script identification,” IEEE Trans. on
Pattern Analysis and Machine Intelligence, vol. 20, pp. 751-756, 1998.
[14] S. Wood, X. Yao, K. Krishnamurthi and L. Dang, “Language identification for printed text independent of
segmentation,” Proc. of Int’l. Conf. on Image Processing, pp. 428-431, 1995.
[15] Anoop M. Namboodiri and Anil K. Jain, “Online handwritten script identification,” IEEE Trans. on Pattern
Analysis and Machine Intelligence, vol. 26, no. 1, pp. 124-130, 2004.
[16] K. Roy, A. Banerjee and U. Pal, “A system for word-wise handwritten script identification for Indian postal
automation,” Proc. IEEE India Annual Conference 2004 (INDICON-04), pp. 266-271, 2004.
[17] Lijun Zhou, Yue Lu and Chew Lim Tan, “Bangla/English script identification based on analysis of connected
component profiles,” Proc. 7th IAPR Workshop on Document Analysis Systems, New Zealand, pp. 234-254,
13-15 Feb. 2006.
[18] N. Otsu, “A threshold selection method from gray-level histogram,” IEEE Trans. Systems, Man, and
Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.
[19] Dengsheng Zhang and Guojun Lu, “Review of shape representation and description techniques,” Pattern
Recognition, vol. 37, pp. 1-19, 2004.