International Conference on Computational Intelligence and Multimedia Applications 2007

Global and Local Features Based Handwritten Text Words and Numerals Script Identification

B.V. Dhandra
P.G. Department of Studies & Research in Computer Science,
Gulbarga University, Gulbarga, India
[email protected]

Mallikarjun Hangarge
P.G. Department of Studies & Research in Computer Science,
Gulbarga University, Gulbarga, India
[email protected]

Abstract
This paper addresses the problem of script identification in handwritten document images,
which facilitates many important applications such as sorting, transcription of multilingual
documents and indexing of large collections of such images, and serves as a precursor to
optical character recognition (OCR). The script identification scheme proposed in this paper
has two phases. The first phase identifies the script of text words using global and local
features, extracted by morphological filters and regional descriptors, for three major Indian
languages/scripts: Kannada, Roman and Devnagari. The second phase identifies the script of
Kannada and Roman handwritten numerals. A K nearest neighbour algorithm is used to classify
both text words and numerals. The proposed algorithm achieves maximum average recognition
accuracies of 96.05% and 99% for text words and numerals, respectively, under five-fold cross
validation. The data set contains 3000 text words and 400 numerals collected from 250
writers. The proposed algorithm is robust to noise, writer style, size and ink.

1. Introduction
In the development of a successful multilingual OCR system, script identification is
essential before an individual OCR system can be run. Most of the published work on
automatic script identification of Indian scripts deals with printed documents, and very
few articles deal with the handwritten script identification problem. This has motivated
us to consider handwritten script identification at word level for three major Indian
scripts: English (Roman), Kannada and Devnagari.
In the literature, a number of approaches can be found for determining the script/
language of printed and handwritten documents. They can typically be classified into
four categories: (a) connected components analysis (local features based analysis), (b)
characters, words and text lines analysis, (c) text blocks analysis (global features based
analysis), and (d) hybrid information of connected components, text lines etc. Most of the
text block and word level script separation work reported in [1, 2, 6, 9, 10, 11, 13]
uses directional energy features of Gabor filters. Discrimination of scripts at text line
and word level based on shape, conventional, stroke and water reservoir features can be
found in [5, 7, 8, 16]. Connected components, clusters and projection profile features
are used in [2, 3, 4, 12, 14, 17] for script separation. In this paper an attempt is made
to demonstrate the potential of hybridized features (global + local) for script
identification at word level.
Section 2 gives a brief overview of data collection and pre-processing. Section 3
discusses segmentation and feature extraction. The experimental details and results are
presented in Section 4, and the conclusion is given in Section 5.

0-7695-3050-8/07 $25.00 © 2007 IEEE
DOI 10.1109/ICCIMA.2007.125

2. Data collection and pre-processing
Standard databases of Indian scripts are not available at present. Therefore, a sample
of 250 writers was chosen from schools, colleges and professionals, and 250 unconstrained
handwritten text documents of Kannada and Devnagari scripts with Kannada and English
numerals were collected.
The collected documents were scanned using an HP scanner at 300 DPI, which usually
yields a low-noise, good-quality document image. The digitized images are in gray tone,
and we have used Otsu's [18] global thresholding approach to convert them into two-tone
images. The two-tone images are then converted into 0-1 labels, where the label 1
represents the object and 0 represents the background. Small objects (less than or equal
to 50 pixels) such as single or double quotation marks, hyphens and periods are removed
using morphological opening. Further, we assume that skew correction of the document has
already been performed.
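
The binarization step can be sketched as follows. This is a minimal pure-Python illustration of Otsu's global threshold [18] on a tiny synthetic gray image, not the implementation used in the experiments. The function names `otsu_threshold` and `binarize`, the 8-bit gray range, and the assumption that ink is darker than paper are ours.

```python
def otsu_threshold(gray, levels=256):
    """Return the gray level t that maximizes the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0      # number of pixels in the low (dark) class
    sum0 = 0    # sum of gray values in the low class
    for t in range(levels):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue
        m0 = sum0 / w0
        m1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (m0 - m1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray):
    """Label dark (ink) pixels 1 and light (background) pixels 0.

    Assumption: ink is darker than paper.
    """
    t = otsu_threshold(gray)
    return [[1 if v <= t else 0 for v in row] for row in gray]


# Synthetic 3x4 gray image: a dark stroke on a light background.
gray = [
    [250, 248, 30, 251],
    [249, 25, 28, 247],
    [252, 250, 35, 250],
]
print(binarize(gray))
```

The removal of small objects (50 pixels or fewer) is not shown here; it would follow the binarization as a separate filtering step.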

3. Segmentation and features extraction


The Kannada and Devnagari handwritten document images are segmented line-wise using
the horizontal projection profile. However, segmentation of touching lines is not
attempted here. Further, the text lines are dilated using a line structuring element
(its length is calculated using equation (1) with k=0.7) in horizontal and vertical
directions to fill the gaps between the characters and the descenders. Finally, the
dilated lines are used for word-wise segmentation by fitting a bounding box on each
connected component and cropping the word image. Thus, we have obtained 2000
word images of Kannada and Devnagari scripts (1000 each) and 1000 English word
images from the standard IAM database. Each sample or pattern that we attempt to classify
here is a word in English, Kannada or Devnagari script. Further, it is helpful to study
the visually discriminating characteristics of these scripts for robust feature extraction.
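
Line-wise segmentation by horizontal projection profile can be sketched as below. This is a minimal illustration under our own assumptions: the image is a list of 0/1 rows, a text line is a maximal run of rows containing at least one on pixel, and the name `segment_lines` is hypothetical. As in the text, touching lines are not handled.

```python
def segment_lines(binary):
    """Split a 0/1 image into text lines using the horizontal projection profile.

    Returns a list of (top, bottom) row indices, with bottom exclusive.
    """
    profile = [sum(row) for row in binary]   # number of on pixels in each row
    lines, start = [], None
    for y, count in enumerate(profile):
        if count > 0 and start is None:
            start = y                         # first inked row of a text line
        elif count == 0 and start is not None:
            lines.append((start, y))          # a blank row closes the line
            start = None
    if start is not None:                     # line running to the last row
        lines.append((start, len(binary)))
    return lines


page = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
]
print(segment_lines(page))  # [(1, 3), (5, 6)]
```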
Kannada: Horizontal and diagonal strokes are the dominant primitives of Kannada
characters. The characters are flat in shape and have more closed curves than those of
Roman script. Kannada has more descenders and holes than Roman and Devnagari
scripts.
Roman (English): The important property of the Roman (English) script is the
presence of vertical strokes in its characters; it has fewer horizontal strokes than
Kannada and Devnagari scripts.
Devnagari: Most of the characters of Devnagari script have a horizontal line at the
upper part. In Devnagari, this line is called the sirorekha. When two or more Devnagari
characters sit side by side to form a word, the sirorekhas (headlines) touch one another
and generate a long headline [7] in printed documents, whereas in handwritten
documents these lines are usually drawn after the word is written.
These visually discriminating characteristics are extracted as global and local
features, and their method of computation is described in the following sections.
Initially, the first seven global features are computed by considering a text
word/numeral image as a texture (Section 3.1).

3.1 Global features

To extract the strokes in vertical and horizontal directions, we perform the
opening operation on the input binary text word/numeral image with a line structuring
element. Its length is computed using equation (1) with k=0.5.

\[ \mathrm{Length\_strel} = k \cdot \mathrm{Mean}(\mathrm{Connected\_components\_height}) \tag{1} \]

where k varies from 0.5 to 0.8.
Stroke density: The stroke length is defined as the number of pixels in a stroke [15].
The stroke density is defined as the total length of all the strokes divided by the size
of the image. Throughout Section 3.1, N denotes the number of on pixels in an image.
The values of all 10 extracted features are real numbers.
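1. Vertical Stroke Density (VSD):

\[ \mathrm{VSD}(\text{Vertical pattern}) = \frac{\sum_{i=1}^{N} \mathrm{Vertical\_onpixel}(\mathrm{Pattern}_i)}{\mathrm{Size}(\mathrm{Vertical\_Pattern})} \tag{2} \]

2. Horizontal Stroke Density (HSD):

\[ \mathrm{HSD}(\text{Horizontal pattern}) = \frac{\sum_{i=1}^{N} \mathrm{Horizontal\_onpixel}(\mathrm{Pattern}_i)}{\mathrm{Size}(\mathrm{Horizontal\_Pattern})} \tag{3} \]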
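
Equations (1)-(3) can be illustrated with the following sketch. It relies on the fact that a binary opening with a line structuring element keeps exactly the pixels lying in runs at least as long as the element. The function names, the rounding of the element length, and the reading of Size(.) as rows times columns of the image are our assumptions, and the word image is synthetic.

```python
def strel_length(component_heights, k=0.5):
    """Equation (1): k times the mean connected-component height (rounded)."""
    mean_height = sum(component_heights) / len(component_heights)
    return max(1, round(k * mean_height))


def open_vertical(img, length):
    """Binary opening with a vertical line of `length` pixels.

    Keeps only pixels in vertical runs of at least `length` on pixels.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for x in range(w):
        run_start = None
        for y in range(h + 1):                # y == h closes a trailing run
            on = y < h and img[y][x] == 1
            if on and run_start is None:
                run_start = y
            elif not on and run_start is not None:
                if y - run_start >= length:
                    for yy in range(run_start, y):
                        out[yy][x] = 1
                run_start = None
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def open_horizontal(img, length):
    """Horizontal opening, obtained by transposing the vertical case."""
    return transpose(open_vertical(transpose(img), length))


def density(pattern):
    """On-pixel count divided by image size (assumed: rows * columns)."""
    return sum(map(sum, pattern)) / (len(pattern) * len(pattern[0]))


# Synthetic word image: one headline and one vertical stem (a single component).
word = [
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
length = strel_length([4], k=0.5)             # one component of height 4
vsd = density(open_vertical(word, length))    # equation (2)
hsd = density(open_horizontal(word, length))  # equation (3)
print(length, vsd, hsd)  # 2 0.2 0.25
```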
The remaining features 3, 4, 5 and 6 are extracted using top-hat and bottom-hat
morphological filters (transformations) in vertical and horizontal directions. These
features are computed in the same way as in equations (2) and (3). As an illustration,
the morphological opening and top-hat transformation of a Devnagari word in vertical and
horizontal directions are shown in Fig. 1.

Figure 1. (a) Devnagari word, (b) vertical opening of (a), (c) vertical top-hat
transform of (a), (d) horizontal opening of (a), (e) horizontal top-hat transform of (a)
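
For binary images, the top-hat transform is the image minus its opening, and the bottom-hat transform is the closing minus the image. The sketch below shows the vertical case on a synthetic pattern; the horizontal case follows by transposing the image. The function names are ours, and the closing is obtained here by duality (complement of the opening of the complement), which treats the area outside the image as foreground. A library implementation may handle borders differently.

```python
def open_vertical(img, length):
    """Opening with a vertical line: keep vertical runs of >= length on pixels."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for x in range(w):
        run_start = None
        for y in range(h + 1):                # y == h closes a trailing run
            on = y < h and img[y][x] == 1
            if on and run_start is None:
                run_start = y
            elif not on and run_start is not None:
                if y - run_start >= length:
                    for yy in range(run_start, y):
                        out[yy][x] = 1
                run_start = None
    return out


def complement(img):
    return [[1 - v for v in row] for row in img]


def close_vertical(img, length):
    """Closing by duality: fills vertical background gaps shorter than length."""
    return complement(open_vertical(complement(img), length))


def top_hat(img, length):
    """Image minus opening: strokes too short to contain the line element."""
    opened = open_vertical(img, length)
    return [[a - b for a, b in zip(r, o)] for r, o in zip(img, opened)]


def bottom_hat(img, length):
    """Closing minus image: short gaps that the closing filled."""
    closed = close_vertical(img, length)
    return [[c - a for c, a in zip(cr, r)] for cr, r in zip(closed, img)]


word = [
    [1, 1, 1],
    [0, 1, 0],
    [0, 1, 0],
    [1, 1, 1],
]
print(top_hat(word, 3))     # [[1, 0, 1], [0, 0, 0], [0, 0, 0], [1, 0, 1]]
print(bottom_hat(word, 3))  # [[0, 0, 0], [1, 0, 1], [1, 0, 1], [0, 0, 0]]
```

A density feature is then the on-pixel count of each transformed image divided by the image size, as in equations (2) and (3).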
Pixel density of an image after fill holes: This is the ratio of the number of on
pixels left after performing the fill-holes operation on the input pattern to its size.
7. Pixel Density of the pattern after fill holes (PDH) is defined as

\[ \mathrm{PDH}(\text{pattern}) = \frac{\sum_{i=1}^{N} \mathrm{Onpixel}(\mathrm{pattern}_i)}{\mathrm{Size}(\mathrm{pattern}_i)} \tag{4} \]
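
Feature 7 can be sketched as follows. Holes are taken to be background pixels that cannot be reached from the image border through 4-connected background; this connectivity choice, the function names, and the reading of the pattern size as rows times columns are our assumptions.

```python
from collections import deque


def fill_holes(binary):
    """Set to 1 every background pixel not 4-connected to the image border."""
    h, w = len(binary), len(binary[0])
    outside = [[False] * w for _ in range(h)]
    queue = deque()
    # Seed the flood fill with every background pixel on the border.
    for y in range(h):
        for x in range(w):
            on_border = y in (0, h - 1) or x in (0, w - 1)
            if on_border and binary[y][x] == 0:
                outside[y][x] = True
                queue.append((y, x))
    # Grow the outside region through 4-connected background pixels.
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w
                    and not outside[ny][nx] and binary[ny][nx] == 0):
                outside[ny][nx] = True
                queue.append((ny, nx))
    # Everything that is not outside background is object or filled hole.
    return [[0 if outside[y][x] else 1 for x in range(w)] for y in range(h)]


def pixel_density_after_fill(binary):
    """Equation (4): on pixels after hole filling divided by image size."""
    filled = fill_holes(binary)
    return sum(map(sum, filled)) / (len(filled) * len(filled[0]))


# Synthetic glyph: a closed loop with a one-pixel hole.
glyph = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
print(pixel_density_after_fill(glyph))  # 9 / 25 = 0.36
```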

3.2 Local features

The following features are extracted from the connected components of an image and
are therefore local features. Throughout Section 3.2, N denotes the number of connected
components of an image.
8. Aspect Ratio: The ratio of the height to the width of a connected component of an
image [4]. The average aspect ratio (AAR) is a real number defined as

\[ \mathrm{AAR}(\text{pattern}) = \frac{1}{N} \sum_{i=1}^{N} \frac{\mathrm{height}(\mathrm{component}_i)}{\mathrm{width}(\mathrm{component}_i)} \tag{5} \]

The aspect ratio is a very important feature for word-wise script identification [15].
9. Eccentricity: It is defined as the length of the major axis divided by the length of
the minor axis of a connected component of an image [19].

\[ \text{Average Eccentricity} = \frac{\sum_{i=1}^{N} \mathrm{eccentricity}_i}{N} \tag{6} \]
10. Extent: It is a real-valued function, defined as the proportion of the pixels in the
bounding box that are also in the region.

\[ \text{Average Extent} = \frac{\sum_{i=1}^{N} \mathrm{extent}_i}{N} \tag{7} \]
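
The averages in equations (5) and (7) can be sketched as below. Components are found with 8-connectivity, which is our assumption since the connectivity is not stated above; the function names are hypothetical. Eccentricity (equation (6)) is left out of the sketch because the estimation of the major and minor axes is not specified here.

```python
from collections import deque


def connected_components(binary):
    """Return a list of components, each a list of (y, x) pixels (8-connected)."""
    h, w = len(binary), len(binary[0])
    seen = [[False] * w for _ in range(h)]
    components = []
    for y0 in range(h):
        for x0 in range(w):
            if binary[y0][x0] != 1 or seen[y0][x0]:
                continue
            seen[y0][x0] = True
            queue, pixels = deque([(y0, x0)]), []
            while queue:                      # breadth-first flood fill
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < h and 0 <= nx < w
                                and binary[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def local_features(binary):
    """Return (average aspect ratio, average extent) over all components."""
    ratios, extents = [], []
    for pixels in connected_components(binary):
        ys = [y for y, _ in pixels]
        xs = [x for _, x in pixels]
        height = max(ys) - min(ys) + 1        # bounding-box height
        width = max(xs) - min(xs) + 1         # bounding-box width
        ratios.append(height / width)                  # equation (5) term
        extents.append(len(pixels) / (height * width))  # equation (7) term
    n = len(ratios)
    return sum(ratios) / n, sum(extents) / n


# Synthetic image: a 4x1 vertical bar and a 3-pixel corner shape.
word = [
    [1, 0, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0],
]
print(local_features(word))  # (2.5, 0.875)
```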

3.3 K nearest neighbour (KNN) classifier

The KNN classifier is used for decision making with various numbers of neighbours
(K = 3, 5, 7, 9). To test the performance of the classifier, the data set containing 3000
text words and 400 numerals is randomly divided into five groups, and 5-fold cross
validation is performed for 10 iterations to obtain the optimum results reported in
Tables 1, 2 and 3.
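
The evaluation protocol can be sketched as follows. The Euclidean distance, the majority vote, and the reading of "10 iterations" as ten differently shuffled 5-fold splits are our assumptions; the two-dimensional feature vectors are synthetic stand-ins, not the ten features or the data of this paper.

```python
import random
from collections import Counter


def knn_predict(train, query, k):
    """Majority vote among the k training samples nearest to `query`."""
    def dist2(sample):
        # Squared Euclidean distance between feature vectors (assumed metric).
        return sum((a - b) ** 2 for a, b in zip(sample[0], query))
    nearest = sorted(train, key=dist2)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def cross_validate(samples, k, folds=5, repeats=10):
    """Mean accuracy over `repeats` random `folds`-fold cross validations."""
    accuracies = []
    for seed in range(repeats):
        data = samples[:]
        random.Random(seed).shuffle(data)     # a new random split per repeat
        correct = 0
        for f in range(folds):
            test = data[f::folds]             # every folds-th sample is held out
            train = [s for i, s in enumerate(data) if i % folds != f]
            correct += sum(knn_predict(train, x, k) == y for x, y in test)
        accuracies.append(correct / len(data))
    return sum(accuracies) / repeats


# Synthetic, well-separated feature vectors for two hypothetical classes.
samples = ([((0.01 * i, 0.02 * i), "script A") for i in range(20)]
           + [((1 + 0.01 * i, 1 + 0.02 * i), "script B") for i in range(20)])
for k in (3, 5, 7, 9):
    print(k, cross_validate(samples, k))
```

On such cleanly separated synthetic data every K gives perfect accuracy; the sketch only shows the mechanics of the protocol.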

4. Results and discussions


For experimentation, 3000 text word images and 400 handwritten English and
Kannada numerals (200 each) are used. The script identification results for text words
in the bi-script and tri-script problems are presented in Table 1 and Table 2, and the
numeral script identification results in Table 3. The comparative analysis is reported
in Table 4. The average time taken to recognize the script of a word is 0.2109 seconds
on a 1.8 GHz Pentium IV machine with 128 MB RAM.

Table 1. Recognition rates for pairs of scripts. The figures indicate the average
correct recognition of words in both scripts.

Scripts    Kannada    Hindi     English
Kannada    -----      95.55%    96.05%
Hindi      95.55%     -----     94.10%
English    96.05%     94.10%    -----

Table 2. Recognition rates for triplets

Scripts                          Kannada    Hindi    English
Kannada / Devnagari and Roman    93.1%      --       --
Devnagari / Roman and Kannada    --         92.5%    --
Roman / Devnagari and Kannada    --         --       89.5%

Table 3. Confusion matrix of Kannada and English numerals

Actual     Kannada    English
Kannada    199        1
English    3          197

Table 4. Comparative analysis

Method             Scripts combination            Accuracy    Time complexity
U. Pal             Oriya, Roman                   97.69%      High
Proposed method    Kannada, Roman                 96.05%      Low
                   Hindi, Roman                   94.10%
                   Hindi, Kannada                 95.55%
                   Kannada, Hindi, English        91.70%
                   Roman and Kannada numerals     99.00%
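
The summary figures in Table 4 can be checked against Tables 2 and 3 with a few lines of arithmetic. The sketch below assumes that the rows of Table 3 are the actual classes and the columns the assigned classes, and that the tri-script figure is the plain mean of the three rates in Table 2.

```python
# Table 3: rows = actual class, columns = assigned class (Kannada, English).
confusion = [
    [199, 1],
    [3, 197],
]
correct = confusion[0][0] + confusion[1][1]       # diagonal entries
total = sum(map(sum, confusion))                  # all 400 numerals
numeral_accuracy = 100 * correct / total

# Table 2: per-script recognition rates for the tri-script problem.
tri_script_rates = [93.1, 92.5, 89.5]
tri_script_mean = sum(tri_script_rates) / len(tri_script_rates)

print(f"{numeral_accuracy:.2f}%")   # 99.00%
print(f"{tri_script_mean:.2f}%")    # 91.70%
```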

5. Conclusion
This paper describes a simple method for script identification of handwritten text
words and numerals in three major Indian scripts. The aim of the paper is to facilitate
multilingual handwritten OCR and script-based retrieval of offline handwritten
documents. By decomposing the word image in two directions at two levels using
morphological transformations, seven global features are obtained; three further
dominant local features are computed from connected components. These features
are passed to a KNN classifier for classification of the scripts, which achieves high
accuracy. The proposed algorithm is insensitive to writing style, ink, size, noise and
character slant.

Acknowledgment
The authors are grateful to Dr. P. Nagabhushan, Dr. G. Hemantha Kumar and Dr. D.
S. Guru, University of Mysore, for their helpful discussions and encouragement during
this work.

References
[1] Santanu Chaudhury, Gaurav Harit, Shekar Madnani and R.B. Shet, “Identification of scripts of Indian languages by combining trainable classifiers,” Proc. of ICVGIP, Dec. 20-22, Bangalore, India, 2000.
[2] D. Dhanya, A.G. Ramakrishnan and Peeta Basa Pati, “Script identification in printed bilingual documents,” Sadhana, vol. 27, part 1, pp. 73-82, 2002.
[3] J. Hochberg, P. Kelly, T. Thomas and L. Kerns, “Automatic script identification from document images using cluster-based templates,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, pp. 176-181, 1997.
[4] Judith Hochberg, Kevin Bowers, Michael Cannon and Patrick Kelly, “Script and language identification for hand-written document images,” IJDAR, vol. 2, pp. 45-52, 1999.
[5] M.C. Padma and P. Nagabhushan, “Script identification and separation of text words of Kannada Hindi and English languages through discriminating features,” Proc. of NCDAR-2003, pp. 252-260, 2003.
[6] G.S. Peake and Tan, “Script and language identification from document images,” Proc. of Eighth British Mach. Vision Conf., vol. 2, pp. 230-233, Sept. 1997.
[7] U. Pal and B.B. Chaudhuri, “Script line separation from Indian multi-script documents,” 5th ICDAR, pp. 406-409, 1999.
[8] U. Pal, S. Sinha and B.B. Chaudhuri, “Word-wise script identification from a document containing English, Devnagari and Telugu text,” Proc. of NCDAR, pp. 213-220, 2003.
[9] S. Basavaraj Patil and N.V. Subbareddy, “Neural network based system for script identification in Indian documents,” Sadhana, vol. 27, part 1, pp. 83-97, 2002.
[10] Peeta Basa Pati, S. Sabari Raju, Nishikanta Pati and A.G. Ramakrishnan, “Gabor filters for document analysis in Indian bilingual documents,” Proc. of ICISIP, pp. 123-126, 2004.
[11] P. Nagabhushan, S.A. Angadi and B.S. Anami, “An intelligent pin code script identification methodology based on texture analysis using modified invariant moments,” Proc. of ICCR, pp. 615-623, 2005.
[12] A.L. Spitz, “Determination of the script and language content of document images,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 19, pp. 234-245, 1997.
[13] T.N. Tan, “Rotation invariant texture features and their use in automatic script identification,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 20, pp. 751-756, 1998.
[14] S. Wood, X. Yao, K. Krishnamurthi and L. Dang, “Language identification for printed text independent of segmentation,” Proc. of Int’l. Conf. on Image Processing, pp. 428-431, 1995.
[15] Anoop M. Namboodiri and Anil K. Jain, “Online handwritten script identification,” IEEE Trans. on Pattern Analysis and Machine Intelligence, vol. 26, no. 1, pp. 124-130, 2004.
[16] K. Roy, A. Banerjee and U. Pal, “A system for word-wise handwritten script identification for Indian postal automation,” Proc. IEEE India Annual Conference 2004 (INDICON-04), pp. 266-271, 2004.
[17] Lijun Zhou, Yue Lu and Chew Lim Tan, “Bangla/English script identification based on analysis of connected component profiles,” Proc. 7th IAPR Workshop on Document Analysis Systems, New Zealand, pp. 234-254, 13-15 Feb. 2006.
[18] N. Otsu, “A threshold selection method from gray-level histograms,” IEEE Trans. Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62-66, 1979.
[19] Dengsheng Zhang and Guojun Lu, “Review of shape representation and description techniques,” Pattern Recognition, vol. 37, pp. 1-19, 2004.
