Script Recognition Ghosh 2009
Article in IEEE Transactions on Pattern Analysis and Machine Intelligence · December 2010
DOI: 10.1109/TPAMI.2010.30 · Source: PubMed
3 authors:
Shivaprasad Adamane
Indian Institute of Science
IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009 1
Abstract—A variety of different scripts are used in writing languages throughout the world. In a multi-script, multilingual environment, it
is essential to know the script used in writing a document before an appropriate character recognition and document analysis algorithm
can be chosen. In view of this, several methods for automatic script identification have been developed so far. They mainly belong to
two broad categories – structure-based and visual appearance-based techniques. This survey report gives an overview of the different
script identification methodologies under each of these categories. Methods for script identification in online data and video-texts are
also presented. It is noted that the research in this field is relatively thin and still more research is to be done, particularly in case of
handwritten documents.
Index Terms—Document analysis, Optical character recognition, Script identification, Multi-script document.
Digital Object Identifier 10.1109/TPAMI.2010.30 · 0162-8828/10/$26.00 © 2010 IEEE
Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
Fig. 3. Tree diagram showing broad classification of prominent writing systems and scripts of the present world.
2.2 Syllabic system
In a syllabic system, every written symbol represents a phonetic sound or syllable, as used in Japanese. The symbols representing the Japanese syllables are known as Kanas, which are of two types: Hiragana and Katakana. As indicated in Fig. 3, Japanese script uses a mix of logographic Kanji and syllabic Kanas. Hence, it is visually similar to Chinese, but less dense due to the presence of simpler Kanas in between the logograms.

2.3 Alphabetic system
An alphabet is a set of characters representing phonemes of a spoken language. Examples of scripts following this system are Greek, Latin, Cyrillic and Armenian. The Latin script, also called Roman script, is used by many languages throughout the world with varying degrees of modification from one language to another. It is used for writing many European languages like English, Italian, French, German, Portuguese, Spanish, etc., and has been adopted in many Amerindian and Austronesian languages including modern Malay, Vietnamese and Indonesian. Fig. 4 shows a few such variants of the Latin script. Compared to other scripts, classical Latin characters are simple in structure, mainly composed of a few lines and arcs. The other major script under the alphabetic system is Cyrillic. This script is used by some languages of Eastern Europe, Asia and Slavic regions that include Bulgarian, Russian, Macedonian, Ukrainian, Mongolian, etc. The basic properties of this script are somewhat similar to those of Latin except that it uses a different alphabet set. Some characters in the Cyrillic alphabet are also borrowed from Latin and Greek, modified with cedillas, crosshatches or diacritical marks. This induces recognition ambiguity between Cyrillic, Latin and Greek.

Fig. 4. Examples of some languages using the Latin alphabet with different modifications.

2.4 Abjads
The Abjad system of writing is similar to the alphabetic system, but has symbols for consonantal sounds only. Unlike most other scripts in the world, Abjads are written from right to left within a textline. This unique feature is particularly useful for identifying Abjad-based scripts in pen computing.
Two important scripts under this category are Arabic and Hebrew. A typical Arabic character is formed of a long main stroke along with one to three dots. The characters in a word are generally conjoined, giving an overall cursive appearance to the written text. This provides an important clue for the recognition of Arabic script. The same applies to some other scripts of Arabic origin such as Farsi (Persian), Urdu, Sindhi, Jawi, etc. On the other hand, character strokes in Hebrew are more uniform in length and the letters in a word are generally discrete.

2.5 Abugidas
Abugida is another alphabetic-like writing system used by the Brahmic family of scripts that originated from the ancient Indian Brahmi script and includes nearly all the scripts of India and southeast Asia. In Fig. 5, we draw a tree diagram to illustrate the evolution of major Brahmic scripts in India and southeast Asia. The northern group of Brahmic scripts (e.g. Devnagari, Bengali, Manipuri, Gurumukhi, Gujrati and Oriya) bear strong resemblance to the original Brahmi script. On the other hand, scripts in south India (Tamil, Telugu, Kannada and Malayalam) as well as in southeast Asia (e.g. Thai, Lao, Burmese, Javanese and Balinese) are derived from Brahmi through
many changes and so look quite different from the northern group. One important characteristic of Devnagari, Bengali, Gurumukhi and Manipuri is that the characters in a word are generally written together without spaces, so that the top bar is unbroken. This results in the formation of headline, called shirorekha, at the top of each word. Accordingly, these scripts can be separated from other script types by detecting the presence of a large number of horizontal lines in the textual portions of a document.

Fig. 5. The Brahmic family of scripts used in India and southeast Asia.

2.6 Featural system
The last significant form of writing system is the featural system in which the symbols or characters represent the features that make up the phonemes. One prominent script of this sort is the Korean Hangul. As indicated in Fig. 3, the Korean script is formed by mixing logographic Hanja with featural Hangul. However, modern Korean contains more of Hangul than Hanja. Consequently, Korean script is relatively less complex and less dense compared to Chinese and Japanese, containing more circles and ellipses.

3 SCRIPT RECOGNITION METHODOLOGIES
Script identification relies on the fact that each script has unique spatial distribution and visual attributes that make it possible to distinguish it from other scripts. So, the basic task involved in script recognition is to devise a technique to discover these features from a given document and then classify the document’s script accordingly. Based on the nature of approach and features used, these methods may be divided into two broad categories: structure-based and visual appearance-based methods. Script recognition techniques in each of these two categories may be further classified on the basis of the level at which they are applied inside a document image, viz. page-wise, paragraph-wise, textline-wise and word-wise. The application mode of a method depends on the minimum size of the text from which the features proposed in the method can be extracted reliably. Various algorithms under each of these categories are summarized below.

3.1 Structure-based script recognition
In general, script classes differ from each other in their stroke structure and connections, and the writing styles associated with the character sets they use. One approach to script recognition may be to extract connected components (continuous runs of pixels) in a document [12] and then analyze their shapes and structures so as to reveal the intrinsic morphological characteristics of the script used in the document. In machine-printed
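The connected-component extraction mentioned in Section 3.1 can be illustrated with a minimal sketch. It is not the procedure of [12]; it assumes a binarized image stored as a list of rows with 1 for ink and 0 for background, uses 8-connectivity, and the function names `connected_components` and `bounding_box` are hypothetical, chosen only for this illustration.

```python
from collections import deque
from typing import List, Tuple

Pixel = Tuple[int, int]  # (row, column)


def connected_components(image: List[List[int]]) -> List[List[Pixel]]:
    """Group ink pixels (value 1) into 8-connected components via breadth-first search."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    seen = [[False] * cols for _ in range(rows)]
    components: List[List[Pixel]] = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or seen[r][c]:
                continue
            # Start a new component from this unvisited ink pixel.
            queue = deque([(r, c)])
            seen[r][c] = True
            pixels: List[Pixel] = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                # Visit the eight neighbours (the centre pixel is already marked as seen).
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            components.append(pixels)
    return components


def bounding_box(pixels: List[Pixel]) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right) of a component, inclusive."""
    ys = [y for y, _ in pixels]
    xs = [x for _, x in pixels]
    return min(ys), min(xs), max(ys), max(xs)
```

Shape descriptors such as height, width or aspect ratio can then be read off each bounding box; which descriptors are useful depends on the scripts involved, as the individual methods surveyed here make clear.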
yields only 77.14% accuracy at best while Gabor filter approach yields accuracy rate as high as 95.71%.
One problem encountered in Gabor filter related applications is the high computational cost due to the frequent image filtering. In order to reduce the cost of computation, script identification in machine-printed documents using steerable Gabor filters was proposed in [62]. The method offers two-fold advantages. Firstly, the steerability property of Gabor filter is exploited to reduce the high computational cost. Secondly, the Gabor filter bank is appropriately designed so that the extracted rotation-invariant features can discriminate scripts containing characters that are similar in shape and even share many characters. In this paper, a 98.5% recognition rate was achieved in discriminating Chinese, Japanese, Korean and Latin scripts while the number of image filtering operations was significantly reduced by 40%.
Although the above Gabor function-based script recognition schemes have shown good performance, their application is limited to machine-printed documents only. Variations in writing style, character size, and inter-line and inter-word spacings make the recognition process difficult and unreliable when these techniques are applied directly on handwritten documents. Therefore, it is necessary to preprocess the document images prior to the application of Gabor filter so as to compensate for the different variations present. This has been addressed in the texture-based script identification scheme proposed in [63]. In the preprocessing stage, the algorithm employs denoising, thinning, pruning, m-connectivity, and text size normalization in sequence. Texture features are then extracted using a multi-channel Gabor filter. Finally, different scripts are classified using fuzzy classification. In this proposed system, an overall accuracy of 91.6% was achieved in classifying handwritten documents written in four different scripts, namely Latin, Devnagari, Bengali and Telugu.
Another visual attribute that has been used in many image processing applications is histogram statistics, which reflects spatial distribution of gray levels in an image. In a recent work [64], Cheng et al proposed to use normalized histogram statistics for the purpose of script identification in documents typeset in Latin, Chinese, Cyrillic or Japanese. In this work, every line of text in an input document is divided into three zones: ascender zone between top-line and x-line, x-zone between x-line and baseline, and descender zone between baseline and bottom-line. Next, a horizontal projection is obtained for each textline that gives zone-wise distribution of character pixels in a textline. It is observed that Latin and Cyrillic characters mainly distribute in the x-zone with two significant peaks located on the x-line and the baseline. The baseline peak is higher than the x-line peak in Latin while they are almost equal in Cyrillic. Chinese characters, on the other hand, have relatively random distribution without any peak in the profile. Japanese characters also have the same random distribution but the average height of the profile is significantly lower. Thus, it is possible to separate out every script from other scripts by analyzing the distribution of character pixels in different zones inside a document.

3.2.2 Script identification at paragraph and text-block level
The use of texture features in script identification was considered by Jain and Zhong for discriminating printed Chinese and English documents [65]. This paper in fact proposed a texture-based language-free page segmentation algorithm which automatically extracts text, halftone and line-drawing regions from input gray-scale document images. An extension of this page segmentation procedure provides for further segmentation of the text regions into different script regions. First, a set of optimal texture discrimination masks are created through neural network training. Next, texture features are obtained by convolving the trained masks with the input image. These features are then used for classification.
The use of other texture features for script classification, other than GLCM and Gabor energy features, has been explored by Busch et al in [66]. The features that they used are wavelet energy features, wavelet log mean deviation features, wavelet co-occurrence signatures, wavelet log co-occurrence features, and wavelet scale co-occurrence signatures. They tested these features on a database containing eight different script types: Latin, Han, Japanese, Greek, Cyrillic, Hebrew, Devnagari and Farsi. In their experiments, machine-printed document images of size 64×64 pixels were first binarized, skew corrected and text-block normalized, in line with the work done by Peake and Tan in [60]. In order to reduce the dimensionality of the feature vectors while improving classification accuracy, Fisher linear discriminant analysis technique is applied. Classification is performed using a GMM (Gaussian mixture model) classifier which models each script class as a combination of Gaussian distributions. The GMM classifier is trained using a version of the expectation maximization (EM) algorithm. In order to create a more stable and global script model, a maximum a posteriori (MAP) adaptation-based method was also proposed. It was seen that the wavelet log co-occurrence outperforms all other texture features for script classification (only 1% classification error) while GLCM features yielded the worst overall performance (9.1% classification error). This indicates that pixel relationships at small distances are insufficient to characterize the script of a document image appropriately.
However, a single model per script class is useful only when every script is written using only one font or using only visually similar fonts. On the contrary, there typically exists a large number of fonts, often of widely varying appearance, within a given script. Because of such variations, it is unlikely that a model trained on one set of fonts will correctly identify an image of a previously unseen font of the same script. For example,
classification error increases from 1% and 9.1% in [66] to 15.9% and 13.2% in cases of wavelet log co-occurrence and GLCM features, respectively. In view of this, Busch proposed to characterize multiple fonts within a single script more adequately by using multiple models per script class [67]. This is done by partitioning each script class into ten subclasses, each subclass corresponding to one font included within that script class. This is followed by linear discriminant analysis and classification using the modified MAP-GMM classifier as above. Such a classification system provides significant improvement when compared to the results obtained using a single model: classification error reduces to 2.1% and 12.5% for the above two cases, respectively.
Script identification in Indian printed documents using oriented local energy features was performed in [68]. Local energy is defined as the sum of squared responses of a pair of conjugate symmetric Gabor filters. In an earlier work, Chan et al [69] derived a set of descriptors from oriented local energy and demonstrated their utility in script classification. In line with human perception, the features chosen are energy distribution, the ratio of energies for two non-adjacent channels, and the horizontal projection profile. The distribution of energy across differently oriented channels of a Gabor filter differs from one script to another. While this feature captures the global differences among scripts, a closer analysis of the energy distribution may be necessary to reveal finer differences between similar-looking scripts. This is provided by the ratios between energies at the output of non-adjacent channel pairs. Finally, there are certain scripts which are distinguishable only by the stroke structures used in the upper part of the words. For example, Devnagari and Gurumukhi differ in the shape of the matra present above the headline (‘shirorekha’). Horizontal projection is used to discover this information. One major advantage with these features is that it is not necessary to perform analysis at multiple frequencies but at only one optimal frequency. This helps in reducing the computational cost. Again, filter response can be enhanced by increasing filter bandwidth at this optimal frequency. Accordingly, the filters employed in [68] are log-Gabor filters designed for one empirically determined optimal frequency and at eight equi-spaced orientations. For an input text-block of size 100 × 100 pixels, the aforementioned features are calculated and then classified to different script classes using a KNN classifier. The scheme was tested on ten different scripts commonly used in India and an overall classification accuracy of 97.11% was achieved. The scripts used included Devnagari, Bengali, Tamil, Kannada, Malayalam, Gurumukhi, Oriya, Gujrati, Urdu and Latin. Fig. 16 illustrates how these ten different Indian scripts are classified using these features in two levels of hierarchy.

Fig. 16. Classification hierarchy in Joshi et al’s script identification scheme.

3.2.3 Script identification at word/character level
While all the texture-based script identification methods described above work on a document page or a text-block, script identification at the word level has been successfully implemented in [70], [71], [72], [73], [74], [75], [76]. In the works by Ma et al [70], [71], Gabor filter analysis is applied to each word in a bilingual document to extract features characterizing the script in which that particular word is written. Subsequently, a 2-class classifier system is used to discriminate the two different scripts contained in the input document. Different classifier architectures based on SVM, KNN, weighted Euclidean distance and GMM are considered. A classifier system consisting of a single classifier may use any of the above four architectures, while a multiple classifier system is built by combining two or more of them. In a multiple classifier system, the classification scores from each of the different component classifiers are combined using the sum rule to arrive at the final decision. In their papers, Ma et al considered bilingual documents containing combinations of one Latin-based language (mainly English) and one non-Latin language (e.g., Arabic, Chinese, Hindi or Korean). It was observed that while the performance for English-Hindi documents was quite good (97.51% recognition rate using KNN classifier), script identification in English-Arabic documents had the lowest performance (90.93% using SVM classifier). Moreover, it was established that a multiple classifier system can consistently outperform the single classifier systems (98.08% and 92.66% in case of English-Hindi and English-Arabic documents, respectively, using a combination of KNN and SVM classifiers).
A visual appearance-based approach has also been applied to identify and separate script-words in Indian multi-script documents. In [72], [73], two different approaches to script identification at the word level in printed bilingual (Latin and Tamil) documents are presented. The first method structures words into three distinct spatial zones and utilizes the information about the spatial spread of the words in these zones. The second technique analyzes the directional energy distribution of words using Gabor filters with suitable frequencies and orientations. The algorithms are based on the observations that: (1) the spatial spread of Roman characters mostly covers the middle and upper zones; only a few lower case characters spread to the lower zone, (2) the Roman alphabet contains more vertical and slanted strokes, (3) in Tamil, the characters mostly spread to the upper and the lower zones, (4) there is a dominance of horizontal and vertical strokes in Tamil, and (5) the aspect ratio of Tamil characters is generally
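The sum-rule combination used in the multiple classifier systems of Ma et al [70], [71] can be illustrated with a short sketch. It is not their implementation: it assumes that each component classifier returns one score per script on a comparable scale (for example, normalized to sum to one), which the description above does not specify. The function name `sum_rule_decision` and the example scores are hypothetical.

```python
from typing import Dict, List


def sum_rule_decision(scores_per_classifier: List[Dict[str, float]]) -> str:
    """Add the per-script scores of all component classifiers and return the top script.

    Assumption: scores from different classifiers are on a comparable scale.
    """
    if not scores_per_classifier:
        raise ValueError("at least one classifier output is required")
    totals: Dict[str, float] = {}
    for scores in scores_per_classifier:
        for script, score in scores.items():
            totals[script] = totals.get(script, 0.0) + score
    # The script with the largest summed score is the final decision.
    return max(totals, key=lambda script: totals[script])


# Hypothetical word-level scores from two component classifiers.
knn_scores = {"Latin": 0.55, "Arabic": 0.45}
svm_scores = {"Latin": 0.30, "Arabic": 0.70}
decision = sum_rule_decision([knn_scores, svm_scores])  # summed: Latin 0.85, Arabic 1.15
```

In this made-up example the two classifiers disagree, and the sum rule follows the more confident one, which gives an intuition for why combining classifiers may help on difficult script pairs.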
TABLE 1
Script Recognition Methods
Researchers | Method (Features, Classifier) | Script types classified | Scope of application | Best recog. reported
*NA: Not available – recognition result not given in terms of numeric value.
density. This is the measure of character pixels inside a character bounding box which is distinctly very high in scripts using complex ideographic characters. Structurally simple Arabic characters, on the other hand, are low in density. All other scripts across Europe and Asia show more or less the same medium character density. Therefore, while this feature may be good in separating out Han on one hand and Arabic on the other, it does not help much in bringing out the difference between moderately complex scripts like Latin, Cyrillic, Brahmic scripts, etc. The second discriminating feature that Spitz used is the location of upward concavities in characters. An upward concavity is formed when a run of character pixels spans the gap between two white runs just above it. As a result, upward concavities in a character are observed at points where two or more character strokes join. Accordingly, ideograms composed of multiple strokes show many more upward concavities per character compared to that in other scripts. As observed by Spitz [77], there are usually at most two or three upward concavities in a single Latin character while Han characters have many more upward concavities per character that are evenly distributed along the vertical axis. However, we observe that most other scripts also show two or three upward concavities, the same as in the Latin script. So, upward concavity is good for separating Han from others but not good for discrimination among non-Han scripts, except perhaps for Cyrillic which contains a few more
upward concavities compared to other non-Han scripts. demonstrated in [19], [34]. However, it is not difficult
Another problem with these two features is that they to realize that the classification error due to ambiguity
highly depend on document quality. Broken character will increase if the system includes script classes that use
segments may result in detection of false upward con- similar looking characters or even share many common
cavity while noise contributes to optical density measure. characters. Therefore, Hochberg’s method may not be
Non-Han documents tend to be misclassified as Han- suitable in a multi-script country like India where most
based Oriental ones if the document quality is poor, scripts have the same line of origin. Nevertheless, it
either because many characters are broken or noisy. In offers invariance to font-size and computational simplic-
order to cope with such situations, features like character ity. This is because textual symbols are size-normalized
height distribution, character bounding box profiles, hor- and the algorithm uses simple binary shape matching
izontal projections and several other statistical features without any feature value calculation.
were proposed in [16], [17], [18]. These features do not Another important feature proposed by Wood et al
depend on the document quality and resolution but on and used by many researchers is the horizontal projec-
the overall size of the connected components. However, tion. This gives a measure of the spatial spread of the
these features are not invariant to character size and font characters in a script that provides an important clue
and offer high performance only in separating distinctly to script identification. Some scripts can be identified
different Oriental scripts from other non-Han scripts. by detecting the peaks in the projection profile, e.g.
Several different structural features like character ge- Arabic scripts having a strong baseline shows peak at
ometry, occurrence of certain stroke structures and struc- the bottom of the profile while Brahmic scripts with
tural primitives, stroke orientations, measure of cavity ‘shirorekha’ show peak at the top, and so on. However,
regions, side profiles, etc. that directly relate to the this feature also is not good for separating scripts of
character shape have also been used for script character- similar nature and structure. For example, Devnagari,
ization. However, while some features show marked dif- Bengali and Gurumukhi will show the same peak in
ference between two scripts, measures of other features the profile due to ‘shirorekha’; Arabic, Urdu and Farsi
may be the same between that script pair. For example, have the same lower peak. Hence, this feature has not
while Devnagari and Gujrati can be easily identified been used alone but mostly in combination with other
using ‘shirorekha’ and water reservoir-based features, structural features.
character aspect ratio and character moments do not A better approach to script identification is via texture
show much difference. This is because many Gujrati feature extraction using multi-channel Gabor filter that
letters are exactly same as their Devnagari counterpart provides a model for human vision system. That means,
with the headline (‘shirorekha’) removed. Again, there Gabor filter offers a powerful tool to extract out visual
are features that are optimal in one script pair but not in another pair. For example, presence of ‘shirorekha’ may be a good feature for discriminating Latin and Devnagari, but not at all useful in separating Devnagari and Bengali. Therefore, in order to separate out a script from all other scripts, one may need to check a large pool of structural features before any decision can be taken. This may result in the curse of dimensionality. So, a better option may be to do the classification using different sets of features at different levels of hierarchy, as proposed in some of the works above. Another option is to learn the script characteristics in a neural network, as in [25], without bothering about the features to be used for classification. However, a larger network with more hidden units may be necessary for reliable recognition as more and more script classes are included.
Compared to the above, Hochberg et al.'s method is more versatile. The method is based on discovering frequent characters/symbols in every script class and storing them in the system database for matching during classification. Therefore, in principle, the method can identify any number of scripts of varied nature and font as long as they are included in the training set. It is possible to apply the method in a common framework to scripts containing discrete and connected characters, alphabetic and non-alphabetic scripts and so on, as
attributes from a document. This has motivated many researchers to employ Gabor filters for script determination. Since a texture feature gives the general appearance of a script, it can be derived from any script class of any nature. Accordingly, this feature may be considered a universal one. The discriminating power of a multi-channel Gabor filter can be varied by having more channels with different radial frequencies and closely spaced orientation angles. Thus, this system is flexible compared to all other methods and can be effectively used in discriminating scripts that are quite close in appearance. The main criticism of this approach is that it cannot be applied with confidence to small text regions, as in word-wise script recognition. Also, Gabor filters are not capable of handling variations in script size and font, inter-line spacings, etc.
Table 1 also lists recognition rates, as reported in the literature. Since the experiments were conducted independently using different datasets, however, they do not reflect the comparative performance of these methods. To have a proper measure of their relative script separation power, these methods need to be applied to a common dataset. Script recognition performance of some of the above mentioned features, when applied to a common dataset, is given in Table 2. The dataset contains printed documents typeset in ten different scripts, including six scripts used in India. In the absence of
TABLE 2
Script Recognition Results (in Percentage)
Script Features Used                 Latin  Cyrillic  Arabic  Urdu  Chinese  Korean  Devnagari  Bengali  Gujrati  Tamil
Optical Density [13]                 75.4   84.6      89.1    87.2  96.3     93.7    76.2       73.4     74.0     83.8
Textual Symbol [19]                  97.2   92.3      93.7    90.1  97.2     94.3    95.3       97.8     87.1     98.9
Horizontal Projection Profile [23]   89.7   91.2      94.3    92.9  87.5     90.2    92.1       90.0     94.6     76.8
Gabor Coefficients [60]              95.2   92.7      97.2    94.3  93.3     89.9    95.8       91.3     87.8     96.2
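As a reading aid for Table 2, the per-feature figures can be condensed into a mean rate and a weakest script. The following is a minimal sketch only: the numbers are copied from the table, while the names SCRIPTS, RATES and summarize are illustrative assumptions, and the unweighted mean is one possible summary rather than a measure used in the survey.

```python
# Recognition rates (%) copied from Table 2; column order follows the table header.
SCRIPTS = ["Latin", "Cyrillic", "Arabic", "Urdu", "Chinese",
           "Korean", "Devnagari", "Bengali", "Gujrati", "Tamil"]

RATES = {
    "Optical Density [13]":               [75.4, 84.6, 89.1, 87.2, 96.3, 93.7, 76.2, 73.4, 74.0, 83.8],
    "Textual Symbol [19]":                [97.2, 92.3, 93.7, 90.1, 97.2, 94.3, 95.3, 97.8, 87.1, 98.9],
    "Horizontal Projection Profile [23]": [89.7, 91.2, 94.3, 92.9, 87.5, 90.2, 92.1, 90.0, 94.6, 76.8],
    "Gabor Coefficients [60]":            [95.2, 92.7, 97.2, 94.3, 93.3, 89.9, 95.8, 91.3, 87.8, 96.2],
}


def summarize(rates, scripts):
    """Return {feature: (mean rate rounded to 2 decimals, weakest script, its rate)}."""
    summary = {}
    for feature, values in rates.items():
        mean = sum(values) / len(values)
        # Index of the script with the lowest recognition rate for this feature.
        worst = min(range(len(values)), key=values.__getitem__)
        summary[feature] = (round(mean, 2), scripts[worst], values[worst])
    return summary


if __name__ == "__main__":
    for feature, (mean, script, rate) in summarize(RATES, SCRIPTS).items():
        print(f"{feature:36s} mean={mean:6.2f}  weakest={script} ({rate})")
```

Such a summary hides which script pairs are confused with each other, so it complements rather than replaces the per-script columns of the table.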
any standard database, we created our own database by collecting document samples from books and magazines. Some documents were also available from the World Wide Web, which we printed using a laser printer. All the documents were scanned in black and white mode at 300 dpi and then rescaled to have a standard textline height in all documents while maintaining the character aspect ratio. Script recognition was performed at the text-block level. Homogeneous text-blocks of size 256 × 256 pixels were extracted from document pages in such a way that page margins and non-textual parts were excluded. A total of 120 text-blocks were generated per script, each block containing 10 to 12 textlines. The print quality of the documents, and hence the quality of the document images, was reasonably good, containing very little noise.
We observe that the optical density feature is capable of identifying Chinese and Korean, and also Arabic and Urdu to some extent. For other script classes, the recognition rate was well below the acceptable level. This is because the optical density feature is not good enough to discriminate among scripts of similar complexity. The same argument holds for other script features. The Gabor filter method shows relatively better discriminating power in comparison. We noticed that the classification error was mainly due to misclassification between script pairs like Arabic and Urdu; Chinese and Korean; Devnagari and Bengali; Devnagari and Gujrati. These pairs of script classes have characters of the same nature and complexity, and even share some common characters. This leads to ambiguity and hence the classification error. So, on the whole, we may say that every proposed script identification method and script feature works well only when applied within a small set of script classes. Classification accuracy falls significantly when more scripts of similar nature and origin are included.
As observed in Table 1, almost all work on script recognition is targeted toward machine-printed documents. These methods have not been tested for script recognition in handwritten documents. In view of the large amount of handwritten documents that need to be processed electronically nowadays, script identification in handwritten documents turns out to be an important research issue. Unfortunately, the script features proposed for printed documents may not always be effective in the case of handwritten documents. Variations in writing style, character size, inter-line and inter-word spacings make the recognition process difficult and unreliable when these techniques are applied to handwritten documents. Variation in writing across a document can be taken care of by using certain statistical features, as proposed in [20]. The textual symbol-based method can also be used, but with certain modifications: some shape descriptor features can be derived from the text-symbols and the prototypes can be generated through clustering. We demonstrated this approach in an earlier paper [35]. Also, a script class may be represented by multiple models to account for variation in writing from one person to another.
Based on our discussion above, we see that script features are extracted either from a list of connected components like textline, word and character in a document or from a patch of text which may be a complete paragraph, a text-block cropped from the input document or even the whole document page. Script identification methods that use segment-wise analysis of character structure may hence be regarded as a local approach. On the other hand, visual appearance-based methods that are designed to identify script by analyzing the overall look of a text-block may be regarded as a global approach.
As discussed before, many different structural features and methods for script characterization have been proposed over the years. In each of these methods, the features were chosen keeping in view only those script types that were considered therein. Therefore, while these features have proved to be efficient for script identification within a given set of scripts, they may not be good at separating a wider variety of script classes. Again, structural features cannot effectively discriminate between scripts having similar character shapes, which otherwise may be distinguished by their visual appearances. Another disadvantage of structure-based methods is that they require complex preprocessing involving connected component extraction. Also, extraction of structural features is highly susceptible to noise and poor quality document images. Presence of noise or significant image degradation adversely affects the location and segmentation of these features, making them difficult or sometimes impossible to extract.
In short, the choice of features in the local approach to script classification depends on the script classes to be identified. Further, the success of classification in this approach depends on the performance of the preprocessing stage that includes denoising and extraction of connected components. Ironically, document segmentation and extraction of connected components sometimes require the script type to be known a priori. For example, an algorithm that is good for segmenting ideograms in Han may not be equally effective in segmenting alphabetic
tures of the strokes, attained classification accuracies between 86.5% and 95% in different experimental tests. Later, they added two more features in [81], viz. vertical inter-stroke direction and variance of stroke length, and achieved around 0.6% improvement in the classification accuracy.
A unified syntactic approach to online script recognition was presented in [82] and was applied to classifying Latin, Devnagari and Kanji scripts by analyzing their characteristic properties, using fuzzy linguistic descriptors to describe the character features. The fuzzy pattern description language FOHDEL (Fuzzy Online Handwriting Description Language) is used to store fuzzy feature values for every character of a script class in the form of fuzzy rules. For example, the character “b” in the Roman alphabet may be described as comprising two fuzzy linguistic terms: a very straight vertical line at the beginning followed by an almost circular curve at the end. These fuzzy rules aid in decision making during classification.
5 SCRIPT RECOGNITION IN VIDEO TEXT
Script identification is important not only for document analysis but also for text recognition in images and videos. Text recognition in images and videos is important in the context of image/video indexing and retrieval. The process includes several preprocessing steps like text detection, text localization, text segmentation and binarization before an OCR algorithm may be applied. As with documents in a multi-script environment, image/video text recognition in an international environment also requires script identification in order to apply a suitable algorithm for text extraction and recognition. In view of this, an approach for discriminating between Latin and Han script was developed in [83]. The proposed approach proceeds as follows. First, the text present in an image or video frame is localized and size normalized. Then, a set of low-level features is extracted from the edges detected inside the text region. This includes mean and standard deviation of edge pixels, edge pixel density, energy of edge pixels, horizontal projection, and Cartesian moments of the edge pixels. Finally, based on the extracted features, the decision about the type of the script is made using a KNN classifier. Experimental results have demonstrated the efficiency of the proposed method by identifying Latin and Han scripts at rates of 85.5% and 89%, respectively.
6 ISSUES IN MULTI-SCRIPT OCR SYSTEM EVALUATION
In connection with research in script recognition, it is useful and important to develop benchmarks and methodologies that may be employed to evaluate the performance of multi-script OCR systems. Some aspects of this problem have been reported in [84], and are discussed below.
The OCR evaluation approaches are broadly classified into two categories: black-box evaluation and white-box evaluation. In black-box evaluation, only the input and output are visible to the evaluator. In the white-box evaluation procedure, outputs of different modules comprising the system may be accessed and the total system is evaluated stage by stage. Nevertheless, the primary issues related to both types of evaluation are recognition accuracy and processing speed. The parameters that can be varied for the purpose of evaluation are content, font size and style, print and paper quality, scanning resolution, and the amount of noise and degradation in the document images.
Needless to say, the overall performance of a multi-script OCR greatly depends on the performance of the script recognition algorithm used in the system. As with any OCR system, the efficiency of a script recognizer is mainly assessed on the basis of accuracy and speed. Another important performance criterion is the minimum size of the document necessary for the script recognizer to perform reliably. This is to measure how the recognizer performs with varying document size.
In a multi-script system, another issue of consideration is the writing system adopted by a script, script complexity and the size of the character set. Since some scripts are simple in nature and some are quite complex, a relative comparison of performance across scripts is a difficult task. For example, Latin is generally simpler in structure and is based on an alphabetic system. A script identifier that is good at recognizing Latin scripts may not be so in the case of complex non-alphabetic scripts like Arabic, Han and Devnagari. Therefore, in order to evaluate various systems, a standard set of data should be used so that the evaluation is unbiased. However, it is generally difficult to find document data-sets in different languages/scripts that are similar in content and layout. To address this problem, Kanungo et al. introduced the Bible as a data-set for evaluating multilingual and multi-script OCR performance [85]. Bible translations are closely parallel in structure, relevant with respect to modern day language, widely available and inexpensive. These properties make the Bible attractive for controlling document content while varying language and script. The document layout can also be controlled by using synthetically generated page image data. Other holy books whose translations have similar properties, like the Quran and the Bhagavad-Gita, have also been suggested by some researchers.
One major concern with most of the reported works in script recognition is the lack of any comparative analysis of the results. Experimental results given for every proposed method have not been compared with other benchmark works in the field. Moreover, the datasets used in experiments are all different. This is mainly due to the lack of availability of a standard database for script recognition research. Consequently, it is hard to assess the results reported in the literature. Hence, a standard evaluation test-bed containing documents written in only one script type as well as multi-script documents with a mix of different scripts within a document is necessary. One important consideration in selecting the data-set for a script class is that it should reflect the global probability of occurrence of the characters in texts written in that particular script. Another problem of concern is for languages that constantly undergo spelling modifications and graphemic changes over the years. As a result, if an old document is chosen as the corpus, then it may not be suitable for evaluating a modern OCR system. On the other hand, a database of modern documents may not be useful if the goal of the OCR is to process historic documents. This suggests that the data-set should include all different forms of the same language that evolved with time, with a full coverage of the script alphabet of different languages, and it should be large enough to reflect the statistical occurrence probability of the characters.
7 CONCLUSION
This paper presents a comprehensive survey on the developments in script recognition technology, which is an important issue in OCR research in our multilingual multi-script world. Researchers have attempted to characterize different scripts either by extracting their structural features or by deriving some visual attributes. Accordingly, many different script features have been proposed over the years for script identification at different levels within a document: page-wise, paragraph-wise, textline-wise, word-wise and even character-wise. Textline-wise and word-wise script identifications are particularly important for use in a multi-script document. However, compared to the large arsenal of literature available in the field of document analysis and optical character recognition, the volume of work on script identification is relatively thin. The main reason is that most research in the area of OCR has been directed at solving issues within the scope of the country where the research is conducted. Since most countries in the world use only one language/script, OCR research in these countries need not bother determining the script in which a document is written. For instance, the US postal department had spent a lot in developing a system for automatic reading of postal addresses, but under the assumption that all letters originating or arriving in the US will carry addresses written in English only. Script recognition is important only in an international environment or in a country that uses more than one script.
Nonetheless, with recent economic globalization and increased business transactions across the globe, there has been increased awareness of automatic script recognition among the OCR community. That is why the majority of the reported works are dated only during the last decade. However, it is noted that most of these script recognition methods have been tested on machine-printed documents only, and their performance in handwritten documents is not known. In view of this, it would not be wrong to say that script recognition in handwritten documents is still in its early stage of research. Since the present thrust in OCR research is in handwritten document analysis, parallel research on script identification in handwritten documents is in demand. Also, not many of these script recognition techniques have addressed font variation within a script class. Hence, we can conclude that script recognition technology still has a way to go, especially for handwritten document analysis. Therefore, there is an urgent need to work on script recognition of handwritten documents and on developing font independent script recognizers.
As is evident from our analysis, development in script recognition technology lacks a generalized approach to the problem which can handle all different types of scripts under a common framework. While a particular script feature proves to be efficient within a set of scripts, it may not be useful in other scripts. To some extent, texture features can be used universally but cannot be applied reliably at word and character levels within a document.
Finally, we need to create a standard data-set for research in this field. This is necessary to evaluate different script recognition methodologies under the same conditions. The creation of standard data resources will undoubtedly provide a much needed resource to researchers working in this field.
REFERENCES
[1] C.Y. Suen, M. Berthod, and S. Mori, “Automatic Recognition of Handprinted Characters – The State of The Art,” Proc. IEEE, vol. 68, no. 4, pp. 469-487, Apr. 1980.
[2] J. Mantas, “An Overview of Character Recognition Methodologies,” Pattern Recognition, vol. 19, no. 6, pp. 425-430, 1986.
[3] V.K. Govindan and A.P. Shivaprasad, “Character Recognition – A Review,” Pattern Recognition, vol. 23, no. 7, pp. 671-683, 1990.
[4] S. Mori, C.Y. Suen, and K. Yamamoto, “Historical Review of OCR Research and Development,” Proc. IEEE, vol. 80, no. 7, pp. 1029-1058, Jul. 1992.
[5] H. Bunke and P.S.P. Wang (eds.), Handbook of Character Recognition and Document Image Analysis, World Scientific Publishing, Singapore, 1997.
[6] G. Nagy, “Twenty Years of Document Image Analysis in PAMI,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no. 1, pp. 38-62, Jan. 2000.
[7] U. Pal, “Automatic Script Identification: A Survey,” J. Vivek, vol. 16, no. 3, pp. 26-35, 2006.
[8] U. Pal and B.B. Chaudhuri, “Indian Script Character Recognition: A Survey,” Pattern Recognition, vol. 37, no. 9, pp. 1887-1899, Sep. 2004.
[9] L. Peng, C. Liu, X. Ding, and H. Wang, “Multilingual Document Recognition Research and Its Application in China,” Proc. Int’l Conf. Document Image Analysis for Libraries, Lyon, pp. 126-132, Apr. 2006.
[10] A. Nakanishi, Writing Systems of the World: Alphabets, Syllabaries, Pictograms, Charles E. Tuttle Co., Tokyo, 1980.
[11] F. Coulmas, The Blackwell Encyclopedia of Writing Systems, Blackwell Publishers, Oxford, 1996.
[12] C. Ronse and P.A. Devijver, Connected Components in Binary Images: The Detection Problem, John Wiley & Sons, New York, 1984.
[13] A.L. Spitz, “Multilingual Document Recognition,” Proc. Int’l Conf. Electronic Publishing, Document Manipulation & Typography, Maryland, pp. 193-206, Sep. 1990.
[14] A.L. Spitz and M. Ozaki, “Palace: A Multilingual Document Recognition System,” Proc. IAPR Workshop Document Analysis Systems, Kaiserslautern, pp. 16-37, Oct. 1994.
[15] A.L. Spitz, “Determination of The Script and Language Content Proc. Int’l Conf. Document Analysis & Recognition, Edinburgh, pp.
of Document Images,” IEEE Trans. Pattern Analysis & Machine 750-754, Aug. 2003.
Intelligence, vol. 19, no. 3, pp. 235-245, Mar. 1997. [37] R. Krishnapuram and J.M. Keller, “A Possihilistic Approach to
[16] D.-S. Lee, C.R. Nohl, and H.S. Baird, “Language Identification in Clustering,” IEEE Trans. Fuzzy Systems, vol. 1, no. 2, pp. 98-110,
Complex, Unoriented, and Degraded Document Images,” Proc. May 1993.
IAPR Workshop Document Analysis Systems, Malvern, pp. 76-98, [38] D. Ghosh and A.P. Shivaprasad, “An Analytic Approach for
Oct. 1996. Generation of Artificial Handprinted Character Database from
[17] B. Waked, S. Bergler, C.Y. Suen, and S. Khoury, “Skew Detection, Given Generative Models,” Pattern Recognition, vol. 32, no. 6, pp.
Page Segmentation and Script Classification of Printed Document 907-920, Jun. 1999.
Images,” Proc. IEEE Int’l Conf. Systems, Man & Cybernetics, San [39] D.W. Muir and T. Thomas, “Automatic Language Identification
Diego, vol. 5, pp. 4470-4475, Oct. 1998. by Stroke Geometry Analysis,” U.S. Patent No. 6064767, 16 May
[18] L. Lam, J. Ding, and C.Y. Suen, “Differentiating Between Oriental 2000.
and European Scripts by Statistical Features,” Int’l J. Pattern [40] Y.-H. Liu, C.-C. Lin, and F. Chang, “Language Identification of
Recognition & Artificial Intelligence, vol. 12, no. 1, pp. 63-79, Feb. Character Images Using Machine Learning Techniques,” Proc. Int’l
1998. Conf. Document Analysis & Recognition, Seoul, vol. 2, pp. 630-634,
[19] J. Hochberg, P. Kelly, T. Thomas, and L. Kerns, “Automatic Aug.-Sep. 2005.
Script Identification from Document Images Using Cluster-based [41] I. Moalla, A. Elbaati, A.M. Alimi, and A. Benhamadou, “Ex-
Templates,” IEEE Trans. Pattern Analysis & Machine Intelligence, traction of Arabic Text from Multilingual Documents,” Proc.
vol. 19, no. 2, pp. 176-181, Feb. 1997. IEEE Int’l Conf. Systems, Man & Cybernetics, Yasmine Hammamet,
[20] J. Hochberg, K. Bowers, M. Cannon, and P. Kelly, “Script and Oct. 2002, http://ieeexplore.ieee.org/iel5/8325/26298/01173266.
Language Identification for Handwritten Document Images,” Int’l pdf?arnumber=1173266.
J. Document Analysis & Recognition, vol. 2, no. 2/3, pp. 45-52, Dec. [42] I. Moalla, A.M. Alimi, and A. Benhamadou, “Extraction of Ara-
1999. bic Words from Multilingual Documents,” Proc. Conf. Artificial
[21] Y. Tho and Y.Y. Tang, “Discrimination of Oriental and Euramer- Intelligence & Soft Computing, Marbella, Sep. 2004, http://www.
ican Scripts Using Fractal Feature,” in Proc. Int’l Conf. Document actapress.com/PDFViewer.aspx?paperId=18567.
Analysis & Recognition, Seattle, pp. 1115-1119, Sep. 2001. [43] C.L. Tan, P.Y. Leong, and S. He, “Language Identification in Multi-
[22] B.V. Dhandra, P. Nagabhushan, M. Hangarge, R. Hegadi, and lingual Documents,” Proc. Int’l Symp. Intelligent Multimedia &
V.S. Malemath, “Script Identification Based on Morphological Distance Education, Baden-Baden, pp. 59-64, Aug. 1999.
Reconstruction in Document Images,” Proc. IEEE Int’l Conf. Pattern [44] S. Lu, C.L. Tan, and W. Huang, “Language Identification in
Recognition, Hong Kong, vol. 2, pp. 950-953, Aug. 2006. Degraded and Distorted Document Images,” Lecture Notes in Com-
[23] S. Chaudhury and R. Sheth, “Trainable Script Identification Strate- puter Science: Int’l Workshop Document Analysis Systems, Nelson,
gies for Indian Languages,” Proc. Int’l Conf. Document Analysis & LNCS-3872, pp. 232-242, Feb. 2006.
Recognition, Bangalore, pp. 657-660, Sep. 1999. [45] C.V. Jawahar, M.N.S.S.K. Pavan Kumar, and S.S. Ravi Kiran, “A
[24] S.B. Patil and N.V. Subbareddy, “Neural Network Based System Bilingual OCR for Hindi–Telugu Documents and Its Applica-
for Script Identification in Indian Documents,” Sadhana, vol. 27, tions,” Proc. Int’l Conf. Document Analysis & Recognition, Edin-
part 1, pp. 83-97, Feb. 2002. burgh, pp. 408-412, Aug. 2003.
[25] Z. Chi, Q. Wang, and W.-C. Siu, “Hierarchical Content Classifica- [46] S. Sinha, U. Pal, and B.B. Chaudhuri, “Word-wise Script Identifi-
tion and Script Determination for Automatic Document Image cation from Indian Documents,” Lecture Notes in Computer Science:
Processing,” Pattern Recognition, vol. 36, no. 11, pp. 2483-2500, IAPR Int’l Workshop Document Analysis Systems, Florence, LNCS-
Nov. 2003. 3163, pp. 310-321, Sep. 2004.
[26] S. Kanoun, A. Ennaji, Y. Lecourtier, and A.M. Alimi, “Script and [47] S. Chanda. S. Sinha, and U. Pal, “Word-wise English Devnagari
Nature Differentiation for Arabic and Latin Text Images,” Proc. and Oriya Script Identification,” Speech and Language Systems for
Int’l Workshop Frontiers in Handwriting Recognition, Niagra, pp. Human Communication, R.M.K. Sinha and V.N. Shukla (eds.), Tata
309-313, Aug. 2002. McGraw-Hill, New Delhi, pp. 244-248, 2004.
[27] L. Zhou, Y. Lu, and C.L. Tan, “Bangla/English Script Identification [48] S. Chanda and U. Pal, “English, Devnagari and Urdu Text Iden-
Based on Analysis of Connected Component Profiles,” Lecture tification,” Proc. Int’l Conf. Cognition & Recognition, Mandya, pp.
Notes in Computer Science: Int’l Workshop Document Analysis Sys- 538-545, Dec. 2005.
tems, Nelson, LNCS-3872, pp. 243-254, Feb. 2006. [49] S. Chanda, R.K. Roy, and U. Pal, “English and Tamil Text Iden-
[28] U. Pal and B.B. Chaudhuri, “Script Line Separation from Indian tification,” Proc. Nat’l Conf. Recent Trends in Information Systems,
Multi-script Documents,” Proc. Int’l Conf. Document Analysis & Kolkata, pp. 184-187, Jul. 2006.
Recognition, Bangalore, pp. 406-409, Sep. 1999. [50] M.C. Padma and P. Nagabhushan, “Identification and Separation
[29] U. Pal and B.B. Chaudhuri, “Identification of Different Script of Text Words of Kannada, Hindi and English Languages Through
Lines from Multi-script Documents,” Image & Vision Computing, Discriminating Features,” Proc. Nat’l Conf. Document Analysis &
vol. 20, no. 13-14, pp. 945-954, Dec. 2002. Recognition, Mandya, pp. 252-260, Jul. 2003.
[30] U. Pal, S. Sinha, and B.B. Chaudhuri, “Multi-script Line Identifica- [51] R. Kumar, V. Chaitanya, and C.V. Jawahar, “A Novel Approach to
tion from Indian Documents,” Proc. Int’l Conf. Document Analysis Script Separation,” Proc. Int’l Conf. Advances in Pattern Recognition,
& Recognition, Edinburgh, pp. 880-884, Aug. 2003. Kolkata, pp. 289-292, Dec. 2003.
[31] A. Elgammal and M.A. Ismail, “Techniques for Language Identi- [52] K. Roy, U. Pal, and B.B. Chaudhuri, “Address Block Location
fication for Hybrid Arabic-English Document Images,” Proc. Int’l and Pin Code Recognition for Indian Postal Automation,” Proc.
Conf. Document Analysis & Recognition, Seattle, pp. 1100-1104, Sep. Workshop Computer Vision, Graphics & Image Processing, Gwalior,
2001. pp. 5-9, Feb. 2004.
[32] C.S. Cumbee, “Method of Identifying Script of Line of Text,” U.S. [53] K. Roy, S. Vajda, U. Pal, B.B. Chaudhuri, and A. Belaid, “A System
Patent No. 7020338, 28 Mar. 2006. for Indian Postal Automation,” Proc. Int’l Conf. Document Analysis
[33] S.-W. Lee and J.-S. Kim, “Multi-lingual, Multi-font, Multi-size & Recognition, Seoul, vol. 2, pp. 1060-1064, Aug.-Sep. 2005.
Large-set Character Recognition Using Self-organizing Neural [54] K. Roy, D. Pal, and U. Pal, “Pin-code Extraction and Recognition
Network,” Proc. Int’l Conf. Document Analysis & Recognition, Mon- for Indian Postal Automation,” Proc. Nat’l Conf. Recent Trends in
treal, vol. 1, pp. 28-33, Aug. 1995. Information Systems, Kolkata, pp. 192-195, Jul. 2006.
[34] J. Hochberg, M. Cannon, P. Kelly, and J. White, “Page Segmenta- [55] K. Roy and U. Pal, “Word-wise Hand-written Script Separation
tion Using Script Identification Vectors: A First Look,” Proc. Symp. for Indian Postal Automation,” Proc. Int’l Workshop Frontiers in
Document Image Understanding Technology, Annapolis, pp. 258-264, Handwriting Recognition, La Baule, pp. 521-526, Oct. 2006.
Apr.-May 1997. [56] K. Roy, U. Pal, and B.B. Chaudhuri, “Neural Network Based
[35] D. Ghosh and A.P. Shivaprasad, “Handwritten Script Identifica- Word-wise Handwritten Script Identification System for Indian
tion Using Possibilistic Approach for Cluster Analysis,” J. Indian Postal Automation,” Proc. Int’l Conf. Intelligent Sensing & Informa-
Inst. of Science, vol. 80, pp. 215-224, May-Jun. 2000. tion Processing, Chennai, pp. 240-245, Jan. 2005.
[36] V. Ablavsky and M.R. Stevens, “Automatic Feature Selection with [57] S.L. Wood, X. Yao, K. Krishnamurthi, and L. Dang, “Language
Applications to Script Identification of Degraded Documents,” Identification for Printed Text Independent of Segmentation,”
Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
Proc. Int’l Conf. Image Processing, Washington D.C., vol. 3, pp. 428- [81] A.M. Namboodiri and A.K. Jain, “Online Handwritten Script
431, Oct. 1995. Recognition,” IEEE Trans. Pattern Analysis & Machine Intelligence,
vol. 26, no. 1, pp. 124-130, Jan. 2004.
[58] T.N. Tan, “Rotation Invariant Texture Features and Their Use in Automatic Script Identification,” IEEE Trans. Pattern Analysis & Machine Intelligence, vol. 20, no. 7, pp. 751-756, Jul. 1998.
[59] L. O’Gorman and R. Kasturi, Document Image Analysis, IEEE-CS Press, Los Alamitos, 1995.
[60] G.S. Peake and T.N. Tan, “Script and Language Identification from Document Images,” Lecture Notes in Computer Science: Asian Conf. Computer Vision, Hong Kong, LNCS-1352, pp. 97-104, Jan. 1998.
[61] R.M. Haralick, K. Shanmugam, and I. Dinstein, “Textural Features for Image Classification,” IEEE Trans. Systems, Man, & Cybernetics, vol. 3, no. 6, pp. 610-621, Nov. 1973.
[62] W.M. Pan, C.Y. Suen, and T.D. Bui, “Script Identification Using Steerable Gabor Filters,” Proc. Int’l Conf. Document Analysis and Recognition, Seoul, vol. 2, pp. 883-887, Aug.-Sep. 2005.
[63] V. Singhal, N. Navin, and D. Ghosh, “Script-based Classification of Hand-written Text Documents in a Multilingual Environment,” Proc. Int’l Workshop Research Issues in Data Engineering – Multi-lingual Information Management, Hyderabad, pp. 47-54, Mar. 2003.
[64] J. Cheng, X. Ping, G. Zhou, and Y. Yang, “Script Identification of Document Image Analysis,” Proc. Int’l Conf. Innovative Computing, Information and Control, Beijing, vol. 3, pp. 178-181, Aug.-Sep. 2006.
[65] A.K. Jain and Y. Zhong, “Page Segmentation Using Texture Analysis,” Pattern Recognition, vol. 29, no. 5, pp. 743-770, May 1996.
[66] A. Busch, W.W. Boles, and S. Sridharan, “Texture for Script Identification,” IEEE Trans. Pattern Analysis & Machine Intelligence, vol. 27, no. 11, pp. 1720-1732, Nov. 2005.
[67] A. Busch, “Multi-font Script Identification Using Texture-based Features,” Lecture Notes in Computer Science: Int’l Conf. Image Analysis & Recognition, Póvoa de Varzim, LNCS-4142, pp. 844-852, Sep. 2006.
[68] G.D. Joshi, S. Garg, and J. Sivaswamy, “Script Identification from Indian Documents,” Lecture Notes in Computer Science: IAPR Int’l Workshop Document Analysis Systems, Nelson, LNCS-3872, pp. 255-267, Feb. 2006.
[69] W. Chan and G.G. Coghill, “Text Analysis Using Local Energy,” Pattern Recognition, vol. 34, no. 12, pp. 2523-2532, Dec. 2001.
[70] H. Ma and D. Doermann, “Gabor Filter Based Multi-class Classifier for Scanned Document Images,” Proc. Int’l Conf. Document Analysis & Recognition, Edinburgh, pp. 968-972, Aug. 2003.
[71] S. Jaeger, H. Ma, and D. Doermann, “Identifying Script on Word-level with Informational Confidence,” Proc. Int’l Conf. Document Analysis & Recognition, Seoul, vol. 1, pp. 416-420, Aug.-Sep. 2005.
[72] D. Dhanya, A.G. Ramakrishnan, and P.B. Pati, “Script Identification in Printed Bilingual Documents,” Sadhana, vol. 27, part 1, pp. 73-82, Feb. 2002.
[73] D. Dhanya and A.G. Ramakrishnan, “Script Identification in Printed Bilingual Documents,” Lecture Notes in Computer Science: IAPR Int’l Workshop Document Analysis Systems, Princeton, LNCS-2423, pp. 13-24, Aug. 2002.
[74] D. Dhanya and A.G. Ramakrishnan, “Optimal Feature Extraction for Bilingual OCR,” Lecture Notes in Computer Science: IAPR Int’l Workshop Document Analysis Systems, Princeton, LNCS-2423, pp. 25-36, Aug. 2002.
[75] P.B. Pati, S. Sabari Raju, N. Pati, and A.G. Ramakrishnan, “Gabor Filters for Document Analysis in Indian Bilingual Documents,” Proc. Int’l Conf. Intelligent Sensing and Information Processing, Chennai, pp. 123-126, Jan. 2004.
[76] P.B. Pati and A.G. Ramakrishnan, “HVS Inspired System for Script Identification in Indian Multi-script Documents,” Lecture Notes in Computer Science: Int’l Workshop Document Analysis Systems, Nelson, LNCS-3872, pp. 380-389, Feb. 2006.
[77] A.L. Spitz, “Script and Language Determination from Document Images,” Proc. Annual Symp. Document Analysis & Information Retrieval, Las Vegas, pp. 229-235, Apr. 1994.
[78] J.J. Lee, B.K. Sin, and J.H. Kim, “On-line Mixed Character Recognition Using An HMM Network,” Proc. KISS Annual Conf., vol. 20, no. 2, pp. 317-320, Oct. 1993.
[79] J.J. Lee, J.H. Kim, and M. Nakajima, “A Hierarchical HMM Network-based Approach for On-line Recognition of Multi-lingual Cursive Handwritings,” IEICE Trans. Information & Systems, vol. E81-D, no. 8, pp. 881-888, Aug. 1998.
[80] A.M. Namboodiri and A.K. Jain, “Online Script Recognition,” Proc. Int’l Conf. Pattern Recognition, Quebec, vol. 3, pp. 736-739, Aug. 2002.
[82] A. Malaviya and L. Peters, “Fuzzy Handwriting Description Language: FOHDEL,” Pattern Recognition, vol. 33, no. 1, pp. 119-131, Jan. 2000.
[83] J. Gllavata and B. Freisleben, “Script Recognition in Images with Complex Backgrounds,” Proc. IEEE Int’l Symp. Signal Processing & Information Technology, Athens, pp. 589-594, Dec. 2005.
[84] B.B. Chaudhuri, “On Multi-script OCR System Evaluation,” Int’l Workshop Performance Evaluation Issues in Multi-lingual OCR, Bangalore, Sep. 1999, http://www.kanungo.com/workshop/abstracts/chaudhuri.html.
[85] T. Kanungo, P. Resnik, S. Mao, D.-W. Kim, and Q. Zheng, “The Bible and Multilingual Optical Character Recognition,” Communications of the ACM, vol. 48, no. 6, pp. 124-130, Jun. 2005.

D. Ghosh received the B.E. degree in electronics & communication engineering from M.R. Engineering College, Jaipur, India, in 1993, and the M.S. and Ph.D. degrees in electrical communication engineering from the Indian Institute of Science, Bangalore, India, in 1996 and 2000, respectively. From April 1999 to November 1999, he was a DAAD Research Fellow at the University of Kaiserslautern, Germany. In November 1999, he joined the Indian Institute of Technology Guwahati, India, as an Assistant Professor of electronics and communication engineering. He spent the 2003-2004 academic year as a visiting faculty member in the Department of Electrical & Computer Engineering, National University of Singapore. Between 2006 and 2008, he was a Senior Lecturer with the Faculty of Engineering and Technology, Multimedia University, Malaysia. He is currently an Associate Professor with the Department of Electronics & Computer Engineering, Indian Institute of Technology Roorkee, India. His teaching and research interests include image/video processing, computer vision and pattern recognition.

T. Dube received the B.Tech. degree in electronics and communication engineering from the Indian Institute of Technology Guwahati, India, in 2006. Soon after her graduation, she joined the Indian division of British Telecom at Bangalore, and later moved to Ibibo Web Pvt. Ltd., Gurgaon, India, as a Software Engineer. Between 2007 and 2009, she worked as a Senior Software Engineer with Infovedics Software Pvt. Ltd., Noida, India. She received a search developer certification from FAST University, Norway, in 2007. She is currently pursuing a management degree from the Indian Institute of Management, Ahmedabad, India.

A.P. Shivaprasad received the B.E., M.E. and Ph.D. degrees in electrical communication engineering from the Indian Institute of Science, Bangalore, India, in 1965, 1967 and 1972, respectively. From 1967 until he retired as a Professor in 2006, he was a member of the academic staff of the Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore, India. He is currently a Guest Professor with the Department of Electronics & Communication Engineering, Sambhram Institute of Technology, Bangalore, India. His research interests include design of micro-power VLSI circuits, intelligent instrumentation, communication systems and pattern recognition.