0% found this document useful (0 votes)
13 views22 pages

Script Recognition Ghosh 2009

Uploaded by

thanhtung170605
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views22 pages

Script Recognition Ghosh 2009

Uploaded by

thanhtung170605
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

See discussions, stats, and author profiles for this publication at: https://www.researchgate.

net/publication/47544548

Script Recognition-A Review

Article in IEEE Transactions on Pattern Analysis and Machine Intelligence · December 2010
DOI: 10.1109/TPAMI.2010.30 · Source: PubMed

CITATIONS READS
263 17,586

3 authors:

Debashis Ghosh Tulika Dube


Indian Institute of Technology Roorkee Indian Institute of Management
76 PUBLICATIONS 1,265 CITATIONS 2 PUBLICATIONS 268 CITATIONS

SEE PROFILE SEE PROFILE

Shivaprasad Adamane
Indian Institute of Science
1 PUBLICATION 263 CITATIONS

SEE PROFILE

All content following this page was uploaded by Tulika Dube on 12 July 2014.

The user has requested enhancement of the downloaded file.


This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009 1

Script Recognition – A Review


D. Ghosh, T. Dube, and A.P. Shivaprasad

Abstract—A variety of different scripts are used in writing languages throughout the world. In a multi-script, multilingual environment, it
is essential to know the script used in writing a document before an appropriate character recognition and document analysis algorithm
can be chosen. In view of this, several methods for automatic script identification have been developed so far. They mainly belong to
two broad categories – structure-based and visual appearance-based techniques. This survey report gives an overview of the different
script identification methodologies under each of these categories. Methods for script identification in online data and video-texts are
also presented. It is noted that the research in this field is relatively thin and still more research is to be done, particularly in case of
handwritten documents.

Index Terms—Document analysis, Optical character recognition, Script identification, Multi-script document.

1 I NTRODUCTION recognizing characters irrespective of the script in which


they are written. In general, recognition of different
O NEinteresting and challenging field of research in
pattern recognition is Optical Character Recogni-
tion (OCR). Optical character recognition is the process
script characters in a single OCR module is difficult. This
is because features necessary for character recognition
in which a paper document is optically scanned and then depend on the structural properties, style and nature
converted into computer processable electronic format of writing which generally differs from one script to
by recognizing and associating symbolic identity with another. For example, features used for recognition of
every individual character in the document. English alphabets are in general not good for recognizing
With the increasing demand for creating a paperless Chinese logograms.
world, many OCR algorithms have been developed over Another option for handling documents in a multi-
the years [1], [2], [3], [4], [5], [6]. However, most OCR script environment is to use a bank of OCRs corre-
systems are script-specific in the sense that they can read sponding to all different scripts expected to be seen. The
characters written in one particular script only. Script is characters in an input document can then be recognized
defined as the graphic form of the writing system used reliably by selecting the appropriate OCR system from
to write statements expressible in language. That means, the OCR bank. Nevertheless, this will require to know
a script class refers to a particular style of writing and a priori the script in which the input document is writ-
the set of characters used in it. Languages throughout ten. Unfortunately, this information may not be readily
this world are typeset in many different scripts. A script available. At the same time, manual identification of the
may be used by only one language or may be shared by documents’ scripts may be tedious and time consuming.
many languages, sometimes with slight variations from Therefore, automatic script recognition techniques are
one language to other. For example, Devnagari is used necessary to identify the script in the input document
for writing a number of Indian languages like Sanskrit, and then redirect it to the appropriate character recog-
Hindi, Konkani, Marathi, etc., English, French, German nition module, as illustrated in Fig. 1.
and some other European languages use different vari- Script recognizer is also useful in reading multi-script
ants of the Latin alphabet, and so on. Some languages documents in which different paragraphs, text-blocks,
even use different scripts at different point of time and textlines or words in a page are written in different
space. One good example for this is Malay that uses scripts. Fig. 2 shows several examples of multi-script
the Latin alphabet nowadays replacing previously used documents. Analysis of such documents works in two
Jawi. Another example is Sanskrit that is mainly written stages — identification and separation of different script
in Devnagari in India but is also written in Sinhala
script in Sri Lanka. Therefore, in this multilingual and
multi-script world, OCR systems need to be capable of

• D. Ghosh is with the Department of Electronics & Computer Engineering,


Indian Institute of Technology, Roorkee 247 667, India.
E-mail: [email protected]
• T. Dube is with the Indian Institute of Management, Ahmedabad 380 015,
India.
• A.P. Shivaprasad is with the Department of Electronics & Communication
Engineering, Sambhram Institute of Technology, Bangalore 560 097, India.
Fig. 1. Stages of document processing in a multi-script environment.

Authorized
Digital Object licensed
Indentifier use limited to: UNIVERSIDADE ESTADUAL DE0162-8828/10/$26.00
10.1109/TPAMI.2010.30 CAMPINAS. Downloaded
© 2010 on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
IEEE
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

2 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009

of script identification research for Indian documents is


also available in [8]. A report on the key technologies
in multilingual OCR and their application in building
multilingual digital library can also be found in [9].
In this paper, we present a comprehensive survey of
different script recognition techniques developed mainly
for identification of certain major scripts of the world,
viz. Chinese, Japanese, Korean, Arabic, Hebrew, Latin,
Fig. 2. Examples of multi-script document images: (a) a government Cyrillic and the Brahmic family of Indian scripts. To
report in China containing mix of Chinese and English words, (b) a
medical report in Arabic containing words in English that do not have begin with, in Section 2, we give a brief description
exact Arabic equivalent, (c) portion of an official application form in India of different script types highlighting their main dis-
containing different script-lines typeset in Hindi and English. criminating features. Methods for script recognition in
document images are described in Section 3 giving
regions in the document followed by reading of each in- comparative analysis among them. Section 4 discusses
dividual script region using corresponding OCR system. several methods for script recognition in the realm of pen
Script identification also serves as an essential precur- computing. As said before, script identification in video
sor for recognizing the language in which a document text is also important. However, not much research has
is written. This is necessary for further processing of the been done on this topic. The only work that we have
document, such as routing, indexing or translation. For found on this is outlined in Section 5. Section 6 raises
scripts used by only one language, script identification issues related to performance evaluation of multi-script
itself accomplishes language identification. For scripts OCR systems. Finally, we state our concluding remarks
shared by many languages, script recognition acts as in Section 7, including some insights on the recent trends
the first level of classification followed by language and future scope of work in this field.
identification within the script.
Script recognition also helps in text area identification,
video indexing and retrieval, and document sorting in 2 W RITING S YSTEMS AND S CRIPTS OF THE
digital libraries when dealing with a multi-script envi- W ORLD
ronment. Text area detection refers to either segment-
ing out text-blocks from other non-textual regions like In the context of script recognition, it may be worth
halftones, images, line drawings, etc. in a document studying the characteristics of various writing systems
image, or extracting text printed against textured back- and the structural properties of the characters used in
grounds and/or embedded in images within a docu- certain major scripts of the world. In Fig. 3, we draw
ment. To do this, the system takes advantage of script a tree diagram showing different classes of writing
specific distinctive characteristics of text which make it systems. As said in [10], [11] and depicted in the tree
stand out from other non-textual parts in the document. diagram, there are six prominent writing systems. Major
Text extraction is also required in images and videos scripts that follow each of these writing systems are also
for content-based browsing. One powerful index for shown in the tree diagram and are described below.
image/video retrieval is the text appearing in them.
Efficient indexing and retrieval of digital image/video in
2.1 Logographic system
an international scenario, therefore, requires text extrac-
tion followed by script identification and then character A logogram, also called ideogram, refers to a symbol that
recognition. Similarly, text found in documents can be graphically represents a complete word. Accordingly,
used for their annotation, indexing, sorting and retrieval. the number of characters in a script for an ideographic
Thus, script identification plays an important role in writing system generally runs into the thousands. This
building a digital library containing documents written makes recognition of logographic characters a difficult
in different scripts. but interesting problem.
In short, automatic script identification is crucial to An example of logographic script is Han which is
meet the growing demand for electronic processing of mainly associated with Chinese. Japanese and Korean
volumes of documents written in different scripts. This writings also include Han modified as Kanji and Hanja,
is important for business transactions across Europe and respectively. Han characters are generally composed of
Orient, and has great significance in a country like India multiple short strokes giving them a complex and dense
which has many official state languages and scripts. Due look, distinctly different from other Western and Asian
to this, there has been a growing interest in multi-script scripts. Accordingly, character optical density and cer-
OCR technology during recent years. A brief survey on tain other visual appearance-based features have been
methods for script recognition had been reported earlier utilized by many researchers in distinguishing Han from
in [7], with emphasis on script identification in Indian other scripts. Another interesting property of Han is its
multi-script documents but little insights into the script directionality — words in a textline are written either
recognition methods for non-Indian scripts. A review from left to right or from top to bottom.

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

GHOSH ET AL.: SCRIPT RECOGNITION – A REVIEW 3

Fig. 3. Tree diagram showing broad classification of prominent writing systems and scripts of the present world.

2.2 Syllabic system somewhat similar to that of Latin except that it uses
In a syllabic system, every written symbol represents a a different alphabet set. Some characters in the Cyrillic
phonetic sound or syllable, as used in Japanese. The sym- alphabet are also borrowed from Latin and Greek, modi-
bols representing the Japanese syllables are known as fied with cedillas, crosshatches or diacritical marks. This
Kanas which are of two types — Hirakana and Katakana. induces recognition ambiguity between Cyrillic, Latin
As indicated in Fig. 3, Japanese script uses a mix of lo- and Greek.
gographic Kanji and syllabic Kanas. Hence, it is visually
similar to Chinese, but less dense due to the presence of 2.4 Abjads
simpler Kanas in between the logograms. The Abjad system of writing is similar to the alpha-
betic system, but has symbols for consonantal sounds
2.3 Alphabetic system only. Unlike most other scripts in the world, Abjads are
An alphabet is a set of characters representing phonemes written from right to left within a textline. This unique
of a spoken language. Examples of scripts following this feature is particularly useful for identifying Abjad-based
system are Greek, Latin, Cyrillic and Armenian. The scripts in pen computing.
Latin script, also called Roman script, is used by many Two important scripts under this category are Arabic
languages throughout this world with varying degrees and Hebrew. A typical Arabic character is formed of
of modifications from one language to another. It is used a long main stroke along with one to three dots. The
for writing many European languages like English, Ital- characters in a word are generally conjoined giving
ian, French, German, Portuguese, Spanish, etc., and has an overall cursive appearance to the written text. This
been adopted in many Amerindian and Austronesian provides an important clue for the recognition of Arabic
languages including modern Malay, Vietnamese and In- script. The same applies to some other scripts of Arabic
donesian language. Fig. 4 shows few such variants of the origin such as Farsi (Persian), Urdu, Sindhi, Jawi, etc.
Latin script. Compared to other scripts, classical Latin On the other hand, character strokes in Hebrew are more
characters are simple in structure, mainly composed of uniform in length and the letters in a word are generally
few lines and arcs. The other major script under the discrete.
alphabetic system is Cyrillic. This script is used by some
languages of Eastern Europe, Asia and Slavic regions 2.5 Abugidas
that include Bulgarian, Russian, Macedonian, Ukrainian,
Abugida is another alphabetic-like writing system used
Mongolian, etc. The basic properties of this script are
by the Brahmic family of scripts that originated from the
ancient Indian Brahmi script and includes nearly all the
scripts of India and southeast Asia. In Fig. 5, we draw a
tree diagram to illustrate the evolution of major Brahmic
scripts in India and southeast Asia. The northern group
of Brahmic scripts (e.g. Devnagari, Bengali, Manipuri,
Gurumukhi, Gujrati and Oriya) bear strong resemblance
to the original Brahmi script. On the other hand, scripts
in south India (Tamil, Telugu, Kannada and Malayalam)
Fig. 4. Examples of some languages using the Latin alphabet with as well as in southeast Asia (e.g. Thai, Lao, Burmese,
different modifications. Javanese and Balinese) are derived from Brahmi through

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

4 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009

Fig. 5. The Brahmic family of scripts used in India and southeast Asia.

many changes and so look quite different from the north- make it possible to distinguish it from other scripts. So,
ern group. One important characteristic of Devnagari, the basic task involved in script recognition is to devise a
Bengali, Gurumukhi and Manipuri is that the characters technique to discover these features from a given docu-
in a word are generally written together without spaces, ment and then classify the document’s script accordingly.
so that the top bar is unbroken. This results in the Based on the nature of approach and features used,
formation of headline, called shirorekha, at the top of each these methods may be divided into two broad cate-
word. Accordingly, these scripts can be separated from gories — structure-based and visual appearance-based
other script types by detecting the presence of a large methods. Script recognition techniques in each of these
number of horizontal lines in the textual portions of a two categories may be further classified on the basis of
document. the level at which they are applied inside a document
image, viz. page-wise, paragraph-wise, textline-wise and
2.6 Featural system word-wise. The application mode of a method depends
The last significant form of writing system is the featural on the minimum size of the text from which the fea-
system in which the symbols or characters represent the tures proposed in the method can be extracted reliably.
features that make up the phonemes. One prominent Various algorithms under each of these categories are
script of this sort is the Korean Hangul. As indicated in summarized below.
Fig. 3, the Korean script is formed by mixing logographic
Hanja with featural Hangul. However, modern Korean 3.1 Structure-based script recognition
contains more of Hangul than Hanja. Consequently, In general, script classes differ from each other in their
Korean script is relatively less complex and less dense stroke structure and connections, and the writing styles
compared to Chinese and Japanese, containing more associated with the character sets they use. One ap-
circles and ellipses. proach to script recognition may be to extract con-
nected components (continuous runs of pixels) in a doc-
3 S CRIPT R ECOGNITION M ETHODOLOGIES ument [12] and then analyze their shapes and structures
Script identification relies on the fact that each script so as to reveal the intrinsic morphological characteristics
has unique spatial distribution and visual attributes that of the script used in the document. In machine-printed

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

GHOSH ET AL.: SCRIPT RECOGNITION – A REVIEW 5

classifying printed documents written in Han, Latin,


Cyrillic and Arabic. These statistical features are more
robust compared to the structural features proposed by
Spitz and Lee et al. However, Waked et al achieved an
accuracy rate of only 91% when tested on documents of
varying kinds, diverse formats and qualities. This drop
in recognition accuracy is mainly due to misclassification
between Latin and Cyrillic scripts, which are similar-
Fig. 6. Spitz’s method of script identification. looking under this measure. Also, some test documents
of extremely poor quality account for this degradation
Latin, Greek, Han, etc., every individual character or part in performance.
of a character is a connected component. On the other Script identification in machine-printed documents us-
hand, in cursive handwritten documents, the characters ing statistical features has also been explored by Lam
in a word or part of a word can touch each other to form et al [18]. In a first level of classification, documents
one single connected component. Likewise, in scripts like are classified as Latin, Chinese, Japanese or Korean
Devnagari, Bengali, Arabic, etc., a word or a part of a using horizontal projection profiles, height distributions
word forms a connected component. Script identification of connected components and enclosing structure of
methods that are based on extraction and analysis of con- connected components. Non-Latin documents that can-
nected components fall under the category of structure- not be recognized in this stage are classified in a sec-
based methods. ond level of recognition using structural features like
character complexity, presence of circles, ellipses and
3.1.1 Page-wise script identification methods vertical strokes. In the process, more than 95% correct
A script identification method that relies on the spa- recognition was achieved.
tial relationship of character structures was developed The fact that every script class is composed of some
by Spitz for differentiating Han and Latin scripts in “textual symbols” of unique characteristic shapes had
machine-printed documents. In his first work on this been exploited by Hochberg et al in identifying the
topic [13], he used character optical density for classi- script of a printed document [19]. First, textual symbols
fying individual textlines in a document as being ei- obtained from documents of a known script are resized
ther English or Japanese. In another paper, Spitz used and clustered to generate template symbols for that
vertical distribution of upward concavities in characters script class, as depicted in Fig. 7. Textual symbols in-
for discriminating Han from Latin with 100% success clude character fragments, discrete characters, adjoined
in continuous production use [14]. Later, he developed characters, and even whole words. During classification,
a two stage classifier in [15] by combining these two textual symbols extracted from the input document are
features. In the first stage, Latin is separated from compared to the template symbols using Hamming dis-
Han-based scripts by comparing the variances of their tance and then scored against every script class on the
upward concavity distributions. Further classification basis of their distances from the best match template
within the Han-based scripts is performed by analyzing symbols in that script class. The script class with the best
the distribution of optical density in the text image. The average score is chosen as the script of the document.
system also has provisions for language identification Hochberg et al tested their method on as many as thir-
within documents using the Latin alphabet by observing teen scripts, viz. Arabic, Armenian, Burmese, Chinese,
the most frequently occurring character shape codes. A Cyrillic, Devanagari, Ethiopic, Greek, Hebrew, Japanese,
schematic diagram showing the flow of information in Korean, Latin and Thai, and obtained 96% accuracy.
the process is given in Fig. 6.
In [20], Hochberg and others proposed a feature-
The above works by Spitz was extended by Lee et
based approach for script identification in handwritten
al in [16], and by Waked et al in [17] by incorporat-
documents and achieved 88% accuracy in distinguishing
ing some additional features. In [16], the script of a
Arabic, Chinese, Cyrillic, Devnagari, Japanese and Latin.
printed document is identified via textline-wise script
In their method, a handwritten document is character-
recognition followed by a majority vote of the already
ized in terms of mean, standard deviation and skew of
decided textline classification results. The features used
are character height distribution and the top and bot-
tom profiles of character bounding boxes, in addition
to upward concavity distribution and optical density
features. Experimental results showed that these fea-
tures can separate Han-based (Chinese and Japanese)
documents from Latin-based (English, French, German,
Italian and Spanish) documents in 98.16% cases. In [17],
Waked et al used bounding box size distribution, char- Fig. 7. Hochberg et al’s method of script identification in printed
acter density distribution and horizontal projections, for documents.

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

6 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009

Fig. 8. Hochberg et al’s method of script identification in handwritten


documents.

five features which are relative vertical centroid, relative


horizontal centroid, number of holes, sphericity and
aspect ratio of the connected components in a document Fig. 9. Chaudhury and Sheth’s three methods of script identification.
page. A set of Fisher linear discriminants (FLD), one FLD
for every pair of script classes, is used for classification. Multi-script OCR systems that perform script recognition
The document is finally assigned to the script class to at the paragraph level are now described.
which it is classified most often. A schematic diagram Fig. 9 shows three different strategies developed by
showing different stages of the system is given in Fig. 8. Chaudhury and Sheth in [23] to recognize the script of a
A novel approach to script identification using fractal text-block in a printed document. In the first technique,
features was proposed in [21] and had been utilized for the script of the text-block is described in terms of the
discriminating printed Chinese, Japanese and Devnagari Fourier coefficients of the horizontal projection profile.
scripts. Fractal features are obtained by computing frac- Subsequent classification is based on Euclidean distance
tal signatures for the patterns extracted from a document in the eigenspace. The other two schemes are based on
image. The fractal signature is determined by the area of features derived from connected components in text-
the surface onto which a gray-level function correspond- blocks — one using the means and standard deviations
ing to the document image is mapped. of the outputs for a six-channel Gabor filter and the
A method for script identification in printed docu- other using distribution of the width-to-height ratio of
ment images based on morphological reconstruction had the connected components present in the document.
been proposed in [22]. In this method, morphological Classification in both these cases are accomplished using
erosion and opening by reconstruction is carried out on Mahalanobis distance. Average recognition rate obtained
the document image in horizontal, vertical, right and with these methods, when tested on Latin, Devnagari,
left diagonal directions using line structuring elements. Telugu and Malayalam scripts, were approximately 85%,
The average pixel distributions in the resulting images 95% and 89%, respectively.
give the measures of horizontal, vertical, 45o and 135o In [24], a neural network-based architecture was de-
slanted lines present in the document page. Finally, script veloped for identification of printed Latin, Devnagari
identification is carried out using nearest neighbor clas- and Kannada scripts. It consists of a feature extractor fol-
sification. The method showed robustness with respect lowed by a modular neural network, as shown in Fig. 10.
to noise, font sizes and styles, and an average classifi- In the feature extraction stage, a feature vector corre-
cation accuracy of 97% was achieved when applied for sponding to pixel distributions along specified directions
classification of four script classes, viz. Latin, Devnagari, is obtained via morphological operations. The modular
Urdu and Kannada. neural network structure consists of three independently
trained feed-forward neural networks, one for each of
3.1.2 Script identification at paragraph and text-block the three scripts under consideration. The input is as-
level signed to the script class of the network which produces
The script identification methods discussed above re- maximum output. It was seen that such a system can
quire large blocks of input text so that sufficient infor- classify English and Kannada with 100% accuracy while
mation is available to bring out the characteristics of the the rate is slightly lower (97%) in recognizing Devnagari.
script. They offer good performance when used for script Script recognition using feed-forward neural network
identification at the page level, but may not retain their
performance when applied on a smaller block of text. In
multi-script documents, it is necessary to identify and
separate different script regions like paragraph, textline,
word or even character in the document page. This is
particularly important in a country like India that hosts
a variety of scripts like Devnagari, Bengali, Tamil, Tel-
ugu, Kannada, Malayalam, Gujrati, Gurumukhi, Oriya,
Manipuri, Urdu, Sindhi and Latin. In view of this, sev-
eral multi-script OCR systems involving more than one Fig. 10. Neural network-based architecture for script identification
Indian scripts in a single unit have been developed [8]. proposed by Patil and Reddy.

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

GHOSH ET AL.: SCRIPT RECOGNITION – A REVIEW 7

was also performed in [25]. The network is trained to


classify an input printed text-block into Han or Latin
directly without performing any feature extraction. The
network consists of four layers with 49 nodes in the
input layer, 15 and 20 nodes in the hidden layers, and
two nodes in the output layer that correspond to the
two script classes. The nodes in the input layer are fed
with pixel values in a block of size 7 × 7 pixels. A
number of sample blocks are randomly extracted from Fig. 11. Pal and Chaudhuri’s method for script-line separation from
the input text-block, and the script of the text-block is multi-script documents in India.
then determined by a simple majority vote among the
sampling blocks. Experiments on a number of mixed- Arabic using statistical as well as water reservoir-based
type document images showed the effectiveness of the features. Statistical features include the distribution of
proposed system, yielding 92.3% and 95% accuracy in lowermost points in the characters — the lowermost
determining Chinese and English texts, respectively. points of characters in a printed English textline lie only
A method for Arabic and Latin text-block differen- along the base-line and the bottom-line while that in
tiation in both printed and handwritten scripts was Arabic are more randomly distributed. Water reservoir-
proposed in [26]. This method is based on morphological based features give a measure of the cavity regions in a
analysis at the text-block level and geometrical analysis character. Based on all these structural characteristics the
at textline and connected component levels. Experimen- identification rates obtained were respectively 97.32%,
tal evaluation of the method was carried out on two 98.65%, 97.53%, 96.05% and 97.12% for Latin, Chinese,
different data sets containing 400 and 335 text-blocks, Arabic, Devnagari and Bengali scripts, with an overall
and the results obtained were quite promising. accuracy of 97.33%.
In an attempt to build automatic letter sorting ma- A more generalized scheme for script-line identifica-
chines for Bangladesh Post Offices, algorithm for Ben- tion in printed multi-script documents that can clas-
gali/English script identification has been developed sify as many as twelve Indian scripts, viz. Devnagari,
recently [27]. The method is designed for application Bengali, Latin, Gujrati, Kannada, Kashmiri, Malayalam,
to both machine-printed and handwritten address-blocks Oriya, Gurumukhi, Tamil, Telugu and Urdu, is available
on envelope images. The two scripts under consideration in [30]. Features chosen in the proposed method are
are recognized on the basis of the aggregate distance of headlines, horizontal projection profile, water reservoir-
the pixels in the topmost and the bottommost profiles based features, left and right profiles and feature based
of the connected components — an English text image on jump discontinuity, which refers to the maximum
has these two distance measures almost equal whereas horizontal distance between two consecutive border pix-
their difference in Bengali text image is quite large. It els in a character pattern. Experimental results show an
was observed in the experiments that the accuracy of this average script-line identification accuracy of 97.52%.
script identification method is quite high for printed text A method for discriminating Arabic text and English
(98% and 100% for English and Bengali, respectively) text using connected component analysis was proposed
and for handwritten text, the proposed approach can by Elgammal and Ismail in [31]. They tested their
achieve a satisfactory accuracy of about 95%. method on several machine-printed documents contain-
ing a mix of these two languages and achieved recog-
3.1.3 Textline-wise script identification nition rate as high as 99.7%. Features used for distin-
The earliest work we have found on textline-wise script guishing Arabic from Latin are the number of peaks
identification in Indian documents was reported by Pal and the moments in the horizontal projection profile, and
and Chaudhuri in [28]. The method uses projection the distribution of run-lengths over the location-length
profile, statistical and topological features, and stroke space. The horizontal projection profile of an Arabic
features for decision tree-based classification of printed textline generally has a single peak while that of an
Latin, Urdu, Devnagari and Bengali script-lines. Later, English textline has two major peaks. Thus, Arabic script
they proposed an automatic system for identification of can be distinguished from Latin by detecting the number
Latin, Chinese, Arabic, Devnagari and Bengali textlines of peaks in the horizontal projection profile. The other
in printed documents [29]. As depicted in Fig. 11, the features they used for discriminating Arabic and Latin
headline (‘shirorekha’) information is used first to sep- scripts are the third and fourth central moments of the
arate Devnagari and Bengali script-lines from Latin, horizontal projection profiles. Third moment measures
Chinese and Arabic script-lines. Next, Bengali script- the skew while fourth moment measures the kurtosis
lines are distinguished from Devnagari by observing that describes how flat the profile is. It is seen that the
the presence of certain script specific principal strokes. horizontal projection profile for English is more symmet-
Similarly, Chinese textlines are identified by checking ric and flat compared to that of Arabic. Therefore, the
the existence of characters with four or more vertical moments in case of English text are generally smaller
runs. Finally, Latin (English) textlines are separated from than those of Arabic text. Script classification using these

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

8 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009

3,367,200 characters were carried out and a recognition


rate of over 98.27% was obtained.
An extension of Hochberg’s work in [19] includes sep-
aration of different script regions in a machine-printed
multi-script document [34]. In this work, every textual
symbol (character, word or part of a word) in a document
is matched to a set of template symbols, as in [19], and is
Fig. 12. Elgammal and Ismail’s technique for script identification in classified to the script class of the best matching template
Arabic-English documents. symbol. It was observed that the method offers good sep-
aration in all cases except in the case of visually similar
features is done in a two-layer feedforward network. The scripts, such as Latin/Cyrillic and Latin/Greek. The best
basic steps of processing in this method are illustrated separation was observed in visually distinct script pairs
in Fig. 12. The algorithm was also applied for script like Latin/Arabic, Latin/Japanese and Latin/Korean.
identification at the word level and a recognition rate Methods that employ clustering for generating script
of 96.8% was achieved. specific prototype symbols, much like the procedure by
Script identification using character component n- Hochberg et al, were proposed in [35], [36]. In both
grams has been patented recently by Cumbee [32]. First, these methods, classification algorithms are not based
character segments extracted from training documents of on direct shape matching, as in Hochberg’s method, but
a known script are clustered using K-means clustering use matching of shape description features of connected
and then replaced by their corresponding cluster identi- components and/or characters. The shape description
fication number. Thus, every line of text is converted to a features used in [35] are the pattern spectrum coefficients
sequence of numbers. This sequence of numbers is then of every individual character in a string of isolated
analyzed to determine all the n-grams present in it and handwritten characters. During training, prototype sym-
a weight corresponding to the frequency of occurrence bols for each script class are obtained via possibilistic
is defined for each n-gram. During recognition, n-grams clustering [37]. In the recognition phase, the algorithm
are generated in a similar fashion by comparing charac- calculates the degree to which every character in a string
ter segments in the input textline to the K-means cluster belongs to each of the script classes using possibilistic
centroids of a known script. These are then compared measure defined in [37]. The character string is clas-
to the n-grams present in the training documents of sified to that script class for which the accumulated
that script. The input is subsequently scored against that possibilistic measure is maximum. The basic structure
script class by adding the weights of the best-match n- of the proposed system is shown in Fig. 13. The method
grams. The script of the input textline is determined to was tested on several artificially generated [38] strings
be the script against which it scores the highest. of handwritten numeric characters in four different
scripts, viz. Arabic, Devnagari, Bengali and Kannada,
3.1.4 Script identification at word/character level and a recognition rate as high as 96% was achieved.
Compared to the paragraph and textline level identifi- Ablavasky and Stevens reported a similar work [36], but
cations, script recognition at the word level in a multi- for machine-printed documents. The algorithm processes
script document is generally more difficult. This is be- a stream of connected components and assigns a script
cause the information available from only a few charac- label when enough evidence has been accumulated to
ters in a word may not be sufficient for the purpose. make the decision. The method uses geometric proper-
This has motivated many researchers to take up this ties like Cartesian moments and compactness for shape
challenging problem in script identification. Some have description. The likelihood of every input textual symbol
even attempted to do script identification at the character belonging to each of the script classes is calculated using
level. However, script recognition at the character level K-nearest neighbor (KNN) classification. This approach
is generally not required in practice. This is because the was shown to be quite efficient yielding 97% success
script usually changes only from one word to the next rate in discriminating similar looking Latin and Cyrillic
and not from one character to another within a word. scripts.
In one of the earliest works on script identification In another structural approach to script identification,
at the character level, Lee and Kim tried to solve the
problem using self organizing networks [33]. The net-
work is able to determine the script of every individual
character in a machine-printed multi-script document
and classify them into four groups — Latin, Chinese,
Korean and mixed. Characters in the mixed group that
cannot be classified in the network with full confidence
are classified in the next level of fine classification using
Fig. 13. Ghosh and Shivaprasad’s method of script identification for
learning vector quantization. In order to evaluate the handwritten characters/words using pattern spectrum and possibilistic
performance of the proposed scheme, experiments with measure.

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

GHOSH ET AL.: SCRIPT RECOGNITION – A REVIEW 9

stroke geometry has been utilized for script character-


ization and identification [39]. Another new approach
for identifying the script type of character images in
printed documents was proposed in [40]. Individual
character images in a document are classified either by Fig. 14. Examples of Telugu characters having tick feature.
applying prototype classification or by using support
vector machine. Both the methods were implemented word is also in Devnagari (Telugu) unless a strong clue
successfully in classifying characters into Latin, Chinese suggests otherwise. The proposed method was tested
and Japanese. extensively on several Hindi-Telugu documents with
Extraction of Arabic words from among a mix of recognition accuracies that vary in the range from 92.3%
printed Arabic-English words has gained attention in to 99.86%.
recent times [41], [42]. The method proposed in [41] The script-line identification techniques in [29], [30]
is based on recognition of Arabic characters or char- were modified in [46], [47] for script-word separation
acter segments in a word. First, a database containing in printed Indian multi-script documents by including
templates of Arabic character segments is generated some new features, in addition to the features considered
through training. A word is supposed to be Arabic if the earlier. The features used are headline feature, distribu-
percentage of matching character segments in the word tion of vertical strokes, water reservoir-based features,
exceeds a user-defined value. Otherwise, the word is shift below headline, left and right profiles, deviation
considered to be written in English (Latin). Experimental feature, loop, tick feature and left inclination feature. Tick
results showed 100% recognition accuracy on 30 text- feature refers to the distinct “tick” like structure, called
blocks containing a total of 478 words. The method telakattu, present at the top of many Telugu characters.
in [42] is also based on recognition of Arabic characters This helps in separating Telugu script from other scripts.
in the document but via feature matching. Features used Fig. 14 shows few Telugu characters having this feature.
are morphological and statistical features such as over- The overall accuracy in script word separation using
lapping and inclusion of bounding boxes, horizontal bar, this proposed set of features was about 97.92% when
low diacritics, height and width variation of connected applied to five script pairs, viz. Devnagari/Bengali,
components, etc. Recognition accuracy achieved with Bengali/Latin, Malayalam/Latin, Gujrati/Latin and Tel-
this method was 98%. ugu/Latin. Finally, based on this script-word separation
Word-wise script identification using character shape algorithm, systems for recognizing English, Devnagari
codes was proposed by Tan et al in [43] and Lu et and Urdu [48], and English and Tamil [49] have been
al in [44]. In [43], shape codes generated using basic developed in recent years. In this context, a script-
document features like elongation of bounding boxes of word discrimination system proposed by Padma and
character cells and the position of upward concavities Nagabhushan [50] also deserves mentioning. The system
are used to identify Latin, Han and Tamil in printed uses several discriminating structural features for iden-
document images. The method in [44] captures word tification and separation of Latin, Hindi and Kannada
shapes on the basis of local extremum points and hor- words in Indian multi-script documents in a manner
izontal intersections. For each script under considera- similar to the above.
tion, a word shape template is first constructed based The basic system for block-wise script identification
on a word shape coding scheme. Identification is then in [24] was modified further so as to accomplish script
accomplished using Hamming distance between the recognition at the word level. The modified system
word shape code of a query image and the previously architecture consists of a preprocessor that separates out
constructed templates. Experimental tests demonstrated individual words in a machine-printed document, fol-
99% recognition accuracy in discriminating eight Latin- lowed by a modified feature extractor and a probabilistic
based scripts/languages. neural network classifier. The probabilistic network is a
As noted earlier, multi-script document processing is two-layered structure composed of a radial basis layer
important in a multi-script country such as India. Conse- followed by a competitive layer. Experiments yielding
quently, script recognition at the word level involving In- 98.89% classification accuracy demonstrates the effective-
dian scripts is an important topic of research for the OCR ness of such a script classification system.
community. Indian scripts are in general of two types A neural network structure employing script recog-
— one that has headlines (‘shirorekha’) on top of the nition at the character level in printed documents was
characters (e.g. Devnagari, Bengali, Gurumukhi) and the presented in [51]. Script separation at the word level
other that does not carry headlines (e.g. Gujrati, Tamil, can also be achieved by combining the outputs of the
Telugu, Malayalam, Kannada). Based on this, a bilingual character level classification using Viterbi algorithm. The
OCR for printed documents was developed in [45] that algorithm was tested on five scripts commonly used
identifies Devnagari and Telugu scripts by observing the in India, namely Latin, Devnagari, Bengali, Telugu and
presence and absence of shirorekha. The classification Malayalam, and an average recognition accuracy of 97%
result is further supported with context information; if was achieved.
the previous word is Devnagari (or Telugu), the next MLP neural networks have also been employed for

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

10 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009

script identification in Indian postal automation systems


developed by Roy et al in [52], [53], [54], [55], [56].
In India, people generally tend to write addresses ei-
ther in English only or English mixed with the local
language/script. This calls for script identification at
word and character levels. In their earliest work [52],
they developed a method for locating address-block and Fig. 15. Tan’s script identification system using Gabor function-based
extracting postal code from the address. In [53], [54], a rotation invariant features.
two-stage neural network-based general classifier is used
for the recognition of postal code digits written in Arabic a distinct texture pattern. Thus, the problem of script
or Bengali numerals. Since there exist shape similarities identification essentially boils down to texture analysis
between some Arabic and Bengali numerals, the final problem and one may employ any available texture clas-
assignment of script class is done in a second stage sification algorithm to perform the task. In accordance to
using majority voting. It was noted that the accuracy this, Tan developed Gabor function-based texture analy-
of the classifier was 98.42% in printed and about 89% in sis for machine-printed script identification that yielded
handwritten post-codes. Methods for word-wise script an accuracy as high as 96.7% in discriminating printed
recognition in postal addresses using features like water Chinese, Latin, Greek, Russian, Persian and Malayalam
reservoir concept, headline (‘shirorekha’), etc. in a tree script documents [58]. In the first step of this method,
classifier was proposed in [55]. Based on this, a 2-stage a uniform text-block on which texture analysis can be
MLP network was constructed in [56] that accomplishes performed is produced from the input document image
word-wise script recognition in Indian postal addresses via method given in [59]. Texture features are then
at more than 96% accuracy. extracted from the text-block using a 16-channel Gabor
filter with channels at a fixed radial frequency of 16
3.2 Appearance-based script recognition cycles/sec and at sixteen equally spaced orientations.
Script types generally differ from each other by the shape The average response of every channel provides a char-
of individual characters, and the way they are grouped acteristic measure for the script that is robust to noise
into words, words into sentences, etc. This gives different but rotation dependent. In order to achieve invariance
scripts distinctively different visual appearances. There- to rotation, Fourier coefficients for this set of sixteen
fore, one natural way of identifying the script in which channel outputs are calculated. During classification, a
a document is written may be on the basis of its visual feature vector generated from the input text-block is
appearance as seen at a glance by a casual observer compared to the class-representative feature vectors us-
without really analyzing the character patterns in the ing weighted (variance normalized) Euclidean distance
document. Accordingly, several features that describe the measure, as depicted in Fig. 15. A representative feature
visual appearance of a script region have been proposed vector for a script class is obtained by computing the
and used for script identification by many researchers, mean feature vector obtained from a large set of training
as described below. documents written in that script.
One drawback with the above method is that the
3.2.1 Page-wise script identification methods text-blocks extracted from the input documents do not
One early attempt to characterize script of a document necessarily have uniform character spacing. In view of
without actually analyzing the structure of its constituent this, Peake and Tan extended this work in [60] where
connected components was made by Wood et al in [57]. they used some simple preprocessings to obtain uniform
They proposed to use vertical and horizontal projection text-blocks from the input printed document. These in-
profiles of document images for determining scripts in clude textline location, outsized textline removal, spac-
machine generated documents. They argued that the ing normalization and padding. Documents are also
projection profiles of document images are sufficient skew compensated so that it is not necessary to generate
to characterize different scripts. For example, Roman rotation invariant features. For the purpose of feature
script shows dominant peaks at the top and bottom extraction, gray level co-occurrence matrices (GLCM)
of the horizontal projection profile, while Cyrillic script and multi-channel Gabor filter are used in independent
has a dominant midline and Arabic script has a strong experiments. GLCMs represent pairwise joint statistics
baseline. On the other hand, Korean characters usually of the pixels in an image and have been long used
have a peak on the left of the vertical projection profile. as a means for characterizing texture [61]. In Gabor
However, the authors did not suggest how these pro- filter-based feature extraction, a 16-channel filter with
jection profiles can be analyzed automatically for script four frequencies at four orientations is used. These two
determination without any user intervention. Also, they approaches for texture feature extraction were applied
did not present any recognition result to substantiate to machine-printed documents written in seven different
their argument. scripts (adding Korean to the six scripts used earlier
Since visual appearance is often related to texture, a in [58]). Script identification was then performed using
block of text corresponding to each script class forms KNN classification. It was seen that GLCM approach

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

GHOSH ET AL.: SCRIPT RECOGNITION – A REVIEW 11

yields only 77.14% accuracy at best while Gabor filter Thus, it is possible to separate out every script from other
approach yields accuracy rate as high as 95.71%. scripts by analyzing the distribution of character pixels
One problem encountered in Gabor filter related ap- in different zones inside a document.
plications is the high computational cost due to the
frequent image filtering. In order to reduce the cost 3.2.2 Script identification at paragraph and text-block
of computation, script identification in machine-printed level
documents using steerable Gabor filters was proposed The use of texture features in script identification was
in [62]. The method offers two-fold advantages. Firstly, considered by Jain and Zhong for discriminating printed
the steerability property of Gabor filter is exploited to Chinese and English documents [65]. This paper in
reduce the high computational cost. Secondly, the Gabor fact proposed a texture-based language-free page seg-
filter bank is appropriately designed so that the extracted mentation algorithm which automatically extracts text,
rotation-invariant features can discriminate scripts con- halftone and line-drawing regions from input gray-scale
taining characters that are similar in shape and even document images. An extension of this page segmen-
share many characters. In this paper, a 98.5% recognition tation procedure provides for further segmentation of
rate was achieved in discriminating Chinese, Japanese, the text regions into different script regions. First, a
Korean and Latin scripts while the number of image set of optimal texture discrimination masks are created
filtering operations was significantly reduced by 40%. through neural network training. Next, texture features
Although the above Gabor function-based script are obtained by convolving the trained masks with the
recognition schemes have shown good performance, input image. These features are then used for classifica-
their application is limited to machine-printed docu- tion.
ments only. Variations in writing style, character size, The use of other texture features for script classifi-
and inter-line and inter-word spacings make the recog- cation, other than GLCM and Gabor energy features,
nition process difficult and unreliable when these tech- has been explored by Busch et al in [66]. The features
niques are applied directly on handwritten documents. that they used are wavelet energy features, wavelet log
Therefore, it is necessary to preprocess the document mean deviation features, wavelet co-occurrence signa-
images prior to the application of Gabor filter so as to tures, wavelet log co-occurrence features, and wavelet
compensate for the different variations present. This has scale co-occurrence signatures. They tested these fea-
been addressed in the texture-based script identification tures on a database containing eight different script
scheme proposed in [63]. In the preprocessing stage, types — Latin, Han, Japanese, Greek, Cyrillic, Hebrew,
the algorithm employs denoising, thinning, pruning, m- Devnagari and Farsi. In their experiments, machine-
connectivity, and text size normalization in sequence. printed document images of size 64×64 pixels were first
Texture features are then extracted using a multi-channel binarized, skew corrected and text-block normalized, in
Gabor filter. Finally, different scripts are classified using line with the work done by Peake and Tan in [60]. In
fuzzy classification. In this proposed system, an overall order to reduce the dimensionality of the feature vectors
accuracy of 91.6% was achieved in classifying handwrit- while improving classification accuracy, Fisher linear
ten documents written in four different scripts, namely discriminant analysis technique is applied. Classification
Latin, Devnagari, Bengali and Telugu. is performed using a GMM (Gaussian mixture model)
Another visual attribute that has been used in many classifier which models each script class as a combination
image processing applications is histogram statistics, of Gaussian distributions. The GMM classifier is trained
which reflects spatial distribution of gray levels in an using a version of the expectation maximization (EM)
image. In a recent work [64], Cheng et al proposed to use algorithm. In order to create a more stable and global
normalized histogram statistics for the purpose of script script model, a maximum a posteriori (MAP) adaptation-
identification in documents typeset in Latin, Chinese, based method was also proposed. It was seen that the
Cyrillic or Japanese. In this work, every line of text in an wavelet log co-occurrence outperforms all other texture
input document is divided into three zones — ascender features for script classification (only 1% classification
zone between top-line and x-line, x-zone between x-line error) while GLCM features yielded the worst overall
and baseline, and descender zone between baseline and performance (9.1% classification error). This indicates
bottom-line. Next, a horizontal projection is obtained that pixel relationships at small distances are insufficient
for each textline that gives zone-wise distribution of to characterize the script of a document image appropri-
character pixels in a textline. It is observed that Latin ately.
and Cyrillic characters mainly distribute in the x-zone However, a single model per script class is useful
with two significant peaks located on the x-line and the only when every script is written using only one font or
baseline. The baseline peak is higher than the x-line peak using only visually similar fonts. On the contrary, there
in Latin while they are almost equal in Cyrillic. Chinese typically exists a large number of fonts, often of widely
characters, on the other hand, have relatively random varying appearance, within a given script. Because of
distribution without any peak in the profile. Japanese such variations, it is unlikely that a model trained on
characters also have the same random distribution but one set of fonts will correctly identify an image of a
the average height of the profile is significantly lower. previously unseen font of the same script. For example,

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

12 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009

classification error increases from 1% and 9.1% in [66] to commonly used in India and an overall classification
15.9% and 13.2% in cases of wavelet log co-occurrence accuracy of 97.11% was achieved. The scripts used in-
and GLCM features, respectively. In view of this, Busch cluded Devnagari, Bengali, Tamil, Kannada, Malayalam,
proposed to characterize multiple fonts within a single Gurumukhi, Oriya, Gujrati, Urdu and Latin. Fig. 16
script more adequately by using multiple models per illustrates how these ten different Indian scripts are
script class [67]. This is done by partitioning each script classified using these features in two levels of hierarchy.
class into ten subclasses, each subclass corresponding to
one font included within that script class. This is fol- 3.2.3 Script identification at word/character level
lowed by linear discriminant analysis and classification While all the texture-based script identification methods
using the modified MAP-GMM classifier as above. Such described above work on a document page or a text-
a classification system provides significant improvement block, script identification at the word level had been
when compared to the results obtained using a single successfully implemented in [70], [71], [72], [73], [74],
model — classification error reduces to 2.1% and 12.5% [75], [76]. In the works by Ma et al [70], [71], Gabor
for the above two cases, respectively. filter analysis is applied to each word in a bilingual
Script identification in Indian printed documents us- document to extract features characterizing the script
ing oriented local energy features was performed in [68]. in which that particular word is written. Subsequently,
Local energy is defined as the sum of squared responses a 2-class classifier system is used to discriminate the
of a pair of conjugate symmetric Gabor filters. In an two different scripts contained in the input document.
earlier work, Chan et al [69] derived a set of descriptors Different classifier architectures based on SVM, KNN,
from oriented local energy and demonstrated their utility weighted Euclidean distance and GMM are considered.
in script classification. In line with human perception, A classifier system consisting of a single classifier may
the features chosen are energy distribution, the ratio of comprise of any of the above four architectures, while a
energies for two non-adjacent channels, and the horizon- multiple classifier system is built by combining two or
tal projection profile. The distribution of energy across more of them. In a multiple classifier system, the classifi-
differently oriented channels of a Gabor filter differs cation scores from each of the different component clas-
from one script to other. While this feature captures sifiers are combined using sum-rule to arrive at the final
the global differences among scripts, a closer analysis decision. In their papers, Ma et al considered bilingual
of the energy distribution may be necessary to reveal documents containing combinations of one Latin-based
finer differences between similar-looking scripts. This is language (mainly English) and one non-Latin language
provided by the ratios between energies at the output (e.g., Arabic, Chinese, Hindi or Korean). It was observed
of non-adjacent channel pairs. Finally, there are certain that while the performance for English-Hindi documents
scripts which are distinguishable only by the stroke was quite good (97.51% recognition rate using KNN
structures used in the upper part of the words. For classifier), script identification in English-Arabic docu-
example, Devnagari and Gurumukhi differ in the shape ments had the lowest performance (90.93% using SVM
of the matra present above the headline (‘shirorekha’). classifier). Moreover, it was established that multiple
Horizontal projection is used to discover this informa- classifier system can consistently outperform the single
tion. One major advantage with these features is that classifier systems (98.08% and 92.66% in case of English-
it is not necessary to perform analysis at multiple fre- Hindi and English-Arabic documents, respectively, using
quencies but at only one optimal frequency. This helps in a combination of KNN and SVM classifiers).
reducing the computational cost. Again, filter response A visual appearance-based approach has also been
can be enhanced by increasing filter bandwidth at this applied to identify and separate script-words in In-
optimal frequency. Accordingly, the filters employed dian multi-script documents. In [72], [73], two different
in [68] are log-Gabor filters designed for one empirically approaches to script identification at the word level
determined optimal frequency and at eight equi-spaced in printed bilingual (Latin and Tamil) documents are
orientations. For an input text-block of size 100 × 100 presented. The first method structures words into three
pixels, the aforementioned features are calculated and distinct spatial zones and utilizes the information about
then classified to different script classes using a KNN the spatial spread of the words in these zones. The
classifier. The scheme was tested on ten different scripts second technique analyzes the directional energy dis-
tribution of words using Gabor filters with suitable
frequencies and orientations. The algorithms are based
on the observations that: (1) the spatial spread of Roman
characters mostly covers the middle and upper zones;
only a few lower case characters spread to the lower
zone, (2) the Roman alphabet contains more vertical
and slanted strokes, (3) in Tamil, the characters mostly
spread to the upper and the lower zones, (4) there is a
Fig. 16. Classification hierarchy in Joshi et al’s script identification dominance of horizontal and vertical strokes in Tamil,
scheme. and (5) the aspect ratio of Tamil characters is generally

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

GHOSH ET AL.: SCRIPT RECOGNITION – A REVIEW 13

or absence of loop structure in the character. This is


followed by feature extraction, feature transformation
and finally nearest neighbor classification. Features that
may be extracted from a character are geometric mo-
ments, DCT coefficients or DWT coefficients. Feature
space transformation is required for dimension reduction
while enhancing class discrimination. Three methods are
Fig. 17. Dhanya et al’s two approaches to script identification in Tamil- proposed for the purpose — PCA, FLD or maximization
English documents.
of divergence. The whole process is explained pictorially
more than that of Roman characters. These suggest that in Fig. 18. The proposed scheme yielded recognition ac-
the features that may play a major role in discriminating curacies of 94% and above when tested on 20 document
Roman and Tamil script-words are the spatial spread samples, each containing a minimum of 300 characters.
In [75], a Gabor function-based multi-channel direc-
of the words and the direction of orientation of the
tional filtering approach is used for both text area sep-
structural elements of the characters in the words. The
aration and script identification at the word level. It
spatial feature is obtained by calculating zonal pixel
may be assumed that the text regions in a document are
concentration, while the directional features are available
predominantly high frequency regions. Hence, a filter-
as responses of Gabor filters. The extracted features
bank approach may be useful in discriminating text
are classified using SVM, Nearest Neighbor or KNN
regions from non-text regions. The script classification
classifiers. A block schematic diagram of the system is
system using a Gabor filter with four radial frequencies
presented in Fig. 17. It was observed that the directional
and four orientations showed a high degree of classifica-
features possess better discriminating capabilities than
tion accuracy (minimum 96.02% and maximum 99.56%)
the spatial features, yielding as high as 96% accuracy in
when applied to bilingual documents containing Hindi,
an SVM classifier. This may be attributed to the fact that
Tamil or Oriya along with English words. In an extended
Gabor filters can take into account the general nature of
version of this work [76], the method was applied to
scripts better.
documents containing three scripts and five scripts. In
Dhanya et al also attempted to recognize and sep-
this filter-bank approach to script recognition, the Gabor
arate out different script characters in printed Tamil-
filter bank uses three different radial frequencies and
Roman documents using zonal occupancy information
six different angles of orientations. For decision making,
along with some structural features [74]. For this, they
two different classifiers are considered — linear discrim-
proposed a hierarchical scheme for extracting features
inant classifier and the commonly used nearest neighbor
from characters and classify them accordingly. Based on
classifier. It was observed in several experiments that
the zonal occupancy of characters, the scheme divides
both the classifiers perform well with Gabor feature
the combined alphabet set into four groups — characters
vectors, although in some cases nearest neighbor classi-
that occupy all three zones (Group 1), characters that oc-
fier performs marginally better — the average accuracy
cupy middle and lower zones (Group 2), characters that
obtained in case of tri-script documents was 98.4% and
occupy middle and upper zones (Group 3), characters
98.7% with linear discriminant and nearest neighbor
that occupy middle zone only (Group 4). Group 3 and
classifiers, respectively. The highest recognition accuracy
Group 4 are further divided on the basis of presence
obtained was 99.7% using nearest neighbor classifier in
a bi-class problem, while the lowest attained recognition
rate was 97.3%.

3.3 Comparative analysis


Table 1 summarizes some of the benchmark work in
script recognition. Various script features used by dif-
ferent researchers are listed in this table. However, the
results they reported, although quite encouraging on
most occasions, were obtained using only a selected
number of script classes in their experiments. This leaves
a question that how these script features will perform
when applied to scripts other than those considered in
their works. Therefore, it is important to investigate the
discriminative power of each script identification feature
proposed in the literature before one may use it for the
purpose. In view of this, a comparative analysis between
different methods and script features is desirable.
Fig. 18. Stages of character classification in a printed Tamil-Latin One important structural feature for script recognition
document. used by Spitz and some others is the character optical

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

14 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009

TABLE 1
Script Recognition Methods
Method Best recog.
Researchers Features Classifier Script types classified Scope of application reported

Structure-based script recognition methods


Upward concavity distribution Var. comparison Latin, Han
Spitz [15] Optical density LDA + Eucl. Dist. Chinese, Japanese, Korean Printed Page-wise 100%
Lam, Ding, Hor. proj., height distribution Stat. classifier Latin, Oriental scripts
Suen [18] Circles, ellipses, ver. strokes Freq. of occurr. Chinese, Japanese, Korean Printed Page-wise 95%
Hochberg Hamming Dist. Arabic, Armenian, Devnag.,
Textual symbols Printed Page-wise 96%
et al [19] classifier Chinese, Cyrillic, Burmese,
Hochberg Hamming Dist. Ethiopic, Japanese, Hebrew,
et al [34] Textual symbols classifier Greek, Korean, Latin, Thai Printed Word-wise NA*
Hochberg Hor./ver. centroids, sphericity, Arabic, Chinese, Cyrillic, Hand-
et al [20] aspect ratio, white holes FLD Devnagari, Japanese, Latin written Page-wise 88%
Pal et Headline, strokes, ver. runs, Devnag., Bengali, Chinese,
Freq. of occurr. Printed Line-wise 97.33%
al [29] lowermost pt., water resv. Arabic, Latin
Elgammal Hor. proj. peak, moments,
et al [31] run-length distribution Feedforward NN Arabic, Latin Printed Line-wise 96.8%
Moalla
Arabic character segments Template match Arabic, Latin Printed Word-wise 100%
et al [41]
Jawahar 92.3% to
et al [45] Headline, context info. PCA + SVM Devnagari, Telugu Printed Word-wise 99.86%
Headline, ver. strokes, tick
Chanda left/right profiles, water resv., Freq. of occurr. Devnagari, Bengali, Latin, Printed Word-wise 97.92%
et al [47] deviation, loop, left incline. Malayalam, Gujrati, Telugu

Visual appearance-based script recognition methods


Wood et Arabic, Cyrillic, Korean,
Horizontal / vertical proj. — Printed Page-wise NA*
al [57] Latin
Jain et Texture feature using
al [65] discriminating masks MLP Latin, Chinese Printed Para-wise NA*
Gabor filter-based texture Weighted Chinese, Greek, Malayalam,
Tan [58] feature Euclid. Dist. Latin, Russian, Persian Printed Page-wise 96.7%
Peake et GLCM features Chinese, Greek, Malayalam, 77.14%
KNN Classifier Printed Page-wise
al [60] Gabor filter Latin, Russ., Persian, Korean 95.71%
Singhal Gabor filter-based texture Devnagari, Bengali, Telugu, Hand-
et al [63] feature Fuzzy Classifier Latin written Page-wise 91.6%
GLCM features 90.9%
Gabor energy 95.1%
Wavelet energy Latin, Chinese, Japanese, 95.4%
Busch et
Wavelet Log Mean Dev. LDA + GMM Cyrillic, Greek, Devnagari, Printed Para-wise 94.8%
al [66] Wavelet Co-occurrence Hebrew, Persian 98%
Wavelet Log Co-occurrence 99%
Wavelet Scale Co-occurrence 96.8%
Gabor Energy distribution, Devnag., Latin, Gurumukhi,
Joshi et horizontal projection profile, KNN Classifier Kannada, Malayalam, Urdu, Printed Para-wise 97.11%
al [68]
energy ratios Tamil, Gujrati, Oriya, Beng.
Ma et al Gabor filter-based texture KNN + SVM Latin, Devnagari 98.08%
[70], [71] feature Multi-classifier Latin, Arabic Printed Word-wise 92.66%
Dhanya Gabor filter-based directional
et al [73] feature SVM Tamil, Latin Printed Word-wise 96%

*NA: Not available – recognition result not given in terms of numeric value.

density. This is the measure of character pixels inside a result, upward concavities in a character are observed
a character bounding box which is distinctly very high at points where two or more character strokes join. Ac-
in scripts using complex ideographic characters. Struc- cordingly, ideograms composed of multiple strokes show
turally simple Arabic characters, on the other hand, are many more upward concavities per character compared
low in density. All other scripts across Europe and Asia to that in other scripts. As observed by Spitz [77], there
show more or less the same medium character density. are usually at most two or three upward concavities in a
Therefore, while this feature may be good in separating single Latin character while Han characters have many
out Han on one hand and Arabic on the other, it does not more upward concavities per character that are evenly
help much in bringing out the difference between moder- distributed along the vertical axis. However, we observe
ately complex scripts like Latin, Cyrillic, Brahmic scripts, that most other scripts also show two or three upward
etc. The second discriminating feature that Spitz used is concavities, the same as in the Latin script. So, upward
the location of upward concavities in characters. An up- concavity is good for separating Han from others but
ward concavity is formed when a run of character pixels not good for discrimination among non-Han scripts,
spans the gap between two white runs just above it. As except perhaps for Cyrillic which contains a few more

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

GHOSH ET AL.: SCRIPT RECOGNITION – A REVIEW 15

upward concavities compared to other non-Han scripts. demonstrated in [19], [34]. However, it is not difficult
Another problem with these two features is that they to realize that the classification error due to ambiguity
highly depend on document quality. Broken character will increase if the system includes script classes that use
segments may result in detection of false upward con- similar looking characters or even share many common
cavity while noise contributes to optical density measure. characters. Therefore, Hochberg’s method may not be
Non-Han documents tend to be misclassified as Han- suitable in a multi-script country like India where most
based Oriental ones if the document quality is poor, scripts have the same line of origin. Nevertheless, it
either because many characters are broken or noisy. In offers invariance to font-size and computational simplic-
order to cope with such situations, features like character ity. This is because textual symbols are size-normalized
height distribution, character bounding box profiles, hor- and the algorithm uses simple binary shape matching
izontal projections and several other statistical features without any feature value calculation.
were proposed in [16], [17], [18]. These features do not Another important feature proposed by Wood et al
depend on the document quality and resolution but on and used by many researchers is the horizontal projec-
the overall size of the connected components. However, tion. This gives a measure of the spatial spread of the
these features are not invariant to character size and font characters in a script that provides an important clue
and offer high performance only in separating distinctly to script identification. Some scripts can be identified
different Oriental scripts from other non-Han scripts. by detecting the peaks in the projection profile, e.g.
Several different structural features like character ge- Arabic scripts having a strong baseline shows peak at
ometry, occurrence of certain stroke structures and struc- the bottom of the profile while Brahmic scripts with
tural primitives, stroke orientations, measure of cavity ‘shirorekha’ show peak at the top, and so on. However,
regions, side profiles, etc. that directly relate to the this feature also is not good for separating scripts of
character shape have also been used for script character- similar nature and structure. For example, Devnagari,
ization. However, while some features show marked dif- Bengali and Gurumukhi will show the same peak in
ference between two scripts, measures of other features the profile due to ‘shirorekha’; Arabic, Urdu and Farsi
may be the same between that script pair. For example, have the same lower peak. Hence, this feature has not
while Devnagari and Gujrati can be easily identified been used alone but mostly in combination with other
using ‘shirorekha’ and water reservoir-based features, structural features.
character aspect ratio and character moments do not A better approach to script identification is via texture
show much difference. This is because many Gujrati feature extraction using multi-channel Gabor filter that
letters are exactly same as their Devnagari counterpart provides a model for human vision system. That means,
with the headline (‘shirorekha’) removed. Again, there Gabor filter offers a powerful tool to extract out visual
are features that are optimal in one script pair but not attributes from a document. This has motivated many
in another pair. For example, presence of ‘shirorekha’ researchers to employ Gabor filter for script determina-
may be a good feature for discriminating Latin and tion. Since texture feature gives the general appearance
Devnagari, but not at all useful in separating Devnagari of a script, it can be derived from any script class of
and Bengali. Therefore, in order to separate out a script any nature. Accordingly, this feature may be considered
from all other scripts, one may need to check a large a universal one. The discriminating power of a multi-
pool of structural features before any decision can be channel Gabor filter can be varied by having more
taken. This may result in the curse of dimensionality. channels with different radial frequencies and closely
So, a better option may be to do the classification using spaced orientation angles. Thus, this system is flexible
different sets of features at different levels of hierarchy, compared to all other methods and can be effectively
as proposed in some of the works above. Another option used in discriminating scripts that are quite close in
is to learn the script characteristics in a neural network, appearance. The main criticism with this approach is that
as in [25], without bothering about the features to be it cannot be applied with confidence to small text regions
used for classification. However, a larger network with as in word-wise script recognition. Also, Gabor filters are
more number of hidden units may be necessary for not capable of handling variations in script size and font,
reliable recognition as more and more script classes are inter-line spacings, etc.
included. Table 1 also lists recognition rates, as reported in
Compared to the above, Hochberg et al’s method is the literature. Since the experiments were conducted
more versatile. The method is based on discovering independently using different datasets, however, they do
frequent characters/symbols in every script class and not reflect the comparative performance of these meth-
storing them in the system database for matching during ods. To have a proper measure of their relative script
classification. Therefore, in principle, the method can separation power, these methods need to be applied on
identify any number of scripts of varied nature and font a common dataset. Script recognition performance of
as long as they are included in the training set. It is some of the above mentioned features, when applied to
possible to apply the method in a common framework a common dataset, is given in Table 2. The dataset con-
to scripts containing discrete and connected characters, tains printed documents typeset in ten different scripts,
alphabetic and non-alphabetic scripts and so on, as including six scripts used in India. In the absence of

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

16 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009

TABLE 2
Script Recognition Results (in Percentage)
Script Features Used Latin Cyrillic Arabic Urdu Chinese Korean Devnagari Bengali Gujrati Tamil
Optical Density [13] 75.4 84.6 89.1 87.2 96.3 93.7 76.2 73.4 74.0 83.8
Textual Symbol [19] 97.2 92.3 93.7 90.1 97.2 94.3 95.3 97.8 87.1 98.9
Hor. Projection Profile [23] 89.7 91.2 94.3 92.9 87.5 90.2 92.1 90.0 94.6 76.8
Gabor Coefficients [60] 95.2 92.7 97.2 94.3 93.3 89.9 95.8 91.3 87.8 96.2

any standard database, we created our own database by care of by using certain statistical features, as proposed
collecting document samples from books and magazines. in [20]. Textual symbol-based method can also be used
Some documents were also available from world-wide- but with certain modifications — some shape descriptor
web which we printed using a laser printer. All the features can be derived from the text-symbols and the
documents were scanned in black and white mode at prototypes can be generated through clustering. We
300dpi and then rescaled to have a standard textline demonstrated this approach in an earlier paper [35].
height in all documents while maintaining the character Also, a script class may be represented by multiple
aspect ratio. Script recognition was performed at the text- models to account for variation in writing from one
block level. Homogeneous text-blocks of size 256 × 256 person to another.
pixels were extracted from document pages in such Based on our discussion above, we see that script
a way that page margins and non-textual parts were features are extracted either from a list of connected
excluded. A total of 120 text-blocks were generated per components like textline, word and character in a doc-
script, each block containing 10 to 12 textlines. The print ument or from a patch of text which may be a com-
quality of the documents and hence the quality of the plete paragraph, a text-block cropped from the input
document images was reasonably good containing very document or even the whole document page. Script
little noise. identification methods that use segment-wise analysis
We observe that optical density feature is capable of character structure may hence be regarded as local
of identifying Chinese and Korean, and also Arabic approach. On the other hand, visual appearance-based
and Urdu to some extent. For other script classes, the methods that are designed to identify script by analyzing
recognition rate was well below the acceptable level. This the overall look of a text-block may be regarded as a
is because the optical density feature is not good enough global approach.
to discriminate among scripts of similar complexity. The As discussed before, many different structural features
same argument holds for other script features. The Ga- and methods for script characterization have been pro-
bor filter method shows relatively better discriminating posed over the years. In each of these methods, the
power in comparison. We noticed that the classification features were chosen keeping in view only those script
error was mainly due to misclassification between script types that were considered therein. Therefore, while
pairs like Arabic and Urdu; Chinese and Korean; Dev- these features have been proved to be efficient for script
nagari and Bengali; Devnagari and Gujrati. These pairs identification within a given set of scripts, they may not
of script classes have characters of the same nature and be good in separating a wider variety of script classes.
complexity, and even share some common characters. Again, structural features cannot effectively discriminate
This leads to ambiguity and hence the classification between scripts having similar character shapes, which
error. So, on a whole, we may say that every proposed otherwise may be distinguished by their visual appear-
script identification method and script feature works ances. Another disadvantage with structure-based meth-
well only when applied within a small set of script ods is that they require complex preprocessing involv-
classes. Classification accuracy falls significantly when ing connected component extraction. Also, extraction of
more scripts of similar nature and origin are included. structural features is highly susceptible to noise and poor
As observed in Table 1, almost all work on script quality document images. Presence of noise or signifi-
recognition is targeted toward machine-printed docu- cant image degradation adversely affect the location and
ments. They have not been tested for script recognition segmentation of these features, making them difficult or
in handwritten documents. In view of the large amount sometimes impossible to extract.
of handwritten documents that need to be processed In short, the choice of features in local approach to
electronically nowadays, script identification in hand- script classification depends on the script classes to be
written documents turns out to be an important research identified. Further, the success of classification in this ap-
issue. Unfortunately, the script features proposed for proach depends on the performance of the preprocessing
printed documents may not be always effective in case stage that includes denoising and extraction of connected
of handwritten documents. Variations in writing style, components. Ironically, document segmentation and ex-
character size, inter-line and inter-word spacings make traction of connected components sometimes require
the recognition process difficult and unreliable when the script type to be known a priori. For example, an
these techniques are applied to handwritten documents. algorithm that is good for segmenting ideograms in Han
Variation in writing across a document can be taken may not be equally effective in segmenting alphabetic

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

GHOSH ET AL.: SCRIPT RECOGNITION – A REVIEW 17

characters in the Latin script. This presents a paradox TABLE 3


in that for determining the script type it is necessary Local vs Global Approaches for Script Identification
to know the script type beforehand. In contrast, text- Local approaches Global approaches
block extraction in visual appearance-based global ap- Line, word,
character segmentation Text-block extraction
proaches is simpler and can be employed irrespective Preprocess. Complex and script Simple and script
of the document’s script. Since here it is not necessary dependent independent
to extract individual script components, such methods Page-wise, Para-wise,
Scope of Line-wise, Word-wise Page-wise, Para-wise
are better suited to degraded and noisy documents. application Limited script types Wider variety of scripts
Also, global features are of more general in nature and Sensitive to noise Less prone to noise
can be applied to a broader range of script classes. Robustness Moderately robust to Moderately robust to
They have practical importance in script-based retrieval skew, font size / type skew, font size and type
systems because they are relatively fast and reduce the
cost of document handling. Thus, visual appearance- of characters incorporated from more than one script,
based methods prove to be better than structure-based and an approach using HMM network is proposed for
script identification methods in many ways, as listed in recognizing sequences of words in multiple languages.
Table 3. However, local approach is useful in applications Viewing handwritten script as an alternating sequence of
involving textline-wise, word-wise and even character- words and inter-word ligatures, a hierarchical HMM is
wise script identification, which otherwise are generally constructed by interconnecting HMMs for ligatures and
not possible through global approach. Since local meth- words in multiple languages. These component HMMs
ods extract features from elemental structures present in are in turn modeled by a network of interconnected
a document, in principle, they can be applied at all levels character and ligature models. Thus, basic characters of
within the document. Nonetheless, some structure-based a language, language network, and intermixed use of
methods demand a minimum size of the text to arrive language are modeled with hierarchical relations. Given
at some conclusive decision. For example, Spitz’s two- such a construction, recognition corresponds to finding
stage script classifier [15] requires at least two lines of the optimal path in the network using the Viterbi algo-
text in the first level of classification and at least six rithm. This approach can be used for recognizing freely
lines in the second stage. Likewise, at least fifty textual handwritten text in more than one language and can be
symbols need to be verified for acceptable classification applied to any combination of phonetic writing systems.
in [19]. The same applies to methods in which the script Results of word recognition tests showed that Hangul
class decision is based on statistics taken across the input words can be recognized with about 92% accuracy while
document. We also note that methods developed for English words can be recognized correctly only 84%
page-wise script identification can also be used for script of the time. It was also observed that by combining
recognition in a paragraph or a text-block as long as multiple languages, recognition accuracy drops negligi-
the document size is big enough to provide necessary bly but speed is slowed substantially. Therefore, more
information. powerful search method and machine are needed to use
this technique in practice.
The basic principle behind online character recognition
4 O NLINE S CRIPT R ECOGNITION is to capture the temporal sequence of strokes. A stroke
The script identification techniques described earlier are is defined as the locus of tip of the pen from pen-
for off-line script recognition and are in general not down to the next pen-up position. For script recognition,
applicable to online data. With the advancement of pen therefore, it may be useful to check the writing style as-
computing technology in the last few decades, many sociated with each script class. For example, Arabic and
online document analysis systems have been developed Hebrew scripts are written from right to left, Devnagari
in which it is necessary to interpret the written text as it script is characterized by the presence of ‘shirorekha’, a
is input by analyzing the spatial and the temporal nature Han character is composed of several short strokes, and
of the movement of the pen. Therefore, as in the case of so on. An online system can capture such information
OCR systems for off-line data, an online character rec- and be used for script identification. In [80], Namboodiri
ognizer in a multi-script environment must be preceded and Jain proposed nine measures that may be used to
by an online script recognition system. Unfortunately, quantify the characteristic writing style of every script.
in comparison to off-line script recognition, not much They are – (1) horizontal inter-stroke direction defining
effort has been dedicated toward the development of the direction of writing within a textline, (2) average
online script recognition techniques. As of today, only stroke length, (3) ’shirorekha’ strength, (4) ’shirorekha’
few methods are available for online script recognition, confidence, (5) stroke density, (6) aspect ratio, (7) reverse
as described below. direction defined as the distance by which the pen moves
One of the earliest works on online script recog- in the direction opposite to the normal writing direction,
nition was reported in [78] by Lee et al. Later they (8) average horizontal stroke direction, and (9) average
extended their work in [79]. Their method is based on vertical stroke direction. Their proposed classification
the construction of a unified recognizer for the entire set system, based on the above spatial and temporal fea-

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

18 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009

tures of the strokes, attained classification accuracies in The OCR evaluation approaches are broadly classified
between 86.5% to 95% in different experimental tests. into two categories: black-box evaluation and white-
Later, they added two more features in [81], viz. vertical box evaluation. In black-box evaluation, only the input
inter-stroke direction and variance of stroke length, and and output are visible to the evaluator. In white-box
achieved around 0.6% improvement in the classification evaluation procedure, outputs of different modules com-
accuracy. prising the system may be accessed and the total system
A unified syntactic approach to online script recogni- is evaluated stage by stage. Nevertheless, the primary
tion was presented in [82] and was applied for classify- issues related to both types of evaluation are recognition
ing Latin, Devnagari and Kanji scripts by analyzing their accuracy and processing speed. The parameters that can
characteristic properties that include the fuzzy linguistic be varied for the purpose of evaluation are content,
descriptors to describe the character features. The fuzzy font size and style, print and paper quality, scanning
pattern description language FOHDEL (Fuzzy Online resolution, and the amount of noise and degradation in
Handwriting Description Language) is used to store the document images.
fuzzy feature values for every character of a script class Needless to say, the overall performance of a multi-
in the form of fuzzy rules. For example, the character “b” script OCR greatly depends on the performance of the
in the Roman alphabet may be described as comprising script recognition algorithm used in the system. As with
of two fuzzy linguistic terms – very straight vertical line any OCR system, the efficiency of a script recognizer
at the beginning followed by an almost circular curve at is mainly assessed on the basis of accuracy and speed.
the end. These fuzzy rules aid in decision making during Another important performance criterion is the min-
classification. imum size of the document necessary for the script
recognizer to perform reliably. This is to measure how
5 S CRIPT R ECOGNITION IN V IDEO T EXT the recognizer performs with varying document size.
Script identification is not only important for docu- In a multi-script system, another issue of considera-
ment analysis but also for text recognition in images tion is the writing system adopted by a script, script
and videos. Text recognition in images and videos is complexity and the size of the character set. Since some
important in the context of image/video indexing and scripts are simple in nature and some are quite complex,
retrieval. The process includes several preprocessing a relative comparison of performance across scripts is a
steps like text detection, text localization, text segmen- difficult task. For example, Latin is generally simpler in
tation and binarization before an OCR algorithm may structure and is based on alphabetic system. A script
be applied. As with documents in multi-script envi- identifier that is good in recognizing Latin scripts may
ronment, image/video text recognition in international not be so in case of complex non-alphabetic scripts
environment also requires script identification in order like Arabic, Han and Devnagari. Therefore, in order to
to apply suitable algorithm for text extraction and recog- evaluate various systems, a standard set of data should
nition. In view of this, an approach for discriminating be used so that the evaluation is unbiased. However,
between Latin and Han script was developed in [83]. it is generally difficult to find document data-sets in
The proposed approach proceeds as follows. First, the different languages/scripts that are similar in content
text present in an image or video frame is localized and layout. To address this problem, Kanungo et al
and size normalized. Then, a set of low-level features introduced the Bible as a data-set for evaluating mul-
is extracted from the edges detected inside the text tilingual and multi-script OCR performance [85]. Bible
region. This includes mean and standard deviation of translations are closely parallel in structure, relevant
edge pixels, edge pixel density, energy of edge pixels, with respect to modern day language, widely available
horizontal projection, and Cartesian moments of the and inexpensive. These make the Bible attractive for
edge pixels. Finally, based on the extracted features, the controlling document content while varying language
decision about the type of the script is made using a and script. The document layout can also be controlled
KNN classifier. Experimental results have demonstrated by using synthetically generated page image data. Other
the efficiency of the proposed method by identifying holy books whose translation have similar properties,
Latin and Han scripts accurately at the rate of 85.5% like the Quran and the Bhagavad-Gita, have also been
and 89%, respectively. suggested by some researchers.
One major concern with most of the reported works
in script recognition is the lack of any comparative anal-
6 I SSUES IN MULTI - SCRIPT OCR SYSTEM ysis of the results. Experimental results given for every
EVALUATION proposed method have not been compared with other
In connection with research in script recognition, it benchmark works in the field. Moreover, the datasets
is useful and important to develop benchmarks and used in experiments are all different. This is mainly due
methodologies that may be employed to evaluate the to the lack of availability of a standard database for
performance of multi-script OCR systems. Some aspects script recognition research. Consequently, it is hard to
of this problem have been reported in [84], and are assess the results reported in the literature. Hence, a
discussed below. standard evaluation test-bed containing documents writ-

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

GHOSH ET AL.: SCRIPT RECOGNITION – A REVIEW 19

ten in only one script type as well as multi-script doc- be not wrong to say that script recognition in handwrit-
uments with mix of different scripts within a document ten documents is still in its early stage of research. Since
is necessary. One important consideration in selecting the present thrust in OCR research is in handwritten
the data-set for a script class is that it should reflect document analysis, parallel research on script identifi-
the global probability of occurrence of the characters in cation in handwritten documents is in demand. Also,
texts written in that particular script. Another problem of not many of these script recognition techniques have
concern is for languages that constantly undergo spelling addressed font variation within a script class. Hence,
modifications and graphemic changes over the years. As we can conclude that script recognition technology still
a result, if an old document is chosen as the corpus, has a way to go, especially for handwritten document
then it may not be suitable for evaluating a modern analysis. Therefore, there is an urgent need to work
OCR system. On the other hand, a database of modern on script recognition of handwritten documents and in
documents may not be useful if the goal of the OCR developing font independent script recognizers.
is to process historic documents. This suggests that the As is evident from our analysis, development in script
data-set should include all different forms of the same recognition technology lacks a generalized approach to
language that evolved with time, with a full coverage of the problem which can handle all different types of
script alphabet of different languages and it should be scripts under a common framework. While a particular
large enough to reflect the statistical occurrence proba- script feature proves to be efficient within a set of scripts,
bility of the characters. it may not be useful in other scripts. To some extent,
texture features can be used universally but cannot be
applied reliably at word and character levels within a
7 C ONCLUSION document.
This paper presents a comprehensive survey on the Finally, we need to create a standard data-set for
developments in script recognition technology which is research in this field. This is necessary to evaluate dif-
an important issue in OCR research in our multilin- ferent script recognition methodologies under the same
gual multi-script world. Researchers have attempted to conditions. The creation of standard data resources will
characterize different scripts either by extracting their undoubtedly provide a much needed resource to re-
structural features or by deriving some visual attributes. searchers working in this field.
Accordingly, many different script features have been
proposed over the years for script identification at dif-
ferent levels within a document – page-wise, paragraph-
R EFERENCES
wise, textline-wise, word-wise and even character-wise. [1] C.Y. Suen, M. Berthod, and S. Mori, “Automatic Recognition of
Handprinted Characters – The State of The Art,” Proc. IEEE, vol.
Textline-wise and word-wise script identifications are 68, no. 4, pp. 469-487, Apr. 1980.
particularly important for use in a multi-script docu- [2] J. Mantas, “An Overview of Character Recognition Methodolo-
ment. However, compared to the large arsenal of liter- gies,” Pattern Recognition, vol. 19, no. 6, pp. 425-430, 1986.
ature available in the field of document analysis and [3] V.K. Govindan and A.P. Shivaprasad, “Character Recognition – A
Review,” Pattern Recognition, vol. 23, no. 7, pp. 671-683, 1990.
optical character recognition, the volume of work on [4] S. Mori, C.Y. Suen, and K. Yamamoto, “Historical Review of OCR
script identification is relatively thin. The main reason is Research and Development,” Proc. IEEE, vol. 80, no. 7, pp. 1029-
that most research in the area of OCR has been directed 1058, Jul. 1992.
[5] H. Bunke and P.S.P. Wang (eds.), Handbook of Character Recognition
at solving issues within the scope of the country where and Document Image Analysis, World Scientific Publishing, Singa-
the research is conducted. Since most countries in the pore, 1997.
world use only one language/script, OCR research in [6] N. Nagy, “Twenty Years of Document Image Analysis in PAMI,”
IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 22, no.
these countries need not bother determining the script 1, pp. 38-62, Jan. 2000.
in which a document is written. For instance, the US [7] U. Pal, “Automatic Script Identification: A Survey,” J. Vivek, vol.
postal department had spent a lot in developing system 16, no. 3, pp. 26-35, 2006.
[8] U. Pal and B.B. Chaudhuri, “Indian Script Character Recognition:
for automatic reading of postal addresses, but under the A Survey,” Pattern Recognition, vol. 37, no. 9, pp. 1887-1899, Sep.
assumption that all letters originating or arriving in US 2004.
will carry address written in English only. Script recog- [9] L. Peng, C. Liu, X. Ding, and H. Wang, “Multilingual Document
Recognition Research and Its Application in China,” Proc. Int’l
nition is important only in an international environment Conf. Document Image Analysis for Libraries, Lyon, pp. 126-132, Apr.
or in a country that uses more than one script. 2006.
Nonetheless, with recent economic globalization and [10] A. Nakanishi, Writing Systems of the World: Alphabets, Syllabaries,
Pictograms, Charles E. Tuttle Co., Tokyo, 1980.
increased business transactions across the globe, there [11] F. Coulmas, The Blackwell Encyclopedia of Writing Systems, Black-
had been increased awareness of automatic script recog- well Publishers, Oxford, 1996.
nition among the OCR community. That is why majority [12] C. Ronse and P.A. Devijver, Connected Components in Binary Images:
The Detection Problem, John Wiley & Sons, New York, 1984.
of the reported works are dated only during the last [13] A.L. Spitz, “Multilingual Document Recognition,” Proc. Int’l Conf.
one decade. However, it is noted that most of these Electronic Publishing, Document Manipulation & Typography, Mary-
script recognition methods have been tested on machine- land, pp. 193-206, Sep. 1990.
[14] A.L. Spitz and M. Ozaki, “Palace: A Multilingual Document
printed documents only, and their performance in hand- Recognition System,” Proc. IAPR Workshop Document Analysis
written documents is not known. In view of this, it will Systems, Kaiserslautern, pp. 16-37, Oct. 1994.

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

20 IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, VOL. XX, NO. YY, MONTH 2009

[15] A.L. Spitz, “Determination of The Script and Language Content Proc. Int’l Conf. Document Analysis & Recognition, Edinburgh, pp.
of Document Images,” IEEE Trans. Pattern Analysis & Machine 750-754, Aug. 2003.
Intelligence, vol. 19, no. 3, pp. 235-245, Mar. 1997. [37] R. Krishnapuram and J.M. Keller, “A Possihilistic Approach to
[16] D.-S. Lee, C.R. Nohl, and H.S. Baird, “Language Identification in Clustering,” IEEE Trans. Fuzzy Systems, vol. 1, no. 2, pp. 98-110,
Complex, Unoriented, and Degraded Document Images,” Proc. May 1993.
IAPR Workshop Document Analysis Systems, Malvern, pp. 76-98, [38] D. Ghosh and A.P. Shivaprasad, “An Analytic Approach for
Oct. 1996. Generation of Artificial Handprinted Character Database from
[17] B. Waked, S. Bergler, C.Y. Suen, and S. Khoury, “Skew Detection, Given Generative Models,” Pattern Recognition, vol. 32, no. 6, pp.
Page Segmentation and Script Classification of Printed Document 907-920, Jun. 1999.
Images,” Proc. IEEE Int’l Conf. Systems, Man & Cybernetics, San [39] D.W. Muir and T. Thomas, “Automatic Language Identification
Diego, vol. 5, pp. 4470-4475, Oct. 1998. by Stroke Geometry Analysis,” U.S. Patent No. 6064767, 16 May
[18] L. Lam, J. Ding, and C.Y. Suen, “Differentiating Between Oriental 2000.
and European Scripts by Statistical Features,” Int’l J. Pattern [40] Y.-H. Liu, C.-C. Lin, and F. Chang, “Language Identification of
Recognition & Artificial Intelligence, vol. 12, no. 1, pp. 63-79, Feb. Character Images Using Machine Learning Techniques,” Proc. Int’l
1998. Conf. Document Analysis & Recognition, Seoul, vol. 2, pp. 630-634,
[19] J. Hochberg, P. Kelly, T. Thomas, and L. Kerns, “Automatic Aug.-Sep. 2005.
Script Identification from Document Images Using Cluster-based [41] I. Moalla, A. Elbaati, A.M. Alimi, and A. Benhamadou, “Ex-
Templates,” IEEE Trans. Pattern Analysis & Machine Intelligence, traction of Arabic Text from Multilingual Documents,” Proc.
vol. 19, no. 2, pp. 176-181, Feb. 1997. IEEE Int’l Conf. Systems, Man & Cybernetics, Yasmine Hammamet,
[20] J. Hochberg, K. Bowers, M. Cannon, and P. Kelly, “Script and Oct. 2002, http://ieeexplore.ieee.org/iel5/8325/26298/01173266.
Language Identification for Handwritten Document Images,” Int’l pdf?arnumber=1173266.
J. Document Analysis & Recognition, vol. 2, no. 2/3, pp. 45-52, Dec. [42] I. Moalla, A.M. Alimi, and A. Benhamadou, “Extraction of Ara-
1999. bic Words from Multilingual Documents,” Proc. Conf. Artificial
[21] Y. Tho and Y.Y. Tang, “Discrimination of Oriental and Euramer- Intelligence & Soft Computing, Marbella, Sep. 2004, http://www.
ican Scripts Using Fractal Feature,” in Proc. Int’l Conf. Document actapress.com/PDFViewer.aspx?paperId=18567.
Analysis & Recognition, Seattle, pp. 1115-1119, Sep. 2001. [43] C.L. Tan, P.Y. Leong, and S. He, “Language Identification in Multi-
[22] B.V. Dhandra, P. Nagabhushan, M. Hangarge, R. Hegadi, and lingual Documents,” Proc. Int’l Symp. Intelligent Multimedia &
V.S. Malemath, “Script Identification Based on Morphological Distance Education, Baden-Baden, pp. 59-64, Aug. 1999.
Reconstruction in Document Images,” Proc. IEEE Int’l Conf. Pattern [44] S. Lu, C.L. Tan, and W. Huang, “Language Identification in
Recognition, Hong Kong, vol. 2, pp. 950-953, Aug. 2006. Degraded and Distorted Document Images,” Lecture Notes in Com-
[23] S. Chaudhury and R. Sheth, “Trainable Script Identification Strate- puter Science: Int’l Workshop Document Analysis Systems, Nelson,
gies for Indian Languages,” Proc. Int’l Conf. Document Analysis & LNCS-3872, pp. 232-242, Feb. 2006.
Recognition, Bangalore, pp. 657-660, Sep. 1999. [45] C.V. Jawahar, M.N.S.S.K. Pavan Kumar, and S.S. Ravi Kiran, “A
[24] S.B. Patil and N.V. Subbareddy, “Neural Network Based System Bilingual OCR for Hindi–Telugu Documents and Its Applica-
for Script Identification in Indian Documents,” Sadhana, vol. 27, tions,” Proc. Int’l Conf. Document Analysis & Recognition, Edin-
part 1, pp. 83-97, Feb. 2002. burgh, pp. 408-412, Aug. 2003.
[25] Z. Chi, Q. Wang, and W.-C. Siu, “Hierarchical Content Classifica- [46] S. Sinha, U. Pal, and B.B. Chaudhuri, “Word-wise Script Identifi-
tion and Script Determination for Automatic Document Image cation from Indian Documents,” Lecture Notes in Computer Science:
Processing,” Pattern Recognition, vol. 36, no. 11, pp. 2483-2500, IAPR Int’l Workshop Document Analysis Systems, Florence, LNCS-
Nov. 2003. 3163, pp. 310-321, Sep. 2004.
[26] S. Kanoun, A. Ennaji, Y. Lecourtier, and A.M. Alimi, “Script and [47] S. Chanda. S. Sinha, and U. Pal, “Word-wise English Devnagari
Nature Differentiation for Arabic and Latin Text Images,” Proc. and Oriya Script Identification,” Speech and Language Systems for
Int’l Workshop Frontiers in Handwriting Recognition, Niagra, pp. Human Communication, R.M.K. Sinha and V.N. Shukla (eds.), Tata
309-313, Aug. 2002. McGraw-Hill, New Delhi, pp. 244-248, 2004.
[27] L. Zhou, Y. Lu, and C.L. Tan, “Bangla/English Script Identification [48] S. Chanda and U. Pal, “English, Devnagari and Urdu Text Iden-
Based on Analysis of Connected Component Profiles,” Lecture tification,” Proc. Int’l Conf. Cognition & Recognition, Mandya, pp.
Notes in Computer Science: Int’l Workshop Document Analysis Sys- 538-545, Dec. 2005.
tems, Nelson, LNCS-3872, pp. 243-254, Feb. 2006. [49] S. Chanda, R.K. Roy, and U. Pal, “English and Tamil Text Iden-
[28] U. Pal and B.B. Chaudhuri, “Script Line Separation from Indian tification,” Proc. Nat’l Conf. Recent Trends in Information Systems,
Multi-script Documents,” Proc. Int’l Conf. Document Analysis & Kolkata, pp. 184-187, Jul. 2006.
Recognition, Bangalore, pp. 406-409, Sep. 1999. [50] M.C. Padma and P. Nagabhushan, “Identification and Separation
[29] U. Pal and B.B. Chaudhuri, “Identification of Different Script of Text Words of Kannada, Hindi and English Languages Through
Lines from Multi-script Documents,” Image & Vision Computing, Discriminating Features,” Proc. Nat’l Conf. Document Analysis &
vol. 20, no. 13-14, pp. 945-954, Dec. 2002. Recognition, Mandya, pp. 252-260, Jul. 2003.
[30] U. Pal, S. Sinha, and B.B. Chaudhuri, “Multi-script Line Identifica- [51] R. Kumar, V. Chaitanya, and C.V. Jawahar, “A Novel Approach to
tion from Indian Documents,” Proc. Int’l Conf. Document Analysis Script Separation,” Proc. Int’l Conf. Advances in Pattern Recognition,
& Recognition, Edinburgh, pp. 880-884, Aug. 2003. Kolkata, pp. 289-292, Dec. 2003.
[31] A. Elgammal and M.A. Ismail, “Techniques for Language Identi- [52] K. Roy, U. Pal, and B.B. Chaudhuri, “Address Block Location
fication for Hybrid Arabic-English Document Images,” Proc. Int’l and Pin Code Recognition for Indian Postal Automation,” Proc.
Conf. Document Analysis & Recognition, Seattle, pp. 1100-1104, Sep. Workshop Computer Vision, Graphics & Image Processing, Gwalior,
2001. pp. 5-9, Feb. 2004.
[32] C.S. Cumbee, “Method of Identifying Script of Line of Text,” U.S. [53] K. Roy, S. Vajda, U. Pal, B.B. Chaudhuri, and A. Belaid, “A System
Patent No. 7020338, 28 Mar. 2006. for Indian Postal Automation,” Proc. Int’l Conf. Document Analysis
[33] S.-W. Lee and J.-S. Kim, “Multi-lingual, Multi-font, Multi-size & Recognition, Seoul, vol. 2, pp. 1060-1064, Aug.-Sep. 2005.
Large-set Character Recognition Using Self-organizing Neural [54] K. Roy, D. Pal, and U. Pal, “Pin-code Extraction and Recognition
Network,” Proc. Int’l Conf. Document Analysis & Recognition, Mon- for Indian Postal Automation,” Proc. Nat’l Conf. Recent Trends in
treal, vol. 1, pp. 28-33, Aug. 1995. Information Systems, Kolkata, pp. 192-195, Jul. 2006.
[34] J. Hochberg, M. Cannon, P. Kelly, and J. White, “Page Segmenta- [55] K. Roy and U. Pal, “Word-wise Hand-written Script Separation
tion Using Script Identification Vectors: A First Look,” Proc. Symp. for Indian Postal Automation,” Proc. Int’l Workshop Frontiers in
Document Image Understanding Technology, Annapolis, pp. 258-264, Handwriting Recognition, La Baule, pp. 521-526, Oct. 2006.
Apr.-May 1997. [56] K. Roy, U. Pal, and B.B. Chaudhuri, “Neural Network Based
[35] D. Ghosh and A.P. Shivaprasad, “Handwritten Script Identifica- Word-wise Handwritten Script Identification System for Indian
tion Using Possibilistic Approach for Cluster Analysis,” J. Indian Postal Automation,” Proc. Int’l Conf. Intelligent Sensing & Informa-
Inst. of Science, vol. 80, pp. 215-224, May-Jun. 2000. tion Processing, Chennai, pp. 240-245, Jan. 2005.
[36] V. Ablavsky and M.R. Stevens, “Automatic Feature Selection with [57] S.L. Wood, X. Yao, K. Krishnamurthi, and L. Dang, “Language
Applications to Script Identification of Degraded Documents,” Identification for Printed Text Independent of Segmentation,”

Authorized licensed use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.

GHOSH ET AL.: SCRIPT RECOGNITION – A REVIEW 21

Proc. Int’l Conf. Image Processing, Washington D.C., vol. 3, pp. 428- [81] A.M. Namboodiri and A.K. Jain, “Online Handwritten Script
431, Oct. 1995. Recognition,” IEEE Trans. Pattern Analysis & Machine Intelligence,
[58] T.N. Tan, “Rotation Invariant Texture Features and Their Use in vol. 26, no. 1, pp. 124-130, Jan. 2004.
Automatic Script Identification,” IEEE Trans. Pattern Analysis & [82] A. Malaviya and L. Peters, “Fuzzy Handwriting Description
Machine Intelligence, vol. 20, no. 7, pp. 751-756, Jul. 1998. Language: FOHDEL,” Pattern Recognition, vol. 33, no. 1, pp. 119-
[59] L. O’Gorman and R. Kasturi, Document Image Analysis, IEEE-CS 131, Jan. 2000.
Press, Los Alamitos, 1995. [83] J. Gllavata and B. Freisleben, “Script Recognition in Images with
[60] G.S. Peake and T.N. Tan, “Script and Language Identification from Complex Backgrounds,” Proc. IEEE Int’l Symp. Signal Processing &
Document Images,” Lecture Notes in Computer Science: Asian Conf. Information Technology, Athens, pp. 589-594, Dec. 2005.
Computer Vision, Hong Kong, LNCS-1352, pp. 97-104, Jan. 1998. [84] B.B. Chaudhuri, “On Multi-script OCR System Evaluation,”
[61] R.M. Haralick, K. Shanmugam, and I. Dinstein, “Textural Features Int’l Workshop Performance Evaluation Issues in Multi-lingual OCR,
for Image Classification,” IEEE Trans. Systems, Man, & Cybernetics, Bangalore, Sep. 1999, http://www.kanungo.com/workshop/
vol. 3, no. 6, pp. 610-621, Nov. 1973. abstracts/chaudhuri.html.
[62] W.M. Pan, C.Y. Suen, and T.D. Bui, “Script Identification Using [85] T. Kanungo, P. Resnik, S. Mao, D.-W. Kim, and Q. Zheng, “The
Steerable Gabor Filters,” Proc. Int’l Conf. Document Analysis and Bible and Multilingual Optical Character Recognition,” Commu-
Recognition, Seoul, vol. 2, pp. 883-887, Aug.-Sep. 2005. nications of the ACM , vol. 48, no. 6, pp. 124-130, Jun. 2005.
[63] V. Singhal, N. Navin, and D. Ghosh, “Script-based Classification
of Hand-written Text documents in A Multilingual Environment,”
Proc. Int’l Workshop Research Issues in Data Engineering – Multi-
lingual Information Management, Hyderabad, pp. 47-54, Mar. 2003.
[64] J. Cheng, X. Ping, G. Zhou, and Y. Yang, “Script Identification of D. Ghosh received the B.E degree in electronics
Document Image Analysis,” Proc. Int’l Conf. Innovative Computing, & communication engineering from M.R. Engi-
Information and Control, Beijing, vol. 3, pp. 178-181, Aug.-Sep. 2006. neering College, Jaipur, India, in 1993, and the
[65] A.K. Jain and Y. Zhong, “Page Segmentation Using Texture Anal- M.S. and Ph.D degrees in electrical communi-
ysis,” Pattern Recognition, vol. 29, no. 5, pp. 743-770, May 1996. cation engineering from the Indian Institute of
[66] A. Busch, W.W. Boles, and S. Sridharan, “Texture for Script Science, Bangalore, India, in 1996 and 2000,
Identification,” IEEE Trans. Pattern Analysis & Machine Intelligence, respectively. From April 1999 to November 1999,
vol. 27, no. 11, pp. 1720-1732, Nov. 2005. he was a DAAD Research Fellow at the Univer-
[67] A. Busch, “Multi-font Script Identification Using Texture-based sity of Kaiserslautern, Germany. In November
Features,” Lecture Notes in Computer Science: Int’l Conf. Image 1999, he joined the Indian Institute of Technol-
Analysis & Recognition, Póvoa de Varzim, LNCS-4142, pp. 844-852, ogy Guwahati, India, as an Assistant Professor
Sep. 2006. of electronics and communication engineering. He spent the 2003-2004
[68] G.D. Joshi, S. Garg, and J. Sivaswamy, “Script Identification from academic year as a visiting faculty in the Department of Electrical &
Indian Documents,” Lecture Notes in Computer Science: IAPR Int’l Computer Engineering, National University of Singapore. Between 2006
Workshop Document Analysis Systems, Nelson, LNCS-3872, pp. 255- and 2008, he was a Senior Lecturer with the Faculty of Engineering
267, Feb. 2006. and Technology, Multimedia University, Malaysia. He is currently an
[69] W. Chan and G.G. Coghill, “Text Analysis Using Local Energy,” Associate Professor with the Department of Electronics & Computer
Pattern Recognition, vol. 34, no. 12, pp. 2523-2532, Dec. 2001. Engineering, Indian Institute of Technology Roorkee, India. His teaching
[70] H. Ma and D. Doermann, “Gabor Filter Based Multi-class Clas- and research interests include image/video processing, computer vision
sifier for Scanned Document Images,” Proc. Int’l Conf. Document and pattern recognition.
Analysis & Recognition, Edinburgh, pp. 968-972, Aug. 2003.
[71] S. Jaeger, H. Ma, and D. Doermann, “Identifying Script on Word-
level with Informational Confidence,” Proc. Int’l Conf. Document
Analysis & Recognition, Seoul, vol. 1, pp. 416-420, Aug.-Sep. 2005.
[72] D. Dhanya, A.G. Ramkrishnan, and P.B. Pati, “Script Identification T. Dube received the B.Tech. degree in elec-
in Printed Bilingual Documents,” Sadhana, vol. 27, part 1, pp. 73- tronics and communication engineering from the
82, Feb. 2002. Indian Institute of Technology Guwahati, India, in
[73] D. Dhanya and A.G. Ramkrishnan, “Script Identification in 2006. Soon after her graduation, she joined the
Printed Bilingual Documents,” Lecture Notes in Computer Science: Indian division of British Telecom at Bangalore,
IAPR Int’l Workshop Document Analysis Systems, Princeton, LNCS- and later moved to Ibibo Web Pvt. Ltd., Gurgaon,
2423, pp. 13-24, Aug. 2002. India, as a Software Engineer. Between 2007
[74] D. Dhanya and A.G. Ramkrishnan, “Optimal Feature Extraction and 2009, she worked as a Senior Software En-
for Bilingual OCR,” Lecture Notes in Computer Science: IAPR Int’l gineer with Infovedics Software Pvt. Ltd., Noida,
Workshop Document Analysis Systems, Princeton, LNCS-2423, pp. India. She received a search developer certifica-
25-36, Aug. 2002. tion from FAST University, Norway, in 2007. She
[75] P.B. Pati, S. Sabari Raju, N. Pati, and A.G. Ramakrishnan, “Gabor is currently pursuing a management degree from the Indian Institute of
Filters for Document Analysis in Indian Bilingual Documents,” Management, Ahmedabad, India.
Proc. Int’l Conf. Intelligent Sensing and Information Processing, Chen-
nai, pp. 123-126, Jan. 2004.
[76] P.B. Pati and A.G. Ramakrishnan, “HVS Inspired System for Script
Identification in Indian Multi-script Documents,” Lecture Notes
in Computer Science: Int’l Workshop Document Analysis Systems,
Nelson, LNCS-3872, pp. 380-389, Feb. 2006. A.P. Shivaprasad received the B.E., M.E. and
Ph.D. degrees in electrical communications en-
[77] A.L. Spitz, “Script and Language Determination from Document
gineering from the Indian Institute of Science,
Images,” Proc. Annual Symp. Document Analysis & Information
Bangalore, India, in 1965, 1967 and 1972, re-
Retrieval, Las Vegas, pp. 229-235, Apr. 1994.
spectively. Since 1967, he was a member of the
[78] J.J. Lee, B.K. Sin, and J.H. Kim, “On-line Mixed Character Recog-
academic staff of the Department of Electrical
nition Using An HMM Network,” Proc. KISS Annual Conf., vol.
Communication Engineering, Indian Institute of
20, no. 2, pp. 317-320, Oct. 1993.
Science, Bangalore, India, till he retired as a Pro-
[79] J.J. Lee, J.H. Kim, and M. Nakajima, “A Hierarchical HMM
fessor in 2006. He is currently a Guest Professor
Network-based Approach for On-line Recognition of Multi-
with the Department of Electronics & Communi-
lingual Cursive Handwritings,” IEICE Trans. Information & Sys-
cation Engineering, Sambhram Institute of Tech-
tems, vol. E81-D, no. 8, pp. 881-888, Aug. 1998.
nology, Bangalore, India. His research interests include design of micro-
[80] A.M. Namboodiri and A.K. Jain, “Online Script Recognition,” power VLSI circuits, intelligent instrumentation, communication systems
Proc. Int’l Conf. Pattern Recognition, Quebec, vol. 3, pp. 736-739, and pattern recognition.
Aug. 2002.

Authorized
View publicationlicensed
stats use limited to: UNIVERSIDADE ESTADUAL DE CAMPINAS. Downloaded on August 09,2010 at 14:29:09 UTC from IEEE Xplore. Restrictions apply.

You might also like