Post-Processing Methodology For Word Level Telugu Character Recognition Systems Using Unicode Approximation Models
Abstract— Digitization and automatic interpretation of document images into an editable document format is the primary goal of optical character recognition (OCR) systems. This paper proposes a novel technique for resolving the post-processing errors that occur in Telugu OCR using word-level Unicode Approximation Models (UAM) through a mapper module. The mapper module performs a word-level one-to-one mapping that assigns a sequence of recognized class labels to the appropriate UAM. Each sequence of recognized class labels corresponds to one particular word and is generated as output by the classifier. The proposed algorithm effectively resolves segmentation errors, preprocessing errors such as cuts and merges in characters, noise, occlusions, semantic-ordering errors and confusing character classes. The proposed UAM models provide adequate and consistent accuracies of around 96.2% for printed words and 91.7% for handwritten words.

Keywords— Optical Character Recognition, South Indian languages, UAMs, Telugu language, Post-Processing, vowel-consonant clusters.

I. INTRODUCTION

The interpretation of South Indian language documents such as Telugu has remained a challenging problem due to the variety of errors incurred during the post-processing stage of OCR. OCR passes the document image through a linear processing pipeline that encompasses pre-processing, segmentation, feature extraction, classification, recognition and post-processing. The simplicity of the post-processing stage of an OCR system relies heavily on the structural diacritics of the alphabet set of the particular language. The repeated occurrence of vowel-consonant clusters in almost every part of Telugu text increases the complexity of processing in the various stages of OCR. Since each vowel-consonant cluster is built of multiple symbols, performing symbol-level segmentation, maintaining the semantic ordering and further classifying by reordering rules may lead to erroneous post-processing output.

In recent times, the advancement of smart technologies has greatly increased the usage of OCR. Digitizing hard-copy text documents into scanned image formats has become common practice in every walk of life. OCR extends the usage of such scanned images by allowing them to be edited in real time for future purposes. In the process of transforming scanned text images to an editable format, the output of OCR may be corrupted during the post-processing stage. Post-processing is the final stage of OCR, where document construction takes place. This stage involves grouping the various symbols belonging to the components of vowel-consonant clusters and words; the process of associating symbols into strings is referred to as grouping [1]. The task of post-processing is much simpler for Latin scripts than for Indic scripts such as Telugu, Kannada and Malayalam. The alphabets of European or Roman languages such as English have a simple structure in which each character is a unique symbol with a single Unicode combination; this makes post-processing in English OCR systems very simple, reducing the chance of errors due to segmentation, semantic ordering and the mapper module [2]. The Telugu alphabet is derived from the Brahmi script [3]; its characters possess a variety of structural characteristics with complex graphemes, and combinations of one or more characters generated from these alphabets result in single/multi-conjunct vowel-consonant clusters, which complicates segmentation, classification and recognition.

The major challenge in building an error-free Telugu character recognition (TCR) system is handling the variety of errors encountered during the recognition process. Errors may occur during segmentation of single/multi-conjunct vowel-consonant clusters, classification of characters with similar visual appearance (confusing classes), reordering of the classified characters according to the language rules, and one-to-many mapping of a single character class to a valid Unicode combination by the mapper. To rectify the various errors in the OCR output, i.e., the editable text document of the image, traditional post-processing procedures such as error detection and correction are employed. Error detection and correction require the building of huge dictionaries and the use of statistical and language models [4, 5], which are popularly called spell checkers and
978-1-4673-6667-0/15/$31.00©2015 IEEE
are termed post processors in the context of OCR. The development of post processors for Indic scripts is more complicated than for European or Roman languages for a variety of reasons: the lack of standard corpus databases and language models, the need to perform transliteration [6] before error detection and after error correction, issues related to encoding and decoding of Unicode combinations, and computational complexity and memory requirements. Even minor errors in reordering the glyphs belonging to a single vowel-consonant cluster may lead to misleading recognition results, as indicated in Fig. 1.

Fig. 1 Ordering of vowel-conjunct in a vowel-consonant cluster

Further, there is considerable scope for incorrect segmentation if the document image is occluded with noise and clutter while a particular vowel-consonant cluster is decomposed into its constituent parts. For noisy document images, the pre-processing procedures may introduce artifacts such as breakages and merges, as shown in Fig. 2 and Fig. 3 respectively. Breakages and merges occur due to non-uniform illumination or poor resolution in printed images, and due to the writer's freedom in handwritten images; breakages split an individual textual component into multiple parts, while merges combine two or more textual components into a single component, forming touching characters.

Fig. 2(a) Original image Fig. 2(b) Image with breakages

Fig. 3(a) Original Image Fig. 3(b) Image with merges

Many approaches to post-processing are reported in the literature for languages like English, based on unigram/bigram/n-gram analysis, string matching, statistical measures such as F-scores and confidence scores [7], probabilistic models based on Bayes' theorem [8], substitution, transposition and deletion operations [9], keyword-based information [10], spelling suggestions from search engines [11], and statistical features of character co-occurrences [12].

However, very few attempts at post-processing Telugu OCR data are reported in the literature. Youssef and Mohammad [13] worked on the detection and correction of English and Arabic words misspelled in OCR output using a database of words collected from Google's spelling suggestions. Although the method achieved a reduced error rate in post-processing English and Arabic words, it does not incorporate a strategy for determining the probability that each word contains an error. Naveen and Jawahar [14] contributed various error models for Malayalam and Telugu using statistical language learning models, with F-score measures employed for error detection. Although the method achieved appreciable accuracy, composing language models and dictionaries for Indian languages is very challenging and the experimentation requires a vast dataset. Youssef and Alwani [15] investigated error correction methods for speech recognition output based on the Bing search engine; the experimental database was composed by collecting word tokens submitted as queries to the search engine, and the accuracy of the system is around 95%. Bansal and Sinha [16] reported an error correction method for Devanagari and Hindi OCR data using a Hindi word dictionary. Lehal and Chandan [17] devised a post processor for the Gurmukhi script using statistical information and grammar rules of the Punjabi language and attained an accuracy of about 94%; the corpus is built on the frequency of occurrence of words and similar-looking characters in Punjabi, and the method was tested on results obtained from clean images. Anoop and Namboodiri [18] proposed a post-processing scheme for Indian languages that exploits classical poetry structure to capture the linguistic features of the language instead of relying on statistical or algorithmic error-correction models, thereby reducing the ambiguity in recognizing similar-looking characters. Pal et al. [19] proposed an error detection and correction technique for the Bangla script employing morphological parsing with two separate lexicons of root words and suffixes: the candidate root-suffix pairs of each input string are detected, their grammatical agreement is tested, and the root/suffix part in which the error has occurred is noted. The method achieved an accuracy of 84% in the error correction module.

To the best of our knowledge, the methods in the literature incorporate the error correction module as an isolated module after recognition, and most of the approaches are based on dictionaries, statistical and linguistic models, lexicons and grammatical rules. The reported experiments most often focus on languages like English, Arabic, Latin, Devanagari, Gurmukhi and Bangla, and few works are reported on Telugu OCR data. In this paper, we propose a mapper module that maps each recognized class label to the appropriate Unicode combination. The mapper module acquires the distinctive parts of the Unicode sequence while mapping, enabling dynamic error detection and correction. The error detection and correction is performed against a database built from the distinctive parts of the hexadecimal Unicode combinations at the single/multi-conjunct vowel-consonant cluster and word levels, for only those strings that are composed of confusing characters or aksharas of Telugu. The error detection and correction module in the present work is thus incorporated as a part of the mapping process from classifier output to the respective Unicode combination. Section 2 summarizes the Unicode rules devised for post
processing of Telugu script, Section 3 elaborates the proposed method for error detection and correction, Section 4 discusses the experimental results of the proposed method and Section 5 concludes the work.

II. RULES FOR GROUPING OF SYMBOLS IN TELUGU SCRIPT

The Telugu alphabet set is composed of 55 letters in total, as shown in Fig. 4: 16 vowels (Achchulu) as in Fig. 4a, 36 consonants (Hallullu) as in Fig. 4b, and 3 letters that act as both vowels and consonants, called Ubhaya-aksharaalu in Telugu, as in Fig. 4c. The vowels are unique and can be represented in their primary and secondary forms as in Fig. 4d; the secondary forms combine with consonants. Consonants can be combined with vowels to represent different words, which results in vowel-consonant clusters. The processing of vowel-consonant clusters with bi-level and multi-level consonant conjuncts is described in the following subsections.

A. Bi-level and multi-level vowel-consonant clusters

A vowel-consonant cluster is a group of a vowel, one or more conjunct consonants and a vowel modifier. Vowels are represented in various forms when combined with consonants, as depicted in Fig. 4d, and consonants likewise assume various forms with double or multiple conjunct consonants. Some conjunct consonants have more than two members and are used frequently in Sanskrit-derived script. A vowel-consonant cluster may consist of one or more consonants + vowel diacritics + 0 to 2 conjunct consonants [20], as shown in Fig. 5.

Let V represent the number of vowels, C the number of consonants and CC the number of conjunct consonants used to identify the diacritics of a vowel-consonant cluster. If V=1, C=1 and CC=0, the cluster contains one vowel, one consonant and no conjunct consonants; if V=1, C=1 and CC=1, it contains one vowel, one consonant and one conjunct consonant; if V=1, C=1 and CC=2, one vowel, one consonant and two conjunct consonants; and if V=2, C=1 and CC=2, two vowels, one consonant and two conjunct consonants.
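The V/C/CC composition rules above can be sketched as a small validity check. This is a hypothetical illustration (the function names and representation are not from the paper), encoding only the stated constraint of one or more consonants and 0 to 2 conjunct consonants per cluster [20]:

```python
def is_valid_cluster(v: int, c: int, cc: int) -> bool:
    """Check a (V, C, CC) cluster descriptor against the composition rule:
    at least one consonant, zero to two conjunct consonants."""
    return v >= 0 and c >= 1 and 0 <= cc <= 2

def describe_cluster(v: int, c: int, cc: int) -> str:
    """Plain-text description of a vowel-consonant cluster descriptor."""
    if not is_valid_cluster(v, c, cc):
        raise ValueError(f"invalid cluster descriptor V={v}, C={c}, CC={cc}")
    return f"{v} vowel(s), {c} consonant(s), {cc} conjunct consonant(s)"

print(describe_cluster(1, 1, 0))  # simple cluster: one vowel, one consonant
print(describe_cluster(2, 1, 2))  # multi-level cluster with two conjuncts
```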
The cost of post-processing errors is high in the case of South Indian language OCRs such as Telugu, since the script contains a very large number of similar-looking characters.

The proposed methodology for the post-processing of Telugu script is designed in two phases, as indicated in Fig. 6. The first phase encompasses the acquisition of the input from the post processor, which in the proposed methodology is the sequence of hexadecimal representations of the Unicodes corresponding to a word. The word-level output of the post processor, prior to mapping it to the text document, is moderated into a string approximation model called the Unicode approximation model (UAM). A UAM is either valid or invalid: a valid UAM represents a post-processed word that is spell-corrected as per the language rules, whereas an invalid UAM does not. The second phase encompasses the mapping process for the invalid UAMs, in which the valid UAM for each invalid counterpart is identified through a string pattern matching algorithm and dynamic time warping is performed to replace the invalid UAM with the valid one. The proposed methods for phase one and phase two are explained in the subsequent subsections.

Fig. 6 Architecture of Telugu Post processing System

A. Unicode Approximation Models (UAM)

A UAM is the sequence of lexicons used to represent a word or string that is spell-corrected as per the rules of the language. The correctness of the UAM representing a string depends on the context of the documents used for text extraction, grammar, spellings, etc. The present work focuses on the design of UAMs for the words commonly used to compose Telugu application forms; the UAMs are defined for the strings in common use for filling in application forms in the regions of Anantapur district, Andhra Pradesh. The UAM is defined as a sequence of distinctive components of the hexadecimal Unicode representations of a lexicon in a word. For example, consider the word ేరు in Telugu. The corresponding word image W, represented in Fig. 7, is composed of a set of Unicode combinations {U1; U2; U3; …; Un}. The U1, U2, …, Un are mapped as per the class labels recognized for each character by the classifier module. The distinctive features {f1, f2} existing in each Un are collected and composed into the UAM, as given in Fig. 8.

Fig. 7 Classification of word image components into Unicodes

Fig. 8 Modeling of UAM for a word in Fig. 7

The UAM is unique with respect to every correct word as per the language rules; the grammatically correct words are the words with valid UAMs.

B. Mapping of Invalid UAMs with Valid UAMs

The mapping of invalid UAMs to valid UAMs is performed through a process of dynamic time warping, cross-evaluating against the database of UAMs. The database consists of a table of two columns, holding the valid UAMs and their corresponding invalid UAMs. In the proposed work, the database is populated with the set of words commonly used in Telugu application form documents belonging to the regions of Anantapur district, Andhra Pradesh.

The UAMs in the present work are composed only for words containing confusing aksharas or character pairs of the Telugu language. The confusing character pairs are similar in appearance; owing to varying handwriting styles, some printed font styles and errors in preprocessing, there is much scope for the number of confusing character classes to grow. Such classes are depicted in Fig. 9. Errors due to confusing character classes are found mostly in handwritten Telugu documents, as shown in Fig. 10, and even in some Telugu font styles, due to the variety of factors mentioned above.

Fig. 9 Confusing Character Classes
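Inspecting the valid UAMs in Table 1 against the Telugu Unicode block suggests (this is our reading, not stated explicitly by the authors) that the "distinctive component" of each character's hexadecimal representation is its final UTF-8 byte: Telugu characters share the leading bytes E0 B0 or E0 B1, so only the last byte distinguishes them. A minimal sketch under that assumption:

```python
def uam(word: str) -> str:
    """Compose a word-level UAM from the distinctive (final) UTF-8 byte
    of each character; Telugu characters share the E0 B0 / E0 B1 prefix."""
    return "".join(f"{ch.encode('utf-8')[-1]:02X}" for ch in word)

# Reproduces valid UAMs listed in Table 1:
print(uam("దరఖాస్తూ"))  # → A6B096BEB88DA482
print(uam("నస"))        # → A8B8
```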
Some of the word-level UAMs generated from the outputs of the classifier are tabulated in Table 1. The starting two letters of the UAM are considered as the index pointer to the access locations of words or strings starting with the same character in the UAM database, as shown in Fig. 11. The database is partitioned into blocks, where each block consists of words starting with the same character.

Fig. 11 UAM Database Modeling

Whenever an access is to be made to the database, the block with the corresponding starting index is referred to. Each post-processed word, prior to document construction, is validated against the UAM database models to verify whether its UAM is valid or invalid. If the generated UAM is present in a block, it is an invalid UAM and the corresponding valid UAM is retrieved. Each valid UAM is mapped to its corresponding hexadecimal Unicode representation, separating one Unicode combination from another by a semicolon, as depicted in Fig. 12. The valid UAM is then redirected for text document construction, spanning from the level of a simple lexicon to vowel-consonant clusters to word-level combinations. The function can also easily be extended to perform even line-level post-processing for document construction.

TABLE 1 UAMs generated from words composed of confusing character classes

Correct Word | Valid UAM | Invalid UAM | Incorrect Word
దరఖాసూ త్ | A6B096BEB88DA482 | A6A096BEB88DA482 | దఠఖాసూ త్
దరఖాసూ త్ | A6B096BEB88DA482 | A7A096BEB88DA482 | ధఠఖాసూ త్
దరఖాసూ త్ | A6B096BEB88DA482 | A7B096BEB88DA482 | ధరఖాసూ త్
దరఖాసూ త్ | A6B096BEB88DA482 | A6B096BEA88DA482 | ధరఖానూ త్
పుటిట్న | A819F8D9FBFA8 | A819F8D9FA8 | పుటట్ న
పుటిట్న | A819F8D9FBFA8 | A819F8D9FBFB5 | పుటిట్వ
పుటిట్న | A819F8D9FBFA8 | B519F8D9FBFB5 | వుటిట్వ
పుటిట్న | A819F8D9FBFA8 | B519F8D9FBFA8 | వుటిట్న
త౦ ిర్ | A4A6A18DB0BF | A4A6A68DB0BF | త౦ ిర్
ేరు | AA87B081 | B587B081 | ేరు
ే ి | A487A6BF | B287A6BF | లే ి
ే ి | A487A6BF | B287A1BF | లే ి
ే ి | A487A6BF | A487A1BF | ే ి
అన౦తపుర౦ | 85A8A6A4AA81B0A6 | 85B5A6A4AA81B0A6 | అవ౦తపుర౦
అన౦తపుర౦ | 85A8A6A4AA81B0A6 | 85B5A6A4B581B0A6 | అవ౦తవుర౦
వ ాలు | B5BFB5B0BEB281 | B5BFAAB0BEB281 | ప ాలు
వ ాలు | B5BFB5B0BEB281 | AABFAAB0BEB281 | ిప ాలు
అయాయ్ | 85AF8DAFBE | 85AE8DAEBE | అమామ్
ఇ చ్న | 879A8D9ABF | 87AC8DACBF | ఇ బ్న
ఎ ి౦ ి ౖె | 8EAABFA6A1BFB888 | 8EA6A6BFABFB888 | ఎ౦౦౦◌ి ి ై
ఎ ి౦ ి ౖె | 8EAABFA6A1BFB888 | 8EAABFA6A6BFB888 | ఎ ి౦ ి ౖె
నస | A8B8 | A8A8 | నన
నస | A8B8 | B8B8 | సస
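The block-partitioned lookup described above can be sketched as follows. The dictionary layout and helper names are assumptions for illustration: blocks are indexed by the starting two hex letters of a UAM, and each block maps invalid UAMs to their valid counterparts (entries taken from Table 1):

```python
# Valid UAMs known to the system (sample entry from Table 1).
VALID_UAMS = {"A6B096BEB88DA482"}

# Database partitioned into blocks keyed by the starting two letters
# of the (invalid) UAM; each block maps invalid -> valid UAM.
UAM_DB = {
    "A6": {"A6A096BEB88DA482": "A6B096BEB88DA482"},
    "A7": {"A7B096BEB88DA482": "A6B096BEB88DA482"},
}

def correct_uam(u: str) -> str:
    """Validate a word-level UAM; replace an invalid UAM with its valid
    counterpart from the indexed block, or return it unchanged if unseen."""
    if u in VALID_UAMS:
        return u
    block = UAM_DB.get(u[:2], {})  # index pointer: starting two letters
    return block.get(u, u)

print(correct_uam("A7B096BEB88DA482"))  # → A6B096BEB88DA482
```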
IV. EXPERIMENTAL RESULTS

The accuracy of the experimental results of error correction for words with confusing characters is computed as follows. Let T_w represent the total number of words, T_{vw} the number of words with valid UAMs and T_{iw} the number of words with invalid UAMs. The total number of words with an invalid UAM is given by (1):

T_{iw} = T_w - T_{vw}    (1)

Let W_n represent a word with an invalid UAM and N_e the number of errors within each word; the error rate of each word is defined as in (2):

E(W_n) = N_e    (2)

The total number of errors over all the words with invalid UAMs is given by (3):

T_e(T_{iw}) = \sum_{n=1}^{T_{iw}} E(W_n)    (3)

If N_c represents the number of errors corrected, then the character-level accuracy of the UAM technique for error correction is given by (4):

Accuracy = N_c / T_e(T_{iw})    (4)

The word-level accuracies of the experimentation are given by (5).

The UAM models prove to be consistent in solving the recognition errors of OCR incurred due to poor pre-processing, segmentation errors and re-ordering errors in classification. The models are designed for the data relevant to the application form documents covering the regions of Anantapur district, Andhra Pradesh. Increasing the number of UAM models is expected to improve the effectiveness of the proposed methodology. Since there are no standard datasets for the proposed system on Telugu language documents, the datasets are derived from the group of confusing character classes and the analysis is performed on them.

V. CONCLUSIONS

Overall, this paper has outlined a methodology for correcting spelling errors that may be incurred during the post-processing stage of OCR through the design of a novel string approximation model called the Unicode approximation model (UAM). The experimentation indicates that the accuracies obtained are adequate for printed words and need to be revised with more UAMs for handwritten words, since the variety of errors in handwritten words cannot be predicted in its entirety. UAMs are very flexible to handle and can easily be extended to any South Indian language and to any number of words, with or without confusing character classes. Thus the proposed algorithm is capable of handling either printed or handwritten words of any language for which a Unicode representation exists. In future, the efficiency of the proposed system can be enhanced by incorporating a large number of UAMs covering all the grammatical and spelling corrections to be performed at line level. Since no similar works are reported in the literature on Telugu post-processing techniques, a comparative analysis of the proposed method with other methods could not be made.
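The error-rate and accuracy measures in equations (1)-(4) can be illustrated with a short computation; the word counts and per-word error counts below are hypothetical:

```python
def uam_accuracy(total_words: int, valid_words: int,
                 errors_per_word: list, corrected: int) -> float:
    """Character-level accuracy per equations (1)-(4)."""
    t_iw = total_words - valid_words      # (1) number of invalid-UAM words
    assert len(errors_per_word) == t_iw   # one error count N_e per word, eq. (2)
    t_e = sum(errors_per_word)            # (3) total errors over invalid words
    return corrected / t_e                # (4) corrected / total errors

# e.g. 10 words, 7 with valid UAMs; the 3 invalid words contain
# 2, 1 and 1 errors, of which 3 are corrected overall:
print(uam_accuracy(10, 7, [2, 1, 1], 3))  # → 0.75
```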
REFERENCES

[8] Federico Boschetti, Matteo Romanello, Alison Babeu, David Bamman, Gregory Crane, "Improving OCR Accuracy for Classical Critical Editions", Tufts University, Perseus Digital Library, Medford, MA, USA.
[9] Kai Niklas, "Unsupervised Post-Correction of OCR Errors", Diploma Thesis, Leibniz Universität Hannover, Fakultät für Elektrotechnik und Informatik, Institut für verteilte Systeme, Fachgebiet Wissensbasierte Systeme, Forschungszentrum L3S.
[10] Hisao Niwa, Kazuhiro Kayashima, Yasuharu Shimeki, "Postprocessing for character recognition using keyword information", MVA '92 IAPR Workshop on Machine Vision Applications, Dec. 7-9, 1992, Tokyo.
[11] Youssef Bassil, Mohammad Alwani, "OCR Post-Processing Error Correction Algorithm Using Google's Online Spelling Suggestion", Journal of Emerging Trends in Computing and Information Sciences, Vol. 3, No. 1, January 2012, ISSN 2079-8407.
[12] S. Kaki, E. Sumita, H. Iida, "A method for correcting errors in speech recognition using the statistical features of character co-occurrence", COLING-ACL, pp. 653-657, Montreal, Quebec, Canada, 1998.
[13] Youssef Bassil, Mohammad Alwani, "OCR post-processing error correction algorithm using Google's online spelling suggestion", Journal of Emerging Trends in Computing and Information Sciences, ISSN 2079-8407, Vol. 3, No. 1, January 2012.
[14] Naveen Sankaran, C. V. Jawahar, "Error Detection in Highly Inflectional Languages", International Institute of Information Technology, Hyderabad, India.
[15] Youssef Bassil, Mohammad Alwani, "Post-Editing Error Correction Algorithm for Speech Recognition using Bing Spelling Suggestion", (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 3, No. 2, 2012.
[16] Bansal, R. M. K. Sinha, "Partitioning and Searching Dictionary for Correction of Optically Read Devanagari Character Strings", International Journal on Document Analysis and Recognition.
[17] G. S. Lehal, Chandan Singh, "A post-processor for Gurmukhi OCR", Sādhanā, Vol. 27, Part 1, February 2002, pp. 99-111.
[18] Anoop M. Namboodiri, P. J. Narayanan, C. V. Jawahar, "On Using Classical Poetry Structure for Indian Language Post-Processing", International Institute of Information Technology, Hyderabad, India.
[19] U. Pal, P. K. Kundu, B. B. Chaudhuri, "OCR Error Correction of an Inflectional Indian Language using Morphological Parsing", Journal of Information Science and Engineering, 16, 903-922 (2000).
[20] J. Bharathi, P. Chandrasekhara Reddy, "Segmentation of Telugu Touching Conjunct Consonants Using Overlapping Bounding Boxes", International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Vol. 5, No. 06, June 2013.
[21] Nikhil Rajiv Pai, Vijaykumar S. Kolkure, "Design and implementation of optical character recognition using template matching for multi fonts/size", IJRET: International Journal of Research in Engineering and Technology, eISSN: 2319-1163, pISSN: 2321-7308.