Post-Processing Methodology For Word Level Telugu Character Recognition Systems Using Unicode Approximation Models
Abstract— Digitization and automatic interpretation of document images into an editable document format is the primary goal of optical character recognition (OCR) systems. This paper proposes a novel technique for resolving the post-processing errors that occur in Telugu OCR using word-level Unicode Approximation Models (UAM) through a mapper module. The mapper module performs a word-level one-to-one mapping that assigns a sequence of recognized class labels to the appropriate UAM. Each sequence of recognized class labels corresponds to one particular word and is generated as output by the classifier. The proposed algorithm effectively resolves segmentation errors, preprocessing errors such as cuts and merges in characters, noise, occlusions, semantic-ordering errors and confusing character classes. The proposed UAM models provide adequate and consistent accuracies of around 96.2% for printed words and 91.7% for handwritten words.

Keywords— Optical Character Recognition, South Indian languages, UAMs, Telugu language, Post-Processing, vowel-consonant clusters.

I. INTRODUCTION

The interpretation of South Indian language documents such as Telugu has remained a challenging problem due to the variety of errors incurred during the post-processing stage of OCR. OCR passes the document image through a linear processing pipeline that encompasses pre-processing, segmentation, feature extraction, classification, recognition and post-processing. The simplicity of the post-processing stage of an OCR system relies heavily on the structural diacritics of the alphabet set of the particular language. The repeated occurrence of vowel-consonant clusters in almost every part of Telugu text increases the complexity of processing in the various stages of OCR. Since each vowel-consonant cluster is built of multiple symbols, performing symbol-level segmentation, maintaining the semantic ordering and further classifying by reordering rules may lead to erroneous post-processing output.

In recent times, the advancement of smart technologies has greatly increased the usage of OCR. Digitizing hard-copy text documents into scanned image formats has become common practice in every walk of life. OCR extends the usage of such scanned images by allowing them to be edited in real time for future purposes. In the process of transforming scanned text images to an editable format, the output of OCR may be corrupted during the post-processing stage. Post-processing is the final stage of OCR, where document construction takes place. This stage involves grouping the various symbols belonging to the components of vowel-consonant clusters and words; the process of associating symbols into strings is referred to as grouping [1]. The task of post-processing is much simpler for Latin scripts than for Indic scripts such as Telugu, Kannada and Malayalam. The alphabets of European or Roman languages such as English have a simple structure in which each character is a unique symbol with a single Unicode combination; this makes post-processing in English OCR systems very simple, reducing the chance of errors due to segmentation, semantic ordering and the mapper module [2]. The Telugu alphabet is derived from the Brahmi script [3]; its characters possess a variety of structural characteristics with complex graphemes, and combinations of one or more characters generated from these alphabets result in single/multi-conjunct vowel-consonant clusters, which complicates segmentation, classification and recognition.

The major challenge in building an error-free Telugu character recognition (TCR) system is handling the variety of errors encountered during the recognition process. Errors may occur during segmentation of single/multi-conjunct vowel-consonant clusters, classification of characters with similar visual appearance (confusing classes), reordering of the classified characters according to the language rules, and one-to-many mapping of a single character class to a valid Unicode combination by the mapper. To rectify the various errors in the OCR output, i.e., the editable text document of the image, traditional post-processing procedures such as error detection and correction are employed. Error detection and correction require the building of huge dictionaries and the use of statistical and language models [4, 5], which are popularly called spell checkers and
978-1-4673-6667-0/15/$31.00©2015 IEEE
are termed post processors in the context of OCR. The development of post processors for Indic scripts is more complicated than for European or Roman languages for a variety of reasons: the lack of standard corpus databases and language models, the need to perform transliteration [6] before error detection and after error correction, issues related to encoding and decoding of Unicode combinations, and computational complexity and memory requirements. Even minor errors in reordering the glyphs belonging to a single vowel-consonant cluster may lead to misleading recognition results, as indicated in Fig. 1.

Fig. 1 Ordering of vowel-conjunct in a vowel-consonant cluster

Further, there is considerable scope for incorrect segmentation if the document image is occluded with noise and clutter while a particular vowel-consonant cluster is decomposed into its constituent parts. For noisy document images, the pre-processing procedures may introduce artifacts such as breakages and merges, as shown in Fig. 2 and Fig. 3 respectively. Breakages and merges occur due to non-uniform illumination or poor resolution in printed images, and due to the writer's freedom in handwritten images; breakages split an individual textual component into multiple parts, while merges combine two or more textual components into a single component, forming touching characters.

Fig. 2(a) Original image Fig. 2(b) Image with breakages

Fig. 3(a) Original Image Fig. 3(b) Image with merges

Many approaches to post-processing are reported in the literature for languages like English, based on unigram/bigram/n-gram analysis, string matching, statistical measures such as F-scores and confidence scores [7], probabilistic models based on Bayes' theorem [8], substitution, transposition and deletion operations [9], keyword-based information [10], spelling suggestions from search engines [11], and statistical features of character co-occurrences [12].

However, very few attempts at post-processing Telugu OCR data are reported in the literature. Youssef and Mohammad [13] worked on the detection and correction of English and Arabic words misspelled in OCR output using a database of words collected from Google's spelling suggestions. Although the method achieved a reduced error rate in post-processing English and Arabic words, it does not incorporate a strategy for determining the probability that each word contains an error. Naveen and Jawahar [14] contributed various error models for Malayalam and Telugu using statistical language learning models, with F-score measures employed for error detection. Although the method achieved appreciable accuracy, composing language models and dictionaries for Indian languages is very challenging and the experimentation requires a vast dataset. Youssef and Alwani [15] investigated error correction methods for speech recognition output based on the Bing search engine; the experimental database was composed by collecting word tokens submitted as queries to the search engine, and the accuracy of the system is around 95%. Bansal and Sinha [16] reported an error correction method for Devanagari and Hindi OCR data using a Hindi word dictionary. Lehal and Chandan [17] devised a post processor for the Gurmukhi script using statistical information and grammar rules of the Punjabi language and attained an accuracy of about 94%; the corpus is built on the frequency of occurrence of words and similar-looking characters in Punjabi, and the method was tested on results obtained from clean images. Anoop and Namboodiri [18] proposed a post-processing scheme for Indian languages that exploits classical poetry structure to capture the linguistic features of the language instead of relying on statistical or algorithmic error-correction models, thereby reducing the ambiguity in recognizing similar-looking characters. Pal et al. [19] proposed an error detection and correction technique for the Bangla script employing morphological parsing with two separate lexicons of root words and suffixes: the candidate root-suffix pairs of each input string are detected, their grammatical agreement is tested, and the root/suffix part in which the error has occurred is noted. The method achieved an accuracy of 84% in the error correction module.

To the best of our knowledge, the methods in the literature incorporate the error correction module as an isolated module after recognition, and most of the approaches are based on dictionaries, statistical and linguistic models, lexicons and grammatical rules. The reported experiments most often focus on languages like English, Arabic, Latin, Devanagari, Gurmukhi and Bangla, and few works are reported on Telugu OCR data. In this paper, we propose a mapper module that maps each recognized class label to the appropriate Unicode combination. The mapper module acquires the distinctive parts of the Unicode sequence while mapping, enabling dynamic error detection and correction. The error detection and correction is performed against a database built from the distinctive parts of the hexadecimal Unicode combinations at the single/multi-conjunct vowel-consonant cluster and word levels, for only those strings that are composed of confusing characters or aksharas of Telugu. The error detection and correction module in the present work is thus incorporated as a part of the mapping process from classifier output to the respective Unicode combination. Section 2 summarizes the Unicode rules devised for post
processing of Telugu script, Section 3 elaborates the proposed method for error detection and correction, Section 4 discusses the experimental results of the proposed method and Section 5 concludes the work.

II. RULES FOR GROUPING OF SYMBOLS IN TELUGU SCRIPT

The Telugu alphabet set is composed of 55 letters in total, as shown in Fig. 4: 16 vowels (Achchulu) as in Fig. 4a, 36 consonants (Hallullu) as in Fig. 4b, and 3 letters that act as both vowels and consonants, called Ubhaya-aksharaalu in Telugu, as in Fig. 4c. The vowels are unique and can be represented in their primary and secondary forms as in Fig. 4d; the secondary forms combine with consonants. Consonants can be combined with vowels to represent different words, which results in vowel-consonant clusters. The processing of vowel-consonant clusters with bi-level and multi-level consonant conjuncts is described in the following subsections.

A. Bi-level and multi-level vowel-consonant clusters

A vowel-consonant cluster is a group of a vowel, one or more conjunct consonants and a vowel modifier. Vowels are represented in various forms when combined with consonants, as depicted in Fig. 4d, and consonants likewise assume various forms with double or multiple conjunct consonants. Some conjunct consonants have more than two members and are used frequently in Sanskrit-derived script. A vowel-consonant cluster may consist of one or more consonants + vowel diacritics + 0 to 2 conjunct consonants [20], as shown in Fig. 5.

Let V represent the number of vowels, C the number of consonants and CC the number of conjunct consonants used to identify the diacritics of a vowel-consonant cluster. If V=1, C=1 and CC=0, the cluster contains one vowel, one consonant and no conjunct consonants; if V=1, C=1 and CC=1, it contains one vowel, one consonant and one conjunct consonant; if V=1, C=1 and CC=2, one vowel, one consonant and two conjunct consonants; and if V=2, C=1 and CC=2, two vowels, one consonant and two conjunct consonants.
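The V/C/CC composition rules above can be sketched as a small validity check. This is a hypothetical illustration (the function names and representation are not from the paper), encoding only the stated constraint of one or more consonants and 0 to 2 conjunct consonants per cluster [20]:

```python
def is_valid_cluster(v: int, c: int, cc: int) -> bool:
    """Check a (V, C, CC) cluster descriptor against the composition rule:
    at least one consonant, zero to two conjunct consonants."""
    return v >= 0 and c >= 1 and 0 <= cc <= 2

def describe_cluster(v: int, c: int, cc: int) -> str:
    """Plain-text description of a vowel-consonant cluster descriptor."""
    if not is_valid_cluster(v, c, cc):
        raise ValueError(f"invalid cluster descriptor V={v}, C={c}, CC={cc}")
    return f"{v} vowel(s), {c} consonant(s), {cc} conjunct consonant(s)"

print(describe_cluster(1, 1, 0))  # simple cluster: one vowel, one consonant
print(describe_cluster(2, 1, 2))  # multi-level cluster with two conjuncts
```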
The cost of post-processing errors is high in the case of South Indian language OCRs such as Telugu, since the script contains a very large number of similar-looking characters.

The proposed methodology for the post-processing of Telugu script is designed in two phases, as indicated in Fig. 6. The first phase encompasses the acquisition of the input from the post processor, which in the proposed methodology is the sequence of hexadecimal representations of the Unicodes corresponding to a word. The word-level output of the post processor, prior to mapping it to the text document, is moderated into a string approximation model called the Unicode approximation model (UAM). A UAM is either valid or invalid: a valid UAM represents a post-processed word that is spell-corrected as per the language rules, whereas an invalid UAM does not. The second phase encompasses the mapping process for the invalid UAMs, in which the valid UAM for each invalid counterpart is identified through a string pattern matching algorithm and dynamic time warping is performed to replace the invalid UAM with the valid one. The proposed methods for phase one and phase two are explained in the subsequent subsections.

Fig. 6 Architecture of Telugu Post processing System

A. Unicode Approximation Models (UAM)

A UAM is the sequence of lexicons used to represent a word or string that is spell-corrected as per the rules of the language. The correctness of the UAM representing a string depends on the context of the documents used for text extraction, grammar, spellings, etc. The present work focuses on the design of UAMs for the words commonly used to compose Telugu application forms; the UAMs are defined for the strings in common use for filling in application forms in the regions of Anantapur district, Andhra Pradesh. The UAM is defined as a sequence of distinctive components of the hexadecimal Unicode representations of a lexicon in a word. For example, consider the word ేరు in Telugu. The corresponding word image W, represented in Fig. 7, is composed of a set of Unicode combinations {U1; U2; U3; …; Un}. The U1, U2, …, Un are mapped as per the class labels recognized for each character by the classifier module. The distinctive features {f1, f2} existing in each Un are collected and composed into the UAM, as given in Fig. 8.

Fig. 7 Classification of word image components into Unicodes

Fig. 8 Modeling of UAM for a word in Fig. 7

The UAM is unique with respect to every correct word as per the language rules; the grammatically correct words are the words with valid UAMs.

B. Mapping of Invalid UAMs with Valid UAMs

The mapping of invalid UAMs to valid UAMs is performed through a process of dynamic time warping, cross-evaluating against the database of UAMs. The database consists of a table of two columns, holding the valid UAMs and their corresponding invalid UAMs. In the proposed work, the database is populated with the set of words commonly used in Telugu application form documents belonging to the regions of Anantapur district, Andhra Pradesh.

The UAMs in the present work are composed only for words containing confusing aksharas or character pairs of the Telugu language. The confusing character pairs are similar in appearance; owing to varying handwriting styles, some printed font styles and errors in preprocessing, there is much scope for the number of confusing character classes to grow. Such classes are depicted in Fig. 9. Errors due to confusing character classes are found mostly in handwritten Telugu documents, as shown in Fig. 10, and even in some Telugu font styles, due to the variety of factors mentioned above.

Fig. 9 Confusing Character Classes
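Inspecting the valid UAMs in Table 1 against the Telugu Unicode block suggests (this is our reading, not stated explicitly by the authors) that the "distinctive component" of each character's hexadecimal representation is its final UTF-8 byte: Telugu characters share the leading bytes E0 B0 or E0 B1, so only the last byte distinguishes them. A minimal sketch under that assumption:

```python
def uam(word: str) -> str:
    """Compose a word-level UAM from the distinctive (final) UTF-8 byte
    of each character; Telugu characters share the E0 B0 / E0 B1 prefix."""
    return "".join(f"{ch.encode('utf-8')[-1]:02X}" for ch in word)

# Reproduces valid UAMs listed in Table 1:
print(uam("దరఖాస్తూ"))  # → A6B096BEB88DA482
print(uam("నస"))        # → A8B8
```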
Some of the word-level UAMs generated from the outputs of the classifier are tabulated in Table 1. The starting two letters of the UAM are considered as the index pointer to the access locations of words or strings starting with the same character in the UAM database, as shown in Fig. 11. The database is partitioned into blocks, where each block consists of words starting with the same character.

Fig. 11 UAM Database Modeling

Whenever an access is to be made to the database, the block with the corresponding starting index is referred to. Each post-processed word, prior to document construction, is validated against the UAM database models to verify whether its UAM is valid or invalid. If the generated UAM is present in a block, it is an invalid UAM and the corresponding valid UAM is retrieved. Each valid UAM is mapped to its corresponding hexadecimal Unicode representation, separating one Unicode combination from another by a semicolon, as depicted in Fig. 12. The valid UAM is then redirected for text document construction, spanning from the level of a simple lexicon to vowel-consonant clusters to word-level combinations. The function can also easily be extended to perform even line-level post-processing for document construction.

TABLE 1 UAMs generated from words composed of confusing character classes

Correct Word | Valid UAM | Invalid UAM | Incorrect Word
దరఖాసూ త్ | A6B096BEB88DA482 | A6A096BEB88DA482 | దఠఖాసూ త్
దరఖాసూ త్ | A6B096BEB88DA482 | A7A096BEB88DA482 | ధఠఖాసూ త్
దరఖాసూ త్ | A6B096BEB88DA482 | A7B096BEB88DA482 | ధరఖాసూ త్
దరఖాసూ త్ | A6B096BEB88DA482 | A6B096BEA88DA482 | ధరఖానూ త్
పుటిట్న | A819F8D9FBFA8 | A819F8D9FA8 | పుటట్ న
పుటిట్న | A819F8D9FBFA8 | A819F8D9FBFB5 | పుటిట్వ
పుటిట్న | A819F8D9FBFA8 | B519F8D9FBFB5 | వుటిట్వ
పుటిట్న | A819F8D9FBFA8 | B519F8D9FBFA8 | వుటిట్న
త౦ ిర్ | A4A6A18DB0BF | A4A6A68DB0BF | త౦ ిర్
ేరు | AA87B081 | B587B081 | ేరు
ే ి | A487A6BF | B287A6BF | లే ి
ే ి | A487A6BF | B287A1BF | లే ి
ే ి | A487A6BF | A487A1BF | ే ి
అన౦తపుర౦ | 85A8A6A4AA81B0A6 | 85B5A6A4AA81B0A6 | అవ౦తపుర౦
అన౦తపుర౦ | 85A8A6A4AA81B0A6 | 85B5A6A4B581B0A6 | అవ౦తవుర౦
వ ాలు | B5BFB5B0BEB281 | B5BFAAB0BEB281 | ప ాలు
వ ాలు | B5BFB5B0BEB281 | AABFAAB0BEB281 | ిప ాలు
అయాయ్ | 85AF8DAFBE | 85AE8DAEBE | అమామ్
ఇ చ్న | 879A8D9ABF | 87AC8DACBF | ఇ బ్న
ఎ ి౦ ి ౖె | 8EAABFA6A1BFB888 | 8EA6A6BFABFB888 | ఎ౦౦౦◌ి ి ై
ఎ ి౦ ి ౖె | 8EAABFA6A1BFB888 | 8EAABFA6A6BFB888 | ఎ ి౦ ి ౖె
నస | A8B8 | A8A8 | నన
నస | A8B8 | B8B8 | సస
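The block-partitioned lookup described above can be sketched as follows. The dictionary layout and helper names are assumptions for illustration: blocks are indexed by the starting two hex letters of a UAM, and each block maps invalid UAMs to their valid counterparts (entries taken from Table 1):

```python
# Valid UAMs known to the system (sample entry from Table 1).
VALID_UAMS = {"A6B096BEB88DA482"}

# Database partitioned into blocks keyed by the starting two letters
# of the (invalid) UAM; each block maps invalid -> valid UAM.
UAM_DB = {
    "A6": {"A6A096BEB88DA482": "A6B096BEB88DA482"},
    "A7": {"A7B096BEB88DA482": "A6B096BEB88DA482"},
}

def correct_uam(u: str) -> str:
    """Validate a word-level UAM; replace an invalid UAM with its valid
    counterpart from the indexed block, or return it unchanged if unseen."""
    if u in VALID_UAMS:
        return u
    block = UAM_DB.get(u[:2], {})  # index pointer: starting two letters
    return block.get(u, u)

print(correct_uam("A7B096BEB88DA482"))  # → A6B096BEB88DA482
```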
IV. EXPERIMENTAL RESULTS

The accuracy of the experimental results of error correction for words with confusing characters is computed as follows. Let T_w represent the total number of words, T_{vw} the number of words with valid UAMs and T_{iw} the number of words with invalid UAMs. The total number of words with an invalid UAM is given by (1):

T_{iw} = T_w - T_{vw}    (1)

Let W_n represent a word with an invalid UAM and N_e the number of errors within each word; the error rate of each word is defined as in (2):

E(W_n) = N_e    (2)

The total number of errors over all the words with invalid UAMs is given by (3):

T_e(T_{iw}) = \sum_{n=1}^{T_{iw}} E(W_n)    (3)

If N_c represents the number of errors corrected, then the character-level accuracy of the UAM technique for error correction is given by (4):

Accuracy = N_c / T_e(T_{iw})    (4)

The word-level accuracies of the experimentation are given by (5).

The UAM models prove to be consistent in solving the recognition errors of OCR incurred due to poor pre-processing, segmentation errors and re-ordering errors in classification. The models are designed for the data relevant to the application form documents covering the regions of Anantapur district, Andhra Pradesh. Increasing the number of UAM models is expected to improve the effectiveness of the proposed methodology. Since there are no standard datasets for the proposed system on Telugu language documents, the datasets are derived from the group of confusing character classes and the analysis is performed on them.

V. CONCLUSIONS

Overall, this paper has outlined a methodology for correcting spelling errors that may be incurred during the post-processing stage of OCR through the design of a novel string approximation model called the Unicode approximation model (UAM). The experimentation indicates that the accuracies obtained are adequate for printed words and need to be revised with more UAMs for handwritten words, since the variety of errors in handwritten words cannot be predicted in its entirety. UAMs are very flexible to handle and can easily be extended to any South Indian language and to any number of words, with or without confusing character classes. Thus the proposed algorithm is capable of handling either printed or handwritten words of any language for which a Unicode representation exists. In future, the efficiency of the proposed system can be enhanced by incorporating a large number of UAMs covering all the grammatical and spelling corrections to be performed at line level. Since no similar works are reported in the literature on Telugu post-processing techniques, a comparative analysis of the proposed method with other methods could not be made.
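The error-rate and accuracy measures in equations (1)-(4) can be illustrated with a short computation; the word counts and per-word error counts below are hypothetical:

```python
def uam_accuracy(total_words: int, valid_words: int,
                 errors_per_word: list, corrected: int) -> float:
    """Character-level accuracy per equations (1)-(4)."""
    t_iw = total_words - valid_words      # (1) number of invalid-UAM words
    assert len(errors_per_word) == t_iw   # one error count N_e per word, eq. (2)
    t_e = sum(errors_per_word)            # (3) total errors over invalid words
    return corrected / t_e                # (4) corrected / total errors

# e.g. 10 words, 7 with valid UAMs; the 3 invalid words contain
# 2, 1 and 1 errors, of which 3 are corrected overall:
print(uam_accuracy(10, 7, [2, 1, 1], 3))  # → 0.75
```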
REFERENCES

[8] Federico Boschetti, Matteo Romanello, Alison Babeu, David Bamman, Gregory Crane, "Improving OCR Accuracy for Classical Critical Editions", Tufts University, Perseus Digital Library, Medford, MA, USA.
[9] Kai Niklas, "Unsupervised Post-Correction of OCR Errors", Diploma Thesis, Leibniz Universität Hannover, Fakultät für Elektrotechnik und Informatik, Institut für verteilte Systeme, Fachgebiet Wissensbasierte Systeme, Forschungszentrum L3S.
[10] Hisao Niwa, Kazuhiro Kayashima, Yasuharu Shimeki, "Postprocessing for character recognition using keyword information", MVA '92 IAPR Workshop on Machine Vision Applications, Dec. 7-9, 1992, Tokyo.
[11] Youssef Bassil, Mohammad Alwani, "OCR Post-Processing Error Correction Algorithm Using Google's Online Spelling Suggestion", Journal of Emerging Trends in Computing and Information Sciences, Vol. 3, No. 1, January 2012, ISSN 2079-8407.
[12] S. Kaki, E. Sumita, H. Iida, "A method for correcting errors in speech recognition using the statistical features of character co-occurrence", COLING-ACL, pp. 653-657, Montreal, Quebec, Canada, 1998.
[13] Youssef Bassil, Mohammad Alwani, "OCR post-processing error correction algorithm using Google's online spelling suggestion", Journal of Emerging Trends in Computing and Information Sciences, ISSN 2079-8407, Vol. 3, No. 1, January 2012.
[14] Naveen Sankaran, C. V. Jawahar, "Error Detection in Highly Inflectional Languages", International Institute of Information Technology, Hyderabad, India.
[15] Youssef Bassil, Mohammad Alwani, "Post-Editing Error Correction Algorithm for Speech Recognition using Bing Spelling Suggestion", (IJACSA) International Journal of Advanced Computer Science and Applications, Vol. 3, No. 2, 2012.
[16] Bansal, R. M. K. Sinha, "Partitioning and Searching Dictionary for Correction of Optically Read Devanagari Character Strings", International Journal on Document Analysis and Recognition.
[17] G. S. Lehal, Chandan Singh, "A post-processor for Gurmukhi OCR", Sādhanā, Vol. 27, Part 1, February 2002, pp. 99-111.
[18] Anoop M. Namboodiri, P. J. Narayanan, C. V. Jawahar, "On Using Classical Poetry Structure for Indian Language Post-Processing", International Institute of Information Technology, Hyderabad, India.
[19] U. Pal, P. K. Kundu, B. B. Chaudhuri, "OCR Error Correction of an Inflectional Indian Language using Morphological Parsing", Journal of Information Science and Engineering, 16, 903-922 (2000).
[20] J. Bharathi, P. Chandrasekhara Reddy, "Segmentation of Telugu Touching Conjunct Consonants Using Overlapping Bounding Boxes", International Journal on Computer Science and Engineering (IJCSE), ISSN: 0975-3397, Vol. 5, No. 06, June 2013.
[21] Nikhil Rajiv Pai, Vijaykumar S. Kolkure, "Design and implementation of optical character recognition using template matching for multi fonts/size", IJRET: International Journal of Research in Engineering and Technology, eISSN: 2319-1163, pISSN: 2321-7308.