Integration of Telugu Dictionary Into Tesseract OCR
I would sincerely like to thank my guide, Prof. J. Saketh Nath, for his motivating support throughout the semester and for the consistent direction he has given my work. I would also like to thank Prof. Ganesh Ramakrishnan for valuable guidance and discussions.
Contents

1 Introduction
  1.1 Motivation
2 Stage 1 Summary
  2.1 Comparison
    2.1.1 Pre-processing
    2.1.2 Feature Extraction
    2.1.3 Classification
  2.2 Adaptive Binarization
  2.3 Proposed improvement
3 Word Recognition
  3.1 Chopping joined characters
  3.2 Associating chopped characters
4 Experiments
5 Results
6 Conclusion and Future Work
Abstract
Applications of OCR in digitizing printed and hand-written documents have found their way into preserving historical documents. OCR for Telugu is particularly difficult because of the complexities of the script. The existing OCR systems for Telugu are Tesseract and Drishti. An in-house OCR system has been developed in an attempt to overcome the difficulties in recognizing Telugu characters. We compare the performance of Tesseract and the in-house system on a toy corpus, using an implementation of Myers' algorithm to compute their accuracies. We also give a detailed insight into the Tesseract system and its novel approaches, such as line finding, page layout analysis, feature extraction and adaptive classification, along with an overview of the in-house system. To improve the word accuracies of Tesseract, the dictionary has been populated with words that are likely to appear in historical documents, along with their frequencies. Words that are currently in use, i.e. colloquial words, have also been added to the dictionary to make the OCR robust to the type of document given as input. We have also observed the effect of sandhis on the word recognition module.
Chapter 1
Introduction
Figure 1.1: Alphabets in Telugu script
1.1 Motivation
The existing OCR systems for Telugu, Tesseract [7] and Drishti [1], place constraints on the input images, such as quality, resolution, font, and input/output formats. In spite of these constraints, the accuracy levels achieved by these systems are not satisfactory, the best of them being Tesseract. Tesseract has been trained only on the Pothana font, which makes it unsuitable as a font-independent system. Drishti showed better results only when the scans were laser-quality prints. An in-house system has been developed that helps in overcoming these constraints [5], [6].

Figure 1.4: Compound characters and complexities
Chapter 2
Stage 1 Summary
In Stage 1, we studied the implementation of the Tesseract software in detail and contrasted it against the in-house system. Each OCR system has primarily three stages: pre-processing, feature extraction, and classification. In the following sections we give a brief insight into the differences between the two OCR systems at each stage.
2.1 Comparison
In Tesseract, the image is stored in a raw format in the IMAGE data structure. The Leptonica library uses the Pix data structure to refer to this raw image, and all image processing operations are done on the Pix structure rather than on the raw data directly. The image is scanned row-wise and the RGB information is stored in a color map. The width, height, depth, the raster data, and a reference count of the pointers referring to the particular Pix object are stored.

In the in-house system, the Java imaging API stores the input as a BufferedImage, which is an Image with an accessible data buffer; it is therefore more efficient to work directly with a BufferedImage. A BufferedImage has a ColorModel and a Raster of image data. The ColorModel provides a color interpretation of the image's pixel data. The Raster performs the following functions:
- represents the rectangular coordinates of the image;
- maintains the image data in memory;
- provides a mechanism for creating multiple subimages from a single image data buffer;
- provides methods for accessing specific pixels within the image.
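As a minimal sketch of the Pix side, the following snippet (Leptonica is a C library, callable from C++) reads an image and queries the stored attributes mentioned above; the file name is a placeholder:

    #include <allheaders.h>  // Leptonica

    int main() {
        // Load a scanned page into a Pix; Leptonica owns the raster data.
        PIX* pix = pixRead("scan0002.jpg");   // placeholder file name
        if (!pix) return 1;

        l_int32 width  = pixGetWidth(pix);    // stored image attributes
        l_int32 height = pixGetHeight(pix);
        l_int32 depth  = pixGetDepth(pix);    // bits per pixel
        l_uint32* data = pixGetData(pix);     // raw raster data
        (void)width; (void)height; (void)depth; (void)data;

        pixDestroy(&pix);                     // decrements the reference count
        return 0;
    }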
2.1.1 Pre-processing
In pre-processing, the first step is image binarization. The image is binarized so that the text is in black, representing the foreground, and the rest is in white, representing the background. Tesseract uses a global binarization technique, Otsu's thresholding method, whereas the in-house system uses a MaxVariance threshold (Java Imaging API) for global binarization.
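To make the idea concrete, here is a minimal, self-contained sketch of Otsu's method on an 8-bit grayscale buffer. It illustrates the technique only and is not Tesseract's internal implementation:

    #include <array>
    #include <cstdint>
    #include <vector>

    // Global Otsu threshold for an 8-bit grayscale image: choose the
    // level t that maximizes the between-class variance of the
    // foreground and background pixel populations.
    int otsuThreshold(const std::vector<uint8_t>& pixels) {
        std::array<long, 256> hist{};
        for (uint8_t p : pixels) ++hist[p];

        const long total = static_cast<long>(pixels.size());
        double sumAll = 0.0;
        for (int i = 0; i < 256; ++i) sumAll += static_cast<double>(i) * hist[i];

        double sumBg = 0.0, bestVar = -1.0;
        long wBg = 0;
        int bestT = 0;
        for (int t = 0; t < 256; ++t) {
            wBg += hist[t];                    // background weight
            if (wBg == 0) continue;
            long wFg = total - wBg;            // foreground weight
            if (wFg == 0) break;
            sumBg += static_cast<double>(t) * hist[t];
            double meanBg = sumBg / wBg;
            double meanFg = (sumAll - sumBg) / wFg;
            double varBetween = static_cast<double>(wBg) * wFg
                                * (meanBg - meanFg) * (meanBg - meanFg);
            if (varBetween > bestVar) { bestVar = varBetween; bestT = t; }
        }
        return bestT;  // pixels <= bestT are treated as dark foreground text
    }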
The next step is character and word segmentation, in which connected components are found with an 8-way flood fill algorithm. In Tesseract, automatic page segmentation is first performed to mask the photographs in the image, and the page is divided into multiple blocks if there are multiple columns. For each block, outlines are extracted from the connected components through edge detection; these are then gathered and nested into blobs, which give the average character spacing. The words are then split using the character spacing. Tesseract also performs skew detection and correction by tracing along the bounding boxes of the blobs. The page layout analysis provides text regions of roughly uniform text size. Drop caps are removed by a simple percentile filter, and small blobs such as punctuation and noise are removed by thresholding at a fraction of the median height. The characteristics of the font, namely x-height, ascender height, fixed pitch, baseline, median line and descender height, are calculated from the page layout analysis. The filtered blobs are sorted by x-coordinate so that they are uniquely associated with a text line, and the slope is tracked across the page using the baseline. Once the filtered blobs are assigned to their respective text lines, the filtered-out blobs are fitted back into the appropriate lines.
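A minimal sketch of the 8-way flood fill used for connected components follows; this is our own illustration of the technique, not Tesseract's code:

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Label connected components of a binarized image (1 = foreground ink)
    // using an iterative 8-way flood fill; returns a same-sized label map
    // where 0 means background and components are numbered from 1.
    std::vector<int> labelComponents(const std::vector<uint8_t>& bin,
                                     int width, int height) {
        std::vector<int> labels(bin.size(), 0);
        int next = 0;
        std::vector<std::pair<int, int>> stack;
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                if (bin[y * width + x] != 1 || labels[y * width + x] != 0) continue;
                ++next;
                stack.push_back({x, y});
                while (!stack.empty()) {
                    auto [cx, cy] = stack.back();
                    stack.pop_back();
                    if (cx < 0 || cy < 0 || cx >= width || cy >= height) continue;
                    int idx = cy * width + cx;
                    if (bin[idx] != 1 || labels[idx] != 0) continue;
                    labels[idx] = next;
                    for (int dy = -1; dy <= 1; ++dy)      // all 8 neighbours
                        for (int dx = -1; dx <= 1; ++dx)
                            if (dx != 0 || dy != 0)
                                stack.push_back({cx + dx, cy + dy});
                }
            }
        }
        return labels;
    }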
The in-house system uses a Hidden Markov Model to model the characteristic profile of each Telugu line: white pixels, followed by black pixels with low density, black pixels with high density, and again black pixels with low density. This pattern is fed to a 4-state HMM with the density of black pixels as the feature; the SVM Torch API has been used for the HMM implementation. In vertical segmentation, since the gap between neighbouring words is significantly larger than the gaps within a word, each segmented line is scanned for regions of significant white pixel density, and words are segmented there [3].
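The density feature itself is straightforward to compute. A small sketch (our own illustration, not the HMM) of the per-row ink-density profile the model consumes:

    #include <cstdint>
    #include <vector>

    // Fraction of black pixels in each row of a binarized image
    // (1 = black). The in-house HMM consumes this density sequence;
    // simple thresholding on it already locates candidate text lines.
    std::vector<double> rowInkDensity(const std::vector<uint8_t>& bin,
                                      int width, int height) {
        std::vector<double> density(height, 0.0);
        for (int y = 0; y < height; ++y) {
            int black = 0;
            for (int x = 0; x < width; ++x) black += bin[y * width + x];
            density[y] = static_cast<double>(black) / width;
        }
        return density;
    }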
2.1.2 Feature Extraction
In the training process, a set of images consisting of characters is gathered. For frequent characters at least 30 samples per character are to be provided, for moderately frequent characters 20-30 samples, and for less frequent ones 5-10 samples. For each character, a bounding box must be created manually, with its coordinates and the corresponding character code. The vowel modifiers can also be trained separately, with bounding boxes defined only for them. In Tesseract, the features extracted from the training data are the segments of a polygonal approximation. For recognition, i.e., for unknown characters, features of a small, fixed length are extracted from the outline and matched many-to-one against the clustered prototype features of the training data. The prototype features are 4-dimensional (x, y position, angle, length), with 10-20 features per prototype configuration. The features of the unknown are 3-dimensional (x, y position, angle), with approximately 50-100 features per character. The normalized features of the unknown are computed by tracing around the outline of the blob unit by unit; the x, y position and the angle between the tangent line and a vector pointing eastward from the centre of the blob are saved as the feature.
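For illustration, the two feature shapes described above can be laid out as follows; the struct names are ours, not Tesseract's:

    // Prototype features from training carry a segment length;
    // features extracted from an unknown blob do not.
    struct ProtoFeature {    // 4-D: clustered training prototypes
        float x, y;          // normalized position on the outline
        float angle;         // direction of the polygonal segment
        float length;        // length of the polygonal segment
    };

    struct UnknownFeature {  // 3-D: extracted densely from an unknown outline
        float x, y;          // normalized position on the outline
        float angle;         // angle between the tangent line and the
                             // eastward vector from the blob centre
    };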
In the in-house system, wavelet features are extracted by downsampling the character image into an approximation image and three detail images: vertical, horizontal and diagonal. The downsampling captures the directional features of the image. The Battle-Lemarié filter is used as the wavelet basis function to capture the circular property of Telugu characters. Other features were also experimented with, namely Gabor features, skeleton features and circular zonal features, out of which the wavelet features were found to give the best results.
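The shape of this decomposition can be sketched with one level of a 2-D Haar transform; this is illustrative only, since the in-house system uses a Battle-Lemarié basis rather than Haar, but the approximation-plus-three-details output is the same:

    #include <vector>

    // One level of a 2-D Haar transform on an even-sized grayscale image,
    // producing the approximation plus horizontal/vertical/diagonal
    // detail images, each of size (h/2) x (w/2).
    struct WaveletLevel {
        std::vector<float> approx, horiz, vert, diag;
    };

    WaveletLevel haarLevel(const std::vector<float>& img, int w, int h) {
        WaveletLevel out;
        int hw = w / 2, hh = h / 2;
        out.approx.resize(hw * hh); out.horiz.resize(hw * hh);
        out.vert.resize(hw * hh);   out.diag.resize(hw * hh);
        for (int y = 0; y < hh; ++y) {
            for (int x = 0; x < hw; ++x) {
                float a = img[(2 * y) * w + 2 * x];        // 2x2 block
                float b = img[(2 * y) * w + 2 * x + 1];
                float c = img[(2 * y + 1) * w + 2 * x];
                float d = img[(2 * y + 1) * w + 2 * x + 1];
                int i = y * hw + x;
                out.approx[i] = (a + b + c + d) / 4.0f;    // low-low
                out.horiz[i]  = (a + b - c - d) / 4.0f;    // horizontal edges
                out.vert[i]   = (a - b + c - d) / 4.0f;    // vertical edges
                out.diag[i]   = (a - b - c + d) / 4.0f;    // diagonal edges
            }
        }
        return out;
    }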
2.1.3 Classification
In Tesseract, classification proceeds in two steps. In the first step, each feature of the unknown is matched against a 3-D look-up table and fetches a bit vector of all the classes it might match. These bit vectors are summed over all features, and the classes with the highest counts are shortlisted for the next step; this step is called class pruning. In the second step, each feature of the unknown looks up the bit vector of the prototypes of a given class that it matches. Each prototype character class is a logical sum-of-products expression, where each term is called a configuration. The distance is evaluated, and the total similarity evidence of each feature in each configuration, as well as for each prototype, is recorded. The best combined distance, calculated from the summed feature and prototype evidence, is the best over all stored configurations of the class. The data structure used is a k-d tree (k-dimensional tree) for each character; nearest neighbour classification is employed, and the best choice is returned during the traversal of the tree. Each character classification is associated with two numbers:
- confidence: minus the normalized distance from the prototype;
- rating: the normalized distance multiplied by the total outline length of the character.
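A sketch of the class-pruning step follows: every feature of the unknown votes, through a precomputed table, for all the classes it could belong to, and the classes with the highest vote counts survive to detailed matching. The table and feature types here are our simplification, not Tesseract's actual data structures:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // featureBuckets: quantized 3-D look-up indices, one per feature.
    // table[bucket][c] is a 0/1 "might match class c" bit.
    // Returns the `keep` classes with the highest summed votes.
    std::vector<int> pruneClasses(const std::vector<int>& featureBuckets,
                                  const std::vector<std::vector<uint8_t>>& table,
                                  int numClasses, int keep) {
        std::vector<int> votes(numClasses, 0);
        for (int bucket : featureBuckets)         // one lookup per feature
            for (int c = 0; c < numClasses; ++c)
                votes[c] += table[bucket][c];     // sum the bit vectors
        std::vector<int> classes(numClasses);
        for (int c = 0; c < numClasses; ++c) classes[c] = c;
        std::partial_sort(classes.begin(), classes.begin() + keep, classes.end(),
                          [&](int a, int b) { return votes[a] > votes[b]; });
        classes.resize(keep);                     // the shortlist
        return classes;
    }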
In the in-house system, the feature vectors generated are passed on to the classification stage. The Unicode code points of the classified letters are generated individually for the top, middle and bottom parts of each character, and are then combined to generate the compound character.
Chapter 3
Word Recognition
The linguistic module (the permuter) uses the language files freq-dawg, word-dawg, user-words, unicharset and DangAmbigs. A DAWG (Directed Acyclic Word Graph) file is a data structure that facilitates fast search for words. The dictionary is stored in the word-dawg file, which cannot be changed by the user; the user can instead modify the user-words file to personalize the word recognition module. The start node corresponds to the first character of the word currently being searched. These files are not defined for the Telugu language, resulting in much lower accuracies than for languages such as English or Chinese. These files help recover the correct word even when a few characters in it are wrongly classified.
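For illustration, lookup in such a structure walks one edge per character from the start node. The sketch below uses a plain trie standing in for the DAWG (a DAWG additionally shares common suffixes, but lookup proceeds the same way); it is our own sketch, not Tesseract's dawg implementation:

    #include <map>
    #include <string>

    // Trie node: edges keyed by full code points, since Telugu
    // characters do not fit in a single byte.
    struct Node {
        bool wordEnd = false;
        std::map<char32_t, Node*> edges;
    };

    // Walk from the start node along one edge per character; the word
    // is in the dictionary only if the walk ends on a word-end node.
    bool contains(const Node* start, const std::u32string& word) {
        const Node* n = start;
        for (char32_t ch : word) {
            auto it = n->edges.find(ch);
            if (it == n->edges.end()) return false;  // no edge: not a word
            n = it->second;
        }
        return n->wordEnd;
    }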
The permuter chooses the best word string from each of the following categories: top frequency word, top dictionary word, top numeric word, top upper case word, top lower case word, and top classifier choice. The final decision for a given segmentation is the word with the lowest distance rating. The classifier stores the top best choices for each character during the character recognition process.
Words from different segmentations may have different numbers of characters in them. It is hard to compare such words directly, even when a classifier claims to be producing probabilities, which Tesseract does not. This problem is solved in Tesseract by generating two numbers for each character classification. The first, called the confidence, is minus the normalized distance from the prototype. This makes it a confidence in the sense that greater numbers are better, while it remains a distance, since the farther from zero, the greater the distance. The second output, called the rating, multiplies the normalized distance from the prototype by the total outline length in the unknown character. Ratings for characters within a word can be summed meaningfully, since the total outline length over all characters within a word is always the same. The word confidence is calculated as:
    WERD_CHOICE* choice = word->best_choice;
    // certainty() is negative, so word confidences land at or below 100
    int w_conf = 100 + 5 * choice->certainty();
Chapter 4
Experiments
We have added a word data file and a frequent-word data file for Telugu, as these were not officially available in the current version of Tesseract. The word data file is a dictionary that lists the words in alphabetical order; the frequent-word list contains the most frequently occurring words in Telugu.
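As an illustration of how such a frequent-word list can be produced, the sketch below counts word frequencies in a whitespace-tokenized corpus and writes the top entries; the file names and the cutoff are placeholders, not the files we actually used:

    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <map>
    #include <string>
    #include <vector>

    // Count word occurrences in a corpus, then emit the topN most
    // frequent words, one per line, for the frequent-word list.
    int main() {
        std::ifstream corpus("telugu_corpus.txt");    // placeholder name
        std::map<std::string, long> counts;           // UTF-8 keys compare fine
        std::string w;
        while (corpus >> w) ++counts[w];

        std::vector<std::pair<std::string, long>> sorted(counts.begin(),
                                                         counts.end());
        std::sort(sorted.begin(), sorted.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });

        const std::size_t topN = 1000;                // the tuning knob studied here
        std::ofstream out("tel.freq-words.txt");      // placeholder name
        for (std::size_t i = 0; i < sorted.size() && i < topN; ++i)
            out << sorted[i].first << "\n";
        return 0;
    }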
We observed that the addition of the word data files affected the character accuracies, because of the chopping and association done in the word recognition module, as mentioned in the earlier section. Wrongly classified characters are replaced by the correct characters, as per the words in the dictionary, if the correct replacement character is available among the top best choices for that character. If it is not available, the character is replaced by the character most likely to appear in a dictionary word during the traversal from the preceding letter. Some characters that were previously recognized correctly are now wrongly classified: if the word confidence is not satisfactory, the chopping and the subsequent A* graph search during association look for the next best character replacement that can also be found in the dictionary file during the traversal of the DAWG. We experimented with dictionaries that contain only colloquial words, only historical words, and a unified dictionary.
We also populated the frequent-word list with the most frequent words, and experimented with the number of words the list should have without compromising the accuracies. Words in the frequent-word list are weighted higher when Tesseract looks for likely matches, so if there are too many of them, less common variants are likely to be switched to the most common ones. We also experimented with a dictionary consisting of only a limited set of historical words (300) tightly bound to the text in the toy corpus, then with a dictionary consisting of only colloquial words (3M), and then with historical words unrelated to the toy corpus. We observed that many characters are confused with others. We also observed that, since the vowel modifiers were included in the training, the output contained many characters broken into a consonant and a vowel modifier: even though the characters were recognized correctly, the output was mostly filled with vowel modifiers separated from their consonants.
Chapter 5
Results
We implemented Myers' algorithm [2] in Python to find the differences between the actual text of the image and the output text produced by the OCR system. The output is returned as a list in which every inserted character is preceded by a '+1', every deleted character by a '-1', and every character that remains the same by a '0'. The accuracy is calculated as
\[ A = \frac{r}{c} \]

where $r$ is the number of characters that are correctly recognized and $c$ is the number of characters in the actual text.
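A minimal sketch of this accuracy computation over the diff output described above (our own illustration; the report's implementation was in Python, while this sketch uses the same C++ as the other examples):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Diff entries pair an op with a character: +1 = inserted by the OCR,
    // -1 = deleted (present only in the ground truth), 0 = unchanged.
    // Accuracy A = r / c: matches over total ground-truth characters.
    double accuracy(const std::vector<std::pair<int, char32_t>>& diff) {
        std::size_t r = 0, c = 0;
        for (const auto& entry : diff) {
            if (entry.first == 0)  { ++r; ++c; }  // correctly recognized
            if (entry.first == -1) ++c;           // missed ground-truth char
            // entry.first == +1: spurious OCR output, not counted in c
        }
        return c ? static_cast<double>(r) / c : 0.0;
    }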
The character- and word-level accuracies of the Tesseract system, without and with our integrated binarization module and Telugu-specific dictionary files, are tabulated in Table 5.1. We can see a significant increase in the accuracy levels. The addition of the dictionary files also affected the character accuracies, because of the chopping and association of characters performed to obtain the most probable word in the dictionary.
We experimented on the recent documents with the changed dictionary files; the results are tabulated in Table 5.2. We also observed that many characters are confused with others; the top 5 most-confused characters are tabulated in Figure 5.1. In the tables below, each cell reports the character accuracy followed by the word accuracy.
Table 5.1: Comparison of accuracies for historical documents (character / word)

Filename                    Tesseract        With binarization and dictionary files
fwdpayasam/scan0002.jpg     58.384% / 15%    65.75% / 20%
fwdpayasam/scan0003.jpg     66.54% / 20%     72.83% / 23%
oct01/govu1/scan0006.jpg    68.95% / 34%     78.92% / 38%
oct01/govu1/scan0007.jpg    77.03% / 30%     79.57% / 33%
oct01/govu1/scan0008.jpg    77.38% / 32%     79.54% / 34%
oct01/ks4/scan0102.jpg      71.36% / 10%     74.07% / 23%
oct01/govu5/scan0082.jpg    38.7% / 19%      78.12% / 29%
oct01/govu5/scan0083.jpg    71.28% / 22%     77.91% / 28%
Table 5.2: Comparison of accuracies for the modern documents (character / word)

Filename    Tesseract       With binarization and dictionary files
Doc01       37.29% / 16%    74.63% / 24%
Doc02       17.7% / 21%     35.83% / 26%
Doc03       55.4% / 4%      56.19% / 9%
Doc04       56.34% / 3%     57.62% / 8%
Doc05       62.4% / 11%     64.06% / 20%
Doc06       65.26% / 11%    66.05% / 15%
Doc07       67.48% / 11%    68.19% / 18%
Chapter 6
Conclusion and Future Work
We have observed that the dictionary can never be completely populated with all possible words because of the complex morphology of the language. The union of two words results in a form where the last character of the first word and the first character of the second word combine into another character while the rest of the characters remain the same. Sandhis and samasas thus produce countless combinations of words, an effectively infinite word list that cannot be enumerated in a dictionary. This problem can be overcome by using a sandhi splitter, which breaks a word into its root words; these root words can then easily be found in the dictionary, making the recognition of words much simpler.
The quality of the image has a large impact on the later stages of character recognition. We observed that during binarization, characters are broken where the strokes are thin and the background is faded. Moreover, when characters printed on the reverse side of a page show through a light background, they are picked up as part of the front page, leading to a garbled image. This can be overcome by focusing more on the digital processing of the image.
Training all the character variants is highly difficult due to the huge number of possible character combinations, and their nuances add to the complexity. One possible solution is the brute-force method of training all variants of the characters with a large number of samples per variant, instead of training the vowel modifiers alone. The other approach is to post-process the output to find the disconnected character variants and join them.
Bibliography
[2] Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, pages 251-266, 1986.
[3] Nallapareddy Priyanka, Srikanta Pal, and Ranju Manda. Line and word segmentation approach for printed documents. IJCA, Special Issue on RTIPPR, pages 30-36, 2010.
[5] Rajesh Arja and J. Saketha Nath. OCR for historical Telugu documents. MTP Stage 1, 2011.
[6] Rajesh Arja and J. Saketha Nath. OCR for historical Telugu documents. MTP Stage 2, 2012.
Appendix
The following section provides the results of the experiments with a dictionary consisting of only a limited set of historical words (300) tightly bound to the text in the toy corpus; the next column with a dictionary consisting of only colloquial words; and the next column with colloquial words together with historical words unrelated to the toy corpus.