Integration of Telugu Dictionary Into Tesseract OCR
I would sincerely like to thank my guide, Prof. J. Saketh Nath, for his motivating support throughout the semester and for the consistent direction he has given my work. I would also like to thank Prof. Ganesh Ramakrishnan for valuable guidance and discussions.
Contents

1 Introduction
  1.1 Motivation
2 Stage 1 Summary
  2.1 Comparison
    2.1.1 Pre-processing
    2.1.2 Feature Extraction
    2.1.3 Classification
  2.2 Adaptive Binarization
  2.3 Proposed improvement
3 Word Recognition
  3.1 Chopping joined characters
  3.2 Associating chopped characters
4 Experiments
5 Results
6 Conclusion and Future Work
Abstract
Applications of OCR in digitizing printed and hand-written documents have found their way into preserving historical documents. OCR for Telugu is particularly difficult because of the complexities of the script. The existing OCR systems for Telugu are Tesseract and Drishti. An in-house OCR system has been developed in an attempt to overcome the difficulties in recognizing Telugu characters. We compare the performance of Tesseract and the in-house system on a toy corpus, using an implementation of Myers' algorithm to compute their accuracies. We also give a detailed insight into the Tesseract system and its novel approaches, such as line finding, page layout analysis, feature extraction and adaptive classification, along with an overview of the in-house system. To improve the word accuracies of Tesseract, the dictionary has been populated with words that are likely to appear in historical documents, along with their frequencies. Words that are currently in use, i.e. colloquial words, have also been added to the dictionary to make the OCR robust to the type of document given as input. We have also observed the effect of sandhis on the word recognition module.
Chapter 1
Introduction
Figure 1.1: Alphabets in Telugu script
1.1 Motivation
The existing OCR systems for Telugu, Tesseract [7] and Drishti [1], place constraints on the input images, such as quality, resolution, font, and input/output formats. In spite of these constraints, the accuracy levels achieved by these systems are not satisfactory, the best of them being Tesseract. Tesseract has been trained only on the Pothana font, which makes it unsuitable as a font-independent system. Drishti showed better results only when the scans were laser-quality prints. An in-house system has been developed that helps in overcoming these constraints [5], [6].

Figure 1.4: Compound characters and complexities
Chapter 2
Stage 1 Summary
In Stage 1, we studied the implementation of the Tesseract software in detail and contrasted it against the in-house system. Each OCR system has primarily three stages: pre-processing, feature extraction, and classification. In the following sections we give a brief insight into the differences between the two OCR systems at each stage.
2.1 Comparison
In Tesseract, the image is stored in a raw format in the IMAGE data structure. The Leptonica library uses the Pix data structure to refer to this raw image, and all image processing operations are done on the Pix structure rather than on the raw data directly. The image is scanned row-wise and the RGB information is stored in a color map. The width, height, depth, the raster data, and a reference count of the pointers referring to the particular Pix object are stored.

In the in-house system, the Java imaging API stores the input as a BufferedImage, which is an Image with an accessible data buffer; it is therefore more efficient to work directly with a BufferedImage. A BufferedImage has a ColorModel and a Raster of image data. The ColorModel provides a color interpretation of the image's pixel data. The Raster performs the following functions:
- represents the rectangular coordinates of the image;
- maintains the image data in memory;
- provides a mechanism for creating multiple subimages from a single image data buffer;
- provides methods for accessing specific pixels within the image.
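As a minimal sketch of the Pix side, the following snippet (Leptonica is a C library, callable from C++) reads an image and queries the stored attributes mentioned above; the file name is a placeholder:

    #include <allheaders.h>  // Leptonica

    int main() {
        // Load a scanned page into a Pix; Leptonica owns the raster data.
        PIX* pix = pixRead("scan0002.jpg");   // placeholder file name
        if (!pix) return 1;

        l_int32 width  = pixGetWidth(pix);    // stored image attributes
        l_int32 height = pixGetHeight(pix);
        l_int32 depth  = pixGetDepth(pix);    // bits per pixel
        l_uint32* data = pixGetData(pix);     // raw raster data
        (void)width; (void)height; (void)depth; (void)data;

        pixDestroy(&pix);                     // decrements the reference count
        return 0;
    }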
2.1.1 Pre-processing
In pre-processing, the first step is image binarization. The image is binarized so that the text is in black, representing the foreground, and the rest is in white, representing the background. Tesseract uses a global binarization technique, Otsu's thresholding method, whereas the in-house system uses a MaxVariance threshold (Java Imaging API) for global binarization.
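To make the idea concrete, here is a minimal, self-contained sketch of Otsu's method on an 8-bit grayscale buffer. It illustrates the technique only and is not Tesseract's internal implementation:

    #include <array>
    #include <cstdint>
    #include <vector>

    // Global Otsu threshold for an 8-bit grayscale image: choose the
    // level t that maximizes the between-class variance of the
    // foreground and background pixel populations.
    int otsuThreshold(const std::vector<uint8_t>& pixels) {
        std::array<long, 256> hist{};
        for (uint8_t p : pixels) ++hist[p];

        const long total = static_cast<long>(pixels.size());
        double sumAll = 0.0;
        for (int i = 0; i < 256; ++i) sumAll += static_cast<double>(i) * hist[i];

        double sumBg = 0.0, bestVar = -1.0;
        long wBg = 0;
        int bestT = 0;
        for (int t = 0; t < 256; ++t) {
            wBg += hist[t];                    // background weight
            if (wBg == 0) continue;
            long wFg = total - wBg;            // foreground weight
            if (wFg == 0) break;
            sumBg += static_cast<double>(t) * hist[t];
            double meanBg = sumBg / wBg;
            double meanFg = (sumAll - sumBg) / wFg;
            double varBetween = static_cast<double>(wBg) * wFg
                                * (meanBg - meanFg) * (meanBg - meanFg);
            if (varBetween > bestVar) { bestVar = varBetween; bestT = t; }
        }
        return bestT;  // pixels <= bestT are treated as dark foreground text
    }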
The next step is character and word segmentation, in which connected components are found with an 8-way flood fill algorithm. In Tesseract, automatic page segmentation is first performed to mask the photographs in the image, and the page is divided into multiple blocks if there are multiple columns. For each block, outlines are extracted from the connected components through edge detection; these are then gathered and nested into blobs, which give the average character spacing. The words are then split using the character spacing. Tesseract also performs skew detection and correction by tracing along the bounding boxes of the blobs. The page layout analysis provides text regions of roughly uniform text size. Drop caps are removed by a simple percentile filter, and small blobs such as punctuation and noise are removed by thresholding at a fraction of the median height. The characteristics of the font, namely x-height, ascender height, fixed pitch, baseline, median line and descender height, are calculated from the page layout analysis. The filtered blobs are sorted by x-coordinate so that they are uniquely associated with a text line, and the slope is tracked across the page using the baseline. Once the filtered blobs are assigned to their respective text lines, the filtered-out blobs are fitted back into the appropriate lines.
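A minimal sketch of the 8-way flood fill used for connected components follows; this is our own illustration of the technique, not Tesseract's code:

    #include <cstdint>
    #include <utility>
    #include <vector>

    // Label connected components of a binarized image (1 = foreground ink)
    // using an iterative 8-way flood fill; returns a same-sized label map
    // where 0 means background and components are numbered from 1.
    std::vector<int> labelComponents(const std::vector<uint8_t>& bin,
                                     int width, int height) {
        std::vector<int> labels(bin.size(), 0);
        int next = 0;
        std::vector<std::pair<int, int>> stack;
        for (int y = 0; y < height; ++y) {
            for (int x = 0; x < width; ++x) {
                if (bin[y * width + x] != 1 || labels[y * width + x] != 0) continue;
                ++next;
                stack.push_back({x, y});
                while (!stack.empty()) {
                    auto [cx, cy] = stack.back();
                    stack.pop_back();
                    if (cx < 0 || cy < 0 || cx >= width || cy >= height) continue;
                    int idx = cy * width + cx;
                    if (bin[idx] != 1 || labels[idx] != 0) continue;
                    labels[idx] = next;
                    for (int dy = -1; dy <= 1; ++dy)      // all 8 neighbours
                        for (int dx = -1; dx <= 1; ++dx)
                            if (dx != 0 || dy != 0)
                                stack.push_back({cx + dx, cy + dy});
                }
            }
        }
        return labels;
    }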
The in-house system uses a Hidden Markov Model to model the characteristic profile of each Telugu line: white pixels, followed by black pixels with low density, black pixels with high density, and again black pixels with low density. This pattern is fed to a 4-state HMM with the density of black pixels as the feature; the SVM Torch API has been used for the HMM implementation. In vertical segmentation, since the gap between neighbouring words is significantly larger than the gaps within a word, each segmented line is scanned for regions of significant white pixel density, and words are segmented there [3].
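The density feature itself is straightforward to compute. A small sketch (our own illustration, not the HMM) of the per-row ink-density profile the model consumes:

    #include <cstdint>
    #include <vector>

    // Fraction of black pixels in each row of a binarized image
    // (1 = black). The in-house HMM consumes this density sequence;
    // simple thresholding on it already locates candidate text lines.
    std::vector<double> rowInkDensity(const std::vector<uint8_t>& bin,
                                      int width, int height) {
        std::vector<double> density(height, 0.0);
        for (int y = 0; y < height; ++y) {
            int black = 0;
            for (int x = 0; x < width; ++x) black += bin[y * width + x];
            density[y] = static_cast<double>(black) / width;
        }
        return density;
    }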
2.1.2 Feature Extraction
In the training process, a set of images consisting of characters is gathered. For frequent characters at least 30 samples per character are to be provided, for moderately frequent characters 20-30 samples, and for less frequent ones 5-10 samples. For each character, a bounding box must be created manually, with its coordinates and the corresponding character code. The vowel modifiers can also be trained separately, with bounding boxes defined only for them. In Tesseract, the features extracted from the training data are the segments of a polygonal approximation. For recognition, i.e., for unknown characters, features of a small, fixed length are extracted from the outline and matched many-to-one against the clustered prototype features of the training data. The prototype features are 4-dimensional (x, y position, angle, length), with 10-20 features per prototype configuration. The features of the unknown are 3-dimensional (x, y position, angle), with approximately 50-100 features per character. The normalized features of the unknown are computed by tracing around the outline of the blob unit by unit; the x, y position and the angle between the tangent line and a vector pointing eastward from the centre of the blob are saved as the feature.
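For illustration, the two feature shapes described above can be laid out as follows; the struct names are ours, not Tesseract's:

    // Prototype features from training carry a segment length;
    // features extracted from an unknown blob do not.
    struct ProtoFeature {    // 4-D: clustered training prototypes
        float x, y;          // normalized position on the outline
        float angle;         // direction of the polygonal segment
        float length;        // length of the polygonal segment
    };

    struct UnknownFeature {  // 3-D: extracted densely from an unknown outline
        float x, y;          // normalized position on the outline
        float angle;         // angle between the tangent line and the
                             // eastward vector from the blob centre
    };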
In the in-house system, wavelet features are extracted by downsampling the character image into an approximation image and three detail images: vertical, horizontal and diagonal. The downsampling captures the directional features of the image. The Battle-Lemarié filter is used as the wavelet basis function to capture the circular property of Telugu characters. Other features were also experimented with, namely Gabor features, skeleton features and circular zonal features, out of which the wavelet features were found to give the best results.
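The shape of this decomposition can be sketched with one level of a 2-D Haar transform; this is illustrative only, since the in-house system uses a Battle-Lemarié basis rather than Haar, but the approximation-plus-three-details output is the same:

    #include <vector>

    // One level of a 2-D Haar transform on an even-sized grayscale image,
    // producing the approximation plus horizontal/vertical/diagonal
    // detail images, each of size (h/2) x (w/2).
    struct WaveletLevel {
        std::vector<float> approx, horiz, vert, diag;
    };

    WaveletLevel haarLevel(const std::vector<float>& img, int w, int h) {
        WaveletLevel out;
        int hw = w / 2, hh = h / 2;
        out.approx.resize(hw * hh); out.horiz.resize(hw * hh);
        out.vert.resize(hw * hh);   out.diag.resize(hw * hh);
        for (int y = 0; y < hh; ++y) {
            for (int x = 0; x < hw; ++x) {
                float a = img[(2 * y) * w + 2 * x];        // 2x2 block
                float b = img[(2 * y) * w + 2 * x + 1];
                float c = img[(2 * y + 1) * w + 2 * x];
                float d = img[(2 * y + 1) * w + 2 * x + 1];
                int i = y * hw + x;
                out.approx[i] = (a + b + c + d) / 4.0f;    // low-low
                out.horiz[i]  = (a + b - c - d) / 4.0f;    // horizontal edges
                out.vert[i]   = (a - b + c - d) / 4.0f;    // vertical edges
                out.diag[i]   = (a - b - c + d) / 4.0f;    // diagonal edges
            }
        }
        return out;
    }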
2.1.3 Classification
In Tesseract, classification proceeds in two steps. In the first step, each feature of the unknown is matched against a 3-D look-up table and fetches a bit vector of all the classes it might match. These bit vectors are summed over all features, and the classes with the highest counts are shortlisted for the next step; this step is called class pruning. In the second step, each feature of the unknown looks up the bit vector of the prototypes of a given class that it matches. Each prototype character class is a logical sum-of-products expression, where each term is called a configuration. The distance is evaluated, and the total similarity evidence of each feature in each configuration, as well as for each prototype, is recorded. The best combined distance, calculated from the summed feature and prototype evidence, is the best over all stored configurations of the class. The data structure used is a k-d tree (k-dimensional tree) for each character; nearest neighbour classification is employed, and the best choice is returned during the traversal of the tree. Each character classification is associated with two numbers:
- confidence: minus the normalized distance from the prototype;
- rating: the normalized distance multiplied by the total outline length of the character.
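A sketch of the class-pruning step follows: every feature of the unknown votes, through a precomputed table, for all the classes it could belong to, and the classes with the highest vote counts survive to detailed matching. The table and feature types here are our simplification, not Tesseract's actual data structures:

    #include <algorithm>
    #include <cstdint>
    #include <vector>

    // featureBuckets: quantized 3-D look-up indices, one per feature.
    // table[bucket][c] is a 0/1 "might match class c" bit.
    // Returns the `keep` classes with the highest summed votes.
    std::vector<int> pruneClasses(const std::vector<int>& featureBuckets,
                                  const std::vector<std::vector<uint8_t>>& table,
                                  int numClasses, int keep) {
        std::vector<int> votes(numClasses, 0);
        for (int bucket : featureBuckets)         // one lookup per feature
            for (int c = 0; c < numClasses; ++c)
                votes[c] += table[bucket][c];     // sum the bit vectors
        std::vector<int> classes(numClasses);
        for (int c = 0; c < numClasses; ++c) classes[c] = c;
        std::partial_sort(classes.begin(), classes.begin() + keep, classes.end(),
                          [&](int a, int b) { return votes[a] > votes[b]; });
        classes.resize(keep);                     // the shortlist
        return classes;
    }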
In the in-house system, the feature vectors generated are passed on to the classification stage. The Unicode code points of the classified letters are generated individually for the top, middle and bottom parts of each character, and are then combined to generate the compound character.
Chapter 3
Word Recognition
The linguistic module (the permuter) uses the language files freq-dawg, word-dawg, user-words, unicharset and DangAmbigs. A DAWG (Directed Acyclic Word Graph) file is a data structure that facilitates fast search for words. The dictionary is stored in the word-dawg file, which cannot be changed by the user; the user can instead modify the user-words file to personalize the word recognition module. The start node corresponds to the first character of the word currently being searched. These files are not defined for the Telugu language, resulting in much lower accuracies than for languages such as English or Chinese. These files help recover the correct word even when a few characters in it are wrongly classified.
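For illustration, lookup in such a structure walks one edge per character from the start node. The sketch below uses a plain trie standing in for the DAWG (a DAWG additionally shares common suffixes, but lookup proceeds the same way); it is our own sketch, not Tesseract's dawg implementation:

    #include <map>
    #include <string>

    // Trie node: edges keyed by full code points, since Telugu
    // characters do not fit in a single byte.
    struct Node {
        bool wordEnd = false;
        std::map<char32_t, Node*> edges;
    };

    // Walk from the start node along one edge per character; the word
    // is in the dictionary only if the walk ends on a word-end node.
    bool contains(const Node* start, const std::u32string& word) {
        const Node* n = start;
        for (char32_t ch : word) {
            auto it = n->edges.find(ch);
            if (it == n->edges.end()) return false;  // no edge: not a word
            n = it->second;
        }
        return n->wordEnd;
    }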
The permuter chooses the best word string from each of the following categories: top frequency word, top dictionary word, top numeric word, top upper case word, top lower case word, and top classifier choice. The final decision for a given segmentation is the word with the lowest distance rating. The classifier stores the top best choices for each character during the character recognition process.
Words from different segmentations may have different numbers of characters in them. It is hard to compare such words directly, even when a classifier claims to be producing probabilities, which Tesseract does not. This problem is solved in Tesseract by generating two numbers for each character classification. The first, called the confidence, is minus the normalized distance from the prototype. This makes it a confidence in the sense that greater numbers are better, while it remains a distance, since the farther from zero, the greater the distance. The second output, called the rating, multiplies the normalized distance from the prototype by the total outline length in the unknown character. Ratings for characters within a word can be summed meaningfully, since the total outline length over all characters within a word is always the same. The word confidence is calculated as:
    WERD_CHOICE* choice = word->best_choice;
    // certainty() is negative, so word confidences land at or below 100
    int w_conf = 100 + 5 * choice->certainty();
Chapter 4
Experiments
We have added a word data file and a frequent-word data file for Telugu, as these were not officially available in the current version of Tesseract. The word data file is a dictionary that lists the words in alphabetical order; the frequent-word list contains the most frequently occurring words in Telugu.
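As an illustration of how such a frequent-word list can be produced, the sketch below counts word frequencies in a whitespace-tokenized corpus and writes the top entries; the file names and the cutoff are placeholders, not the files we actually used:

    #include <algorithm>
    #include <cstddef>
    #include <fstream>
    #include <map>
    #include <string>
    #include <vector>

    // Count word occurrences in a corpus, then emit the topN most
    // frequent words, one per line, for the frequent-word list.
    int main() {
        std::ifstream corpus("telugu_corpus.txt");    // placeholder name
        std::map<std::string, long> counts;           // UTF-8 keys compare fine
        std::string w;
        while (corpus >> w) ++counts[w];

        std::vector<std::pair<std::string, long>> sorted(counts.begin(),
                                                         counts.end());
        std::sort(sorted.begin(), sorted.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });

        const std::size_t topN = 1000;                // the tuning knob studied here
        std::ofstream out("tel.freq-words.txt");      // placeholder name
        for (std::size_t i = 0; i < sorted.size() && i < topN; ++i)
            out << sorted[i].first << "\n";
        return 0;
    }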
We observed that the addition of the word data files affected the character accuracies, because of the chopping and association done in the word recognition module, as mentioned in the earlier section. Wrongly classified characters are replaced by the correct characters, as per the words in the dictionary, if the correct replacement character is available among the top best choices for that character. If it is not available, the character is replaced by the character most likely to appear in a dictionary word during the traversal from the preceding letter. Some characters that were previously recognized correctly are now wrongly classified: if the word confidence is not satisfactory, the chopping and the subsequent A* graph search during association look for the next best character replacement that can also be found in the dictionary file during the traversal of the DAWG. We experimented with dictionaries that contain only colloquial words, only historical words, and a unified dictionary.
We also populated the frequent-word list with the most frequent words, and experimented with the number of words the list should have without compromising the accuracies. Words in the frequent-word list are weighted higher when Tesseract looks for likely matches, so if there are too many of them, less common variants are likely to be switched to the most common ones. We also experimented with a dictionary consisting of only a limited set of historical words (300) tightly bound to the text in the toy corpus, then with a dictionary consisting of only colloquial words (3M), and then with historical words unrelated to the toy corpus. We observed that many characters are confused with others. We also observed that, since the vowel modifiers were included in the training, the output contained many characters broken into a consonant and a vowel modifier: even though the characters were recognized correctly, the output was mostly filled with vowel modifiers separated from their consonants.
Chapter 5
Results
We implemented Myers' algorithm [2] in Python to find the differences between the actual text of the image and the output text produced by the OCR system. The output is returned as a list in which every inserted character is preceded by a '+1', every deleted character by a '-1', and every character that remains the same by a '0'. The accuracy is calculated as
\[ A = \frac{r}{c} \]

where $r$ is the number of characters that are correctly recognized and $c$ is the number of characters in the actual text.
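A minimal sketch of this accuracy computation over the diff output described above (our own illustration; the report's implementation was in Python, while this sketch uses the same C++ as the other examples):

    #include <cstddef>
    #include <utility>
    #include <vector>

    // Diff entries pair an op with a character: +1 = inserted by the OCR,
    // -1 = deleted (present only in the ground truth), 0 = unchanged.
    // Accuracy A = r / c: matches over total ground-truth characters.
    double accuracy(const std::vector<std::pair<int, char32_t>>& diff) {
        std::size_t r = 0, c = 0;
        for (const auto& entry : diff) {
            if (entry.first == 0)  { ++r; ++c; }  // correctly recognized
            if (entry.first == -1) ++c;           // missed ground-truth char
            // entry.first == +1: spurious OCR output, not counted in c
        }
        return c ? static_cast<double>(r) / c : 0.0;
    }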
The character- and word-level accuracies of the Tesseract system, without and with our integrated binarization module and Telugu-specific dictionary files, are tabulated in Table 5.1. We can see a significant increase in the accuracy levels. The addition of the dictionary files also affected the character accuracies, because of the chopping and association of characters performed to obtain the most probable word in the dictionary.
We experimented on the recent documents with the changed dictionary files; the results are tabulated in Table 5.2. We also observed that many characters are confused with others; the top 5 most-confused characters are tabulated in Figure 5.1. In the tables below, each cell reports the character accuracy followed by the word accuracy.
Table 5.1: Comparison of accuracies for historical documents (character / word)

Filename                    Tesseract        With binarization and dictionary files
fwdpayasam/scan0002.jpg     58.384% / 15%    65.75% / 20%
fwdpayasam/scan0003.jpg     66.54% / 20%     72.83% / 23%
oct01/govu1/scan0006.jpg    68.95% / 34%     78.92% / 38%
oct01/govu1/scan0007.jpg    77.03% / 30%     79.57% / 33%
oct01/govu1/scan0008.jpg    77.38% / 32%     79.54% / 34%
oct01/ks4/scan0102.jpg      71.36% / 10%     74.07% / 23%
oct01/govu5/scan0082.jpg    38.7% / 19%      78.12% / 29%
oct01/govu5/scan0083.jpg    71.28% / 22%     77.91% / 28%
Table 5.2: Comparison of accuracies for the modern documents (character / word)

Filename    Tesseract       With binarization and dictionary files
Doc01       37.29% / 16%    74.63% / 24%
Doc02       17.7% / 21%     35.83% / 26%
Doc03       55.4% / 4%      56.19% / 9%
Doc04       56.34% / 3%     57.62% / 8%
Doc05       62.4% / 11%     64.06% / 20%
Doc06       65.26% / 11%    66.05% / 15%
Doc07       67.48% / 11%    68.19% / 18%
Chapter 6
Conclusion and Future Work
We have observed that the dictionary can never be completely populated with all possible words because of the complex morphology of the language. The union of two words results in a form where the last character of the first word and the first character of the second word combine into another character while the rest of the characters remain the same. Sandhis and samasas thus produce countless combinations of words, an effectively infinite word list that cannot be enumerated in a dictionary. This problem can be overcome by using a sandhi splitter, which breaks a word into its root words; these root words can then easily be found in the dictionary, making the recognition of words much simpler.
The quality of the image has a large impact on the later stages of character recognition. We observed that during binarization, characters are broken where the strokes are thin and the background is faded. Moreover, when characters printed on the reverse side of a page show through a light background, they are picked up as part of the front page, leading to a garbled image. This can be overcome by focusing more on the digital processing of the image.
Training all the character variants is highly difficult due to the huge number of possible character combinations, and their nuances add to the complexity. One possible solution is the brute-force method of training all variants of the characters with a large number of samples per variant, instead of training the vowel modifiers alone. The other approach is to post-process the output to find the disconnected character variants and join them.
Bibliography
[2] Eugene W. Myers. An O(ND) difference algorithm and its variations. Algorithmica, pages 251-266, 1986.
[3] Nallapareddy Priyanka, Srikanta Pal, and Ranju Manda. Line and word segmentation approach for printed documents. IJCA, Special Issue on RTIPPR, pages 30-36, 2010.
[5] Rajesh Arja and J. Saketha Nath. OCR for historical Telugu documents. MTP Stage 1, 2011.
[6] Rajesh Arja and J. Saketha Nath. OCR for historical Telugu documents. MTP Stage 2, 2012.
Appendix
The following section provides the results of the experiments with a dictionary consisting of only a limited set of historical words (300) tightly bound to the text in the toy corpus; the next column with a dictionary consisting of only colloquial words; and the next column with colloquial words together with historical words unrelated to the toy corpus.