2. Architecture
Since HP had independently developed page layout analysis technology that was used in products (and therefore was not released as open source), Tesseract never needed its own page layout analysis. Tesseract therefore assumes that its input is a binary image with optional polygonal text regions defined.

Processing follows a traditional step-by-step pipeline, but some of the stages were unusual in their day, and possibly remain so even now. The first step is a connected component analysis in which the outlines of the components are stored. This was a computationally expensive design decision at the time, but had a significant advantage: by inspecting the nesting of outlines, and the number of child and grandchild outlines, it is simple to detect inverse text and recognize it as easily as black-on-white text. Tesseract was probably the first OCR engine able to handle white-on-black text so trivially. At this stage, outlines are gathered together, purely by nesting, into Blobs.

Blobs are organized into text lines, and the lines and regions are analyzed for fixed-pitch or proportional text. Text lines are broken into words differently according to the kind of character spacing. Fixed-pitch text is chopped immediately by character cells. Proportional text is broken into words using definite spaces and fuzzy spaces.

Recognition then proceeds as a two-pass process. In the first pass, an attempt is made to recognize each word in turn. Each word that is satisfactory is passed to an adaptive classifier as training data. The adaptive classifier then gets a chance to recognize text lower down the page more accurately. Since the adaptive classifier may have learned something useful too late to make a contribution near the top of the page, a second pass is run over the page, in which words that were not recognized well enough are recognized again. A final phase resolves fuzzy spaces and checks alternative hypotheses for the x-height to locate small-cap text.
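The two-pass control flow described above can be sketched as follows; the classifier objects, the WordResult shape, and the confidence threshold used to decide that a word is "satisfactory" are invented for illustration and are not Tesseract's actual interfaces.

```python
# Minimal sketch of the two-pass recognition loop described above. The
# classifier objects and the "good enough" test are hypothetical stand-ins,
# not Tesseract's real interfaces.
from dataclasses import dataclass

@dataclass
class WordResult:
    text: str
    confidence: float        # higher is better

def recognize_page(words, static_clf, adaptive_clf, good_enough=0.9):
    """words: word images in reading order; *_clf: objects offering
    classify(word) -> WordResult and (for the adaptive one) train()."""
    results = []
    for word in words:                        # pass 1
        result = static_clf.classify(word)
        results.append(result)
        if result.confidence >= good_enough:
            # Satisfactory words become training data for the adaptive
            # classifier, which then helps with text lower down the page.
            adaptive_clf.train(word, result.text)

    for i, word in enumerate(words):          # pass 2
        if results[i].confidence < good_enough:
            # Words that were not recognized well enough are recognized
            # again, now that the adaptive classifier has seen the page.
            retry = adaptive_clf.classify(word)
            if retry.confidence > results[i].confidence:
                results[i] = retry
    return results
```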
Fig. 1 shows an example of a line of text with a fitted baseline, descender line, meanline and ascender line. All these lines are parallel (the y separation is constant over the entire length) and slightly curved. The ascender line is cyan (printed as light gray) and the black line above it is actually straight; close inspection shows that the cyan/gray line is curved relative to it.
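The "parallel but slightly curved" property can be pictured as one shared curved shape plus a constant vertical offset per line; the quadratic shape and the numeric offsets below are purely illustrative assumptions, not values taken from Tesseract.

```python
def make_line(a, b, c, offset):
    """y(x) for a line with the shared curved shape shifted by a constant offset."""
    return lambda x: a * x * x + b * x + c + offset

# Illustrative coefficients for the shared shape and per-line offsets
# (image coordinates, y increasing downwards).
baseline  = make_line(1e-5, 0.01, 120.0, 0.0)
descender = make_line(1e-5, 0.01, 120.0, 8.0)
meanline  = make_line(1e-5, 0.01, 120.0, -20.0)
ascender  = make_line(1e-5, 0.01, 120.0, -30.0)

if __name__ == "__main__":
    for x in (0, 500, 1000):
        # Each line is curved, yet the baseline-to-meanline separation
        # (the x-height) is the same at every x along the text line.
        print(x, baseline(x) - meanline(x))
```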
4. Word Recognition
Part of the recognition process for any character recognition engine is to identify how a word should be segmented into characters. The initial segmentation output from line finding is classified first. The rest of the word recognition step applies only to non-fixed-pitch text.
Fig. 4. Candidate chop points and chop.

Fig. 4 shows a set of candidate chop points with arrows and the selected chop as a line across the outline where the 'r' touches the 'm'. Chops are executed in priority order. Any chop that fails to improve the confidence of the result is undone, but not completely discarded, so that the chop can be re-used later by the associator if needed.
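A rough sketch of that chop loop, with the chop representation, the classify() confidence callback, and the apply/undo operations left as hypothetical stand-ins:

```python
# Sketch of chopping joined characters as described above: candidate chops are
# tried in priority order, a chop that does not improve the classifier's
# confidence is undone, and unused chops are kept for the associator.
import heapq

def chop_word(blobs, candidate_chops, classify, apply_chop, undo_chop):
    """candidate_chops: iterable of (priority, chop), lower priority = better.
    classify(blobs) returns a confidence; apply_chop/undo_chop mutate blobs."""
    heap = [(prio, i, chop) for i, (prio, chop) in enumerate(candidate_chops)]
    heapq.heapify(heap)
    unused_chops = []                      # kept for possible later re-use
    best_conf = classify(blobs)
    while heap:
        _, _, chop = heapq.heappop(heap)   # execute chops in priority order
        apply_chop(blobs, chop)
        new_conf = classify(blobs)
        if new_conf > best_conf:
            best_conf = new_conf           # the chop helped: keep it
        else:
            undo_chop(blobs, chop)         # undo, but do not discard the chop
            unused_chops.append(chop)      # the associator may re-use it later
    return best_conf, unused_chops
```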
When the A* segmentation search was first implemented in about 1989, Tesseract's accuracy on broken characters was well ahead of the commercial engines of the day. Fig. 5 is a typical example. An essential part of that success was the character classifier, which could easily recognize broken characters.
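The following is a generic best-first search over ways of grouping adjacent fragments into characters, in the spirit of an A* segmentation search with a zero heuristic; the state representation and the rate() scoring callback are placeholders rather than Tesseract's segmentation graph.

```python
# Best-first (here, uniform-cost) search for the grouping of fragments into
# characters that minimizes the summed per-character rating.
import heapq
import itertools

def best_segmentation(fragments, rate, max_group=3):
    """Return (total_rating, groups). rate(tuple_of_fragments) -> float,
    where lower ratings are better."""
    n = len(fragments)
    tie = itertools.count()                  # tie-breaker for the heap
    frontier = [(0.0, next(tie), 0, ())]     # (cost, tie, next_index, groups)
    settled = {}
    while frontier:
        cost, _, i, groups = heapq.heappop(frontier)
        if i == n:
            return cost, list(groups)        # first goal popped is optimal
        if settled.get(i, float("inf")) <= cost:
            continue
        settled[i] = cost
        for size in range(1, min(max_group, n - i) + 1):
            group = tuple(fragments[i:i + size])
            heapq.heappush(frontier,
                           (cost + rate(group), next(tie), i + size,
                            groups + (group,)))
    return float("inf"), []

if __name__ == "__main__":
    # Toy usage: the "fragments" are just widths, and groups whose total
    # width is close to 10 are considered better character candidates.
    print(best_segmentation([4, 6, 5, 5, 10], lambda g: abs(sum(g) - 10)))
```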
The features extracted from the unknown are thus 3-dimensional (x, y position, angle), with typically 50-100 features in a character, and the prototype features are 4-dimensional (x, y position, angle, length), with typically 10-20 features in a prototype configuration.
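For concreteness, the two feature types might be represented as below; the field names and the container style are illustrative only.

```python
# Illustrative containers for the feature types described above: the unknown
# is described by position-and-angle elements, while prototype features carry
# an extra length dimension.
from dataclasses import dataclass

@dataclass
class UnknownFeature:       # 3-dimensional: position (x, y) and angle
    x: float
    y: float
    angle: float            # direction of the small outline element

@dataclass
class PrototypeFeature:     # 4-dimensional: position, angle and length
    x: float
    y: float
    angle: float
    length: float           # a prototype feature covers a longer stretch

# An unknown character typically has about 50-100 UnknownFeatures, while a
# prototype configuration typically has only 10-20 PrototypeFeatures.
```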
5.2. Classification
Classification proceeds as a two-step process. In the first step, a class pruner creates a shortlist of character classes that the unknown might match. Each feature fetches, from a coarsely quantized 3-dimensional lookup table, a bit-vector of classes that it might match, and the bit-vectors are summed over all the features. The classes with the highest counts (after correcting for the expected number of features) become the shortlist for the second step. In the second step, each feature of the unknown looks up a bit-vector of prototypes of the given class that it might match, and then the actual similarity between them is computed. Each prototype character class is represented by a logical sum-of-products expression, with each term called a configuration, so the distance calculation keeps a record of the total similarity evidence of each feature in each configuration, as well as of each prototype. The best combined distance, which is calculated from the summed feature and prototype evidences, is the best over all the stored configurations of the class.
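A compact sketch of the class pruner step, where Python sets stand in for the bit-vectors and the quantization step sizes are invented for illustration:

```python
# Class pruner sketch: each feature votes for the classes its quantized cell
# might match, votes are summed over all features, and the top classes
# (corrected for their expected feature count) form the shortlist.
from collections import Counter

def quantize(feature, step=8, angle_step=30):
    """Coarsely quantize an (x, y, angle) feature into a lookup-table cell."""
    x, y, angle = feature
    return (int(x) // step, int(y) // step, int(angle) // angle_step)

def class_pruner(features, lookup_table, expected_features, shortlist_size=10):
    """lookup_table: quantized cell -> set of candidate class labels
    (standing in for a bit-vector); expected_features: class label ->
    typical number of features for that class."""
    counts = Counter()
    for f in features:
        for cls in lookup_table.get(quantize(f), ()):
            counts[cls] += 1
    # Correct for the expected number of features per class before ranking,
    # so classes with many features do not win simply by being big.
    scored = {cls: n / expected_features.get(cls, 1) for cls, n in counts.items()}
    return sorted(scored, key=scored.get, reverse=True)[:shortlist_size]
```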
6. Linguistic Analysis

Tesseract contains relatively little linguistic analysis. Whenever the word recognition module is considering a new segmentation, the linguistic module (mis-named the permuter) chooses the best available word string in each of the following categories: top frequent word, top dictionary word, top numeric word, top UPPER case word, top lower case word (with optional initial upper), and top classifier choice word. The final decision for a given segmentation is simply the word with the lowest total distance rating, where each of the above categories is multiplied by a different constant.

Words from different segmentations may have different numbers of characters in them. It is hard to compare such words directly, even where a classifier claims to be producing probabilities, which Tesseract does not. This problem is solved in Tesseract by generating two numbers for each character classification. The first, called the confidence, is minus the normalized distance from the prototype. This makes it a confidence in the sense that greater numbers are better, while remaining a distance: the farther from zero, the greater the distance. The second output, called the rating, multiplies the normalized distance from the prototype by the total outline length in the unknown character. Ratings for characters within a word can be summed meaningfully, since the total outline length for all characters within a word is always the same.
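These two quantities, and the category constants used in the final decision, might be sketched as follows; the constants and the toy numbers are invented, and only the confidence and rating definitions follow the text above.

```python
# Sketch of the per-character confidence and rating described above, and of
# how candidate word strings from different categories would be compared.

def char_confidence(normalized_distance):
    # Greater is better, but still a distance: farther from zero means worse.
    return -normalized_distance

def char_rating(normalized_distance, outline_length):
    # Ratings of characters within a word can be summed meaningfully because
    # the total outline length over the word is the same for every segmentation.
    return normalized_distance * outline_length

def word_rating(char_distances_and_lengths, category_constant=1.0):
    total = sum(char_rating(d, length) for d, length in char_distances_and_lengths)
    return total * category_constant        # constant depends on the category

if __name__ == "__main__":
    # Toy comparison of two candidate word strings from different categories,
    # e.g. a dictionary word vs. a raw classifier choice (constants invented).
    dictionary_word = word_rating([(0.12, 40), (0.20, 55)], category_constant=0.9)
    classifier_word = word_rating([(0.10, 40), (0.15, 55)], category_constant=1.1)
    print(min(("dictionary", dictionary_word), ("classifier", classifier_word),
              key=lambda t: t[1]))           # lowest total rating wins
```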
7. Adaptive Classifier
It has been suggested [11] and demonstrated [12] that OCR engines can benefit from the use of an adaptive classifier. Since the static classifier has to be good at generalizing to any kind of font, its ability to discriminate between different characters or between characters and non-characters is weakened. A more font-sensitive adaptive classifier that is trained on the output of the static classifier is therefore commonly [13] used to obtain greater discrimination within each document, where the number of fonts is limited. Tesseract does not employ a template classifier, but uses the same features and classifier as the static classifier. The only significant difference between the static classifier and the adaptive classifier, apart from the training data, is that the adaptive classifier uses isotropic baseline/x-height normalization, whereas the static classifier normalizes characters by the centroid (first moments) for position and by the second moments for anisotropic size normalization. The baseline/x-height normalization makes it easier to distinguish upper and lower case characters and improves immunity to noise specks. The main benefit of character moment normalization is removal of font aspect ratio and some degree of font stroke width. It also makes recognition of sub- and superscripts simpler, but requires an additional classifier feature to distinguish some upper and lower case characters. Fig. 7 shows an example of three letters in baseline/x-height normalized form and moment normalized form.

Fig. 7. Baseline and moment normalized letters.
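The contrast between the two normalizations can be sketched on a character given as a list of outline points; the target scales are arbitrary illustrative values, not Tesseract's constants.

```python
# Sketch contrasting the two normalizations described above, on a character
# represented as a list of (x, y) outline points.
import math

def moment_normalize(points, target=1.0):
    """Anisotropic: centre on the centroid (first moments) and scale x and y
    independently by the second moments, removing font aspect ratio."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sx = math.sqrt(sum((x - cx) ** 2 for x, _ in points) / n) or 1.0
    sy = math.sqrt(sum((y - cy) ** 2 for _, y in points) / n) or 1.0
    return [((x - cx) * target / sx, (y - cy) * target / sy) for x, y in points]

def baseline_xheight_normalize(points, baseline_y, xheight, target=1.0):
    """Isotropic: fix the vertical frame from the baseline and x-height and
    apply the same scale to x and y, which preserves case differences."""
    scale = target / xheight
    return [(x * scale, (y - baseline_y) * scale) for x, y in points]
```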
8. Results

Tesseract was included in the 4th UNLV annual test [1] of OCR accuracy, as HP Labs OCR, but the code has changed substantially since then, including conversion to Unicode and retraining. Table 1 compares results from a recent version of Tesseract (shown as 2.0) with the original 1995 results (shown as HP). All four 300 DPI binary test sets that were used in the 1995 test are shown, along with the number of errors (Errs), the percent error rate (%Err), and the percent change relative to the 1995 results (%Chg), for both character errors and non-stopword errors. More up-to-date results are at http://code.google.com/p/tesseract-ocr.

Table 1. Results of current and old Tesseract.

             Character               Word
Ver   Set     Errs   %Err    %Chg    Errs  %Err    %Chg
HP    bus     5959   1.86       -    1293  4.27       -
2.0   bus     6449   2.02    8.22    1295  4.28    0.15
HP    doe    36349   2.48       -    7042  5.13       -
2.0   doe    29921   2.04  -17.68    6791  4.95   -3.56
HP    mag    15043   2.26       -    3379  5.01       -
2.0   mag    14814   2.22   -1.52    3133  4.64   -7.28
HP    news    6432   1.31       -    1502  3.06       -
2.0   news    7935   1.61   23.36    1284  2.62  -14.51
2.0   total  59119      -   -7.31   12503     -   -5.39

10. Acknowledgements

The author would like to thank John Burns and Tom Nartker for their efforts in making Tesseract open source, the ISRI group at UNLV for sharing their tools and data, and Luc Vincent, Igor Krivokon, Dar-Shyang Lee, and Thomas Kielbus for their comments on the content of this paper.

11. References

[1] S.V. Rice, F.R. Jenkins, T.A. Nartker, The Fourth Annual Test of OCR Accuracy, Technical Report 95-03, Information Science Research Institute, University of Nevada, Las Vegas, July 1995.
[2] R.W. Smith, The Extraction and Recognition of Text from Multimedia Document Images, PhD Thesis, University of Bristol, November 1987.
[3] R. Smith, A Simple and Efficient Skew Detection Algorithm via Text Row Accumulation, Proc. of the 3rd Int. Conf. on Document Analysis and Recognition (Vol. 2), IEEE, 1995, pp. 1145-1148.
[4] P.J. Rousseeuw, A.M. Leroy, Robust Regression and Outlier Detection, Wiley-IEEE, 2003.
[5] S.V. Rice, G. Nagy, T.A. Nartker, Optical Character Recognition: An Illustrated Guide to the Frontier, Kluwer Academic Publishers, USA, 1999, pp. 57-60.
[6] P.J. Schneider, An Algorithm for Automatically Fitting Digitized Curves, in A.S. Glassner, Graphics Gems I, Morgan Kaufmann, 1990, pp. 612-626.
[7] R.J. Shillman, Character Recognition Based on Phenomenological Attributes: Theory and Methods, PhD Thesis, Massachusetts Institute of Technology, 1974.
[8] B.A. Blesser, T.T. Kuklinski, R.J. Shillman, Empirical Tests for Feature Selection Based on a Psychological Theory of Character Recognition, Pattern Recognition 8(2), Elsevier, New York, 1976.
[9] M. Bokser, Omnidocument Technologies, Proc. IEEE 80(7), IEEE, USA, Jul 1992, pp. 1066-1078.
[10] H.S. Baird, R. Fossey, A 100-Font Classifier, Proc. of the 1st Int. Conf. on Document Analysis and Recognition, IEEE, 1991, pp. 332-340.
[11] G. Nagy, At the Frontiers of OCR, Proc. IEEE 80(7), IEEE, USA, Jul 1992, pp. 1093-1100.
[12] G. Nagy, Y. Xu, Automatic Prototype Extraction for Adaptive OCR, Proc. of the 4th Int. Conf. on Document Analysis and Recognition, IEEE, Aug 1997, pp. 278-282.
[13] I. Marosi, Industrial OCR Approaches: Architecture, Algorithms and Adaptation Techniques, Document Recognition and Retrieval XIV, SPIE, Jan 2007, 6500-01.