A Comparison of Some Morphological Filters For Improving OCR Performance
1 Introduction
Mathematical morphology offers powerful tools that are widely recognized for
their usefulness in applications, in particular for filtering out many image
defects. The classical openings and closings based on structuring elements are
still widely used and are described in most image analysis textbooks, although
their combinations at various scales, namely granulometries [16], are not as
well known. They are mainly implemented on the usual 4-, 6- or 8-connected
grids. However, there exist several recent variations of these operators,
depending on the space on which they are defined. In this paper, we are
especially interested in graphs, first by considering only the vertices
(corresponding to the pixels) [18,7] and then by considering both edges
(between pixels) and vertices [4,12]. The incentive for using richer space
representations is to enhance performance by achieving sub-pixel accuracy.
This idea has been pushed a step further by considering simplicial complexes
[6] (see [5] for a different point of view), a generalization of graphs.
Although these new frameworks look promising from a theoretical point of view,
to the best of our knowledge no systematic comparison of these old and novel
operators on a dedicated application has yet been performed. The goal of this
paper is to fill that gap, focusing on Optical Character Recognition (OCR).
As connected filters, and especially area openings and closings [19], are
known to be well adapted to document image analysis, we include them in the
present study.
Filtering is generally just one of the many steps that make up the full
application chain. Linear filters can be evaluated by their response to some
model of noise. It is more difficult to apply the same evaluation process to
non-linear morphological filters. This is why we choose to assess the
performance of an OCR engine against a model of noise and degradation dedicated
to documents. OCR is the process of converting a scanned document to
machine-encoded text [15]. Such an operation is generally affected by the
quality of the original document and by the artefacts introduced during the
scanning process. Our performance evaluation is hence a measure of the ability
of the aforementioned morphological operators to improve OCR performance when
used as a preprocessing step on degraded binary document images.
The paper is organized as follows. Section 2 presents the document degradation
model used to alter the documents. Section 3 presents the compared
morphological filters. Section 4 describes the test protocols of this
experiment. The results of the experiments are detailed in Section 5, and
Section 6 discusses them and concludes the paper.
The quality of the document images to be processed is a key point in any
document recognition application. Indeed, the accuracy of the results often
depends on this quality, and drastic failure can be expected if the quality is
too low. For this reason, researchers in the domain have developed models of
document image degradation. A state of the art of this research can be found
in [3]. Document degradation models are designed to simulate the distortions
that are introduced during document scanning, printing and photocopying. These
include global effects (perspective and non-linear illumination) and local
effects (speckle, blur, jitter and threshold). Applications of these
degradation models are numerous; see [8] for a survey. In this paper, we use
these models to carry out a systematic study of the performance of some
morphological filters.
There exist two types of degradation models. As their name implies,
physics-based models [1,2] describe the physics of the printing and imaging
apparatus in as much detail as possible. While they lead to accurate models,
they might be unnecessarily complicated for our purpose. On the other hand,
statistics-based models [10] are much simpler, from both the implementation
and the usage points of view. We thus choose to use this class of models.
Relying on statistics of image distributions, they propose a model of real
document imaging defects. In the context of this experiment, some of these
models are able to generate realistic degradations that are appropriate for an
OCR performance evaluation. Besides, increasing levels of such degradations
can be produced by adjusting the models' parameters, thus allowing for a
proper level of comparison.
The binary document degradation model used in this experiment has been
presented by Kanungo et al. in [9]. This local model, which only applies to
binary images, accounts for two types of document degradation, which are pixel
inversion and blurring.
Pixel inversion simulates image noise usually generated by light intensity
variations, sensor sensitivity and image thresholding, while blurring simulates
the point-spread function of the scanner optical system. The pixel inversion
probability of a background (resp. foreground) pixel is modelled following an ex-
ponential function of its distance from the nearest foreground (resp. background)
pixel as:
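\[ p(0 \mid 1, d, \alpha_0, \alpha) = \alpha_0 \, e^{-\alpha d^2} + \eta, \tag{1} \]
\[ p(1 \mid 0, d, \beta_0, \beta) = \beta_0 \, e^{-\beta d^2} + \eta, \tag{2} \]
\[ \Theta = (\eta, \alpha_0, \alpha, \beta_0, \beta, k). \tag{3} \]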
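The pixel inversion part of the model can be illustrated with a minimal Python sketch. It assumes the usual statement of the model in [9], in which a pixel of value 1 flips with probability alpha0 * exp(-alpha * d^2) + eta and a pixel of value 0 flips with probability beta0 * exp(-beta * d^2) + eta, where d is the distance to the nearest pixel of the opposite value. The function names, the brute-force distance computation and the list-of-lists image representation are illustrative choices, not the implementation used in the experiment, and the blurring step is not covered.

```python
import math
import random


def flip_probability(d, amplitude, decay, eta):
    """Common form of Eqs. (1) and (2): amplitude * exp(-decay * d^2) + eta."""
    return amplitude * math.exp(-decay * d * d) + eta


def nearest_opposite_distance(img, r, c):
    """Brute-force Euclidean distance from (r, c) to the nearest pixel
    holding the other binary value (math.inf if there is none)."""
    value = img[r][c]
    best = math.inf
    for rr, row in enumerate(img):
        for cc, other in enumerate(row):
            if other != value:
                best = min(best, math.hypot(rr - r, cc - c))
    return best


def invert_pixels(img, alpha0, alpha, beta0, beta, eta, rng):
    """Flip each pixel independently; alpha and beta are assumed > 0.

    Pixels of value 1 use (alpha0, alpha), pixels of value 0 use
    (beta0, beta), mirroring the conditioning in Eqs. (1) and (2).
    """
    out = [row[:] for row in img]
    for r, row in enumerate(img):
        for c, value in enumerate(row):
            d = nearest_opposite_distance(img, r, c)
            if value == 1:
                p = flip_probability(d, alpha0, alpha, eta)
            else:
                p = flip_probability(d, beta0, beta, eta)
            if rng.random() < p:
                out[r][c] = 1 - value
    return out
```

Passing a seeded `random.Random` instance as `rng` makes a degradation run reproducible, which helps when the same degraded document has to be fed to several filters.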
3 Morphological filtering
Morphological filters are commonly used to restore or improve the image
quality of digitized documents. They can therefore increase OCR performance
when used as a preprocessing step. This section briefly presents the four
morphological filters that are assessed in our experiment on OCR
preprocessing. Due to space restrictions, the precise definitions of the
operators are not given in this article but can be found in [11].
Morphological area opening and closing filters for binary and greyscale
images have been presented by Vincent in [19]. These operators respectively
remove light and dark regions of the image whose area is smaller than a
parameter λ ∈ ℕ. The 4-adjacency relation was used for these filters in this
experiment.
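For a binary image, the area opening can be sketched as a connected-component filter. The following Python sketch follows the usual definition in [19], where a component is removed when its area is below the size threshold, and obtains the closing by duality. The names `area_opening`, `area_closing` and `lam`, and the convention that light pixels have value 1, are assumptions made for illustration.

```python
from collections import deque


def area_opening(img, lam):
    """Remove 4-connected components of 1-pixels whose area is < lam."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for r in range(h):
        for c in range(w):
            if img[r][c] != 1 or seen[r][c]:
                continue
            # Breadth-first search collects one connected component.
            component, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                component.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and img[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(component) < lam:
                for y, x in component:
                    out[y][x] = 0
    return out


def area_closing(img, lam):
    """Dual operator: fill 4-connected components of 0-pixels of area < lam."""
    inverted = [[1 - v for v in row] for row in img]
    return [[1 - v for v in row] for row in area_opening(inverted, lam)]
```

Because the filter acts on whole components, it either keeps a region untouched or removes it entirely, which is why connected filters do not move the contours of the characters they preserve.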
4 Test protocols
The Tesseract OCR engine, presented in [17], has been used to perform optical
character recognition in this experiment. This system was evaluated by
UNLV-ISRI in 1995 (refer to [13]) along with commercial OCR engines and proved
its top-tier performance at the time. Since then, it has been improved
extensively by Google. In order to get OCR performance results from this
engine on preprocessed documents, the test data and software tools from
UNLV-ISRI presented in [14] have been used. More precisely, we have used a
random selection of 100 instances of 300 DPI binary document images. The test
procedure is basically the iteration of degradation, filtering, OCR analysis
and MSE measure of each document, repeated for each pair (d, λ) of degradation
and filtering parameters. Note, however, that the binary documents used are
scanned versions of real documents, meaning that they are imperfect and
consequently contain noise. Degradation performed on these documents simply
allows for a better comparison of the filters' efficiency in critical
conditions.
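The iteration described above can be sketched as a nested loop over degradation levels, filters and filtering levels. In the following Python sketch every callable (`degrade`, the entries of `filters`, `ocr_accuracy`, `mse`) is a hypothetical placeholder supplied by the caller, and averaging the per-document scores is an assumption made for illustration; the experiment itself relies on the tools of [14] to aggregate accuracy.

```python
def run_protocol(documents, degrade, filters, ocr_accuracy, mse,
                 d_levels, lam_levels):
    """Return {(filter name, d, lam): (mean accuracy, mean MSE)}.

    documents must be non-empty. All callables are placeholders:
    degrade(doc, d), filters[name](img, lam), ocr_accuracy(img),
    mse(img, reference).
    """
    results = {}
    for d in d_levels:
        for name, apply_filter in filters.items():
            for lam in lam_levels:
                accuracies, errors = [], []
                for doc in documents:
                    degraded = degrade(doc, d)
                    filtered = apply_filter(degraded, lam)
                    accuracies.append(ocr_accuracy(filtered))
                    errors.append(mse(filtered, doc))
                n = len(documents)
                results[(name, d, lam)] = (sum(accuracies) / n,
                                           sum(errors) / n)
    return results


def best_level(results, name, d):
    """Filtering level with the highest mean accuracy for one filter and d."""
    candidates = {lam: acc for (f, dd, lam), (acc, _) in results.items()
                  if f == name and dd == d}
    return max(candidates, key=candidates.get)
```

Keeping the degradation, filtering and scoring steps as injected callables makes it straightforward to swap one filter for another while holding the rest of the chain fixed.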
Filtering. The filters used in this experiment are the ASF filters defined on
vertices, graphs and simplicial complexes, as well as a combination of area
closing and area opening filters. The tests have been conducted on both
regular and inverted versions of each document. Furthermore, ASF filters on
vertex and graph spaces have also been evaluated with document resolution
scaling of 3/1 and 3/2 respectively (Usc), so that each filter uses the same
number of iterations. For instance, the results produced by the ASF3 on the
vertex space whose resolution was upscaled by 3 are comparable (with respect
to the size of the removed noise) to the results of ASF3/3 where the
simplicial complex space is built from the image at the original resolution.
Moreover, in both cases, the filters require the same number of iterations
(i.e., each one of them needs three opening/closing iterations) to produce the
result. In the case of binary document filtering, the corresponding upscaled
documents were then binarized with a threshold value of 128 (Thr1), downscaled
to their original size after filtering (Dsc) and binarized again with a
threshold value of 128 (Thr2) in order to preserve the character size for OCR
processing, since the OCR engine that is used only accepts 300 DPI resolution.
Filtering levels were specified for each morphological filter with parameter
λ ∈ ℕ. Detailed settings are described in Table 1. Note that binarization
after upscaling (Thr1) is not performed in the case of vertex filtering,
simply because these documents are already in binary form after an exact
upscaling of 3/1.

Table 1: Filtering parameters and best OCR results.

ID   Filter    Adj.  Usc  Dsc  λ   Char acc (%)  Word acc (%)
F00  None      -     -    -    -    0.02          0.01
F01  Complex   6     -    -    5   59.36         61.60
F02  Complex   6     -    -    3   13.51         32.13
F03  Graph     6     -    -    2   14.81         33.77
F04  Graph     6     -    -    3   50.68         56.89
F05  Graph     6     3/2  2/3  2   13.82         30.23
F06  Graph     6     3/2  2/3  5   48.54         47.53
F07  Graph     4     -    -    2   49.48         46.79
F08  Graph     4     -    -    2   59.02         57.71
F09  Graph     4     3/2  2/3  2   38.06         34.84
F10  Graph     4     3/2  2/3  5   45.54         40.66
F11  Vertices  6     -    -    1   23.52         22.86
F12  Vertices  6     -    -    2   25.71         17.57
F13  Vertices  6     3/1  1/3  2   63.78         66.18
F14  Vertices  6     3/1  1/3  3   65.51         69.07
F15  Vertices  4     -    -    1   33.98         33.43
F16  Vertices  4     -    -    2   28.69         24.24
F17  Vertices  4     3/1  1/3  2   62.14         60.54
F18  Vertices  4     3/1  1/3  3   64.93         66.02
F19  area      4     -    -    6   47.10         44.82
F20  area      4     -    -    6   45.66         44.09

Note: the tick marks of the Inv., Thr1 and Thr2 columns are not reproduced.
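The scaling chain used for vertex filtering (exact 3/1 upscaling, filtering, 1/3 downscaling, then binarization at 128) can be sketched as follows. This is a minimal Python illustration under stated assumptions: pixels take the values 0 and 255, upscaling is done by pixel replication, downscaling by block averaging, and a value equal to the threshold is mapped to 255. None of these resampling choices is specified in the text, and the 3/2 scaling used for graph filtering, which needs interpolation and the extra Thr1 step, is not covered.

```python
def upscale(img, k):
    """Replicate every pixel into a k x k block (exact, stays binary)."""
    return [[v for v in row for _ in range(k)]
            for row in img for _ in range(k)]


def downscale(img, k):
    """Average each k x k block; yields grey levels for 0/255 input."""
    h, w = len(img) // k, len(img[0]) // k
    return [[sum(img[r * k + i][c * k + j]
                 for i in range(k) for j in range(k)) / (k * k)
             for c in range(w)]
            for r in range(h)]


def binarize(img, threshold=128):
    """Map values >= threshold to 255 and the others to 0 (assumed rule)."""
    return [[255 if v >= threshold else 0 for v in row] for row in img]


def filter_at_triple_resolution(img, apply_filter):
    """Usc (3/1), filtering, Dsc (1/3) and Thr2 (128), in that order."""
    enlarged = upscale(img, 3)          # Usc: no Thr1 needed, still binary
    filtered = apply_filter(enlarged)   # any binary filter, e.g. an ASF
    return binarize(downscale(filtered, 3), 128)  # Dsc then Thr2
```

With this arrangement the document handed to the OCR engine keeps its original pixel dimensions, whatever resolution the filter itself worked at.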
OCR analysis. OCR analysis was performed with version 3.02 of the Tesseract
OCR engine, the latest at the time of the experiment. The character and word
accuracy obtained from OCR processing of each document, as well as the 95%
confidence intervals of the accuracy obtained for each set of documents
processed with every pair (d, λ) of degradation and filtering levels, were
then computed with the accuracy, wordacc, accci and wordaccci tools provided
in [14].
MSE measure. Mean squared error has been measured for each processed image
I of dimensions w × h with respect to its unprocessed counterpart O considered
as the ground truth:
\[ \mathrm{MSE} = \frac{1}{w \times h} \sum_{i=0}^{h-1} \sum_{j=0}^{w-1} \left[ I(i,j) - O(i,j) \right]^2 \tag{5} \]
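Equation (5) translates directly into a few lines of Python. The sketch below assumes images stored as lists of rows with numeric pixel values; the function name and the choice of raising `ValueError` on a size mismatch are illustrative.

```python
def mean_squared_error(image, reference):
    """Mean of the squared pixel differences between two same-sized images."""
    h, w = len(image), len(image[0])
    same_shape = (len(reference) == h
                  and all(len(row) == w for row in image)
                  and all(len(row) == w for row in reference))
    if not same_shape:
        raise ValueError("images must have identical dimensions")
    total = sum((image[i][j] - reference[i][j]) ** 2
                for i in range(h) for j in range(w))
    return total / (w * h)
```

If the binary images are stored with values 0 and 255, which is an assumption here, each differing pixel contributes 255^2 = 65025 to the sum, so the MSE is simply 65025 times the fraction of pixels that differ.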
Some observations can be made about the first test protocol. MSE measures
that use binary documents already containing noise as ground truth cannot be
considered a proper evaluation of the filters' denoising ability. In addition,
OCR analysis and MSE measures are also affected by document scaling, which
produces a slight smoothing effect that can impact the results. A second test
protocol has been performed to address these two problems. As character size
is a crucial factor in OCR analysis, this second test protocol focuses only on
MSE performance and has thus been performed on a noise-free binary document
that was not downscaled at all. The test procedure is the iteration of
degradation, filtering and MSE measure of the document, repeated for each pair
(d, λ) of degradation and filtering parameters.
5 Results
In this section, we present the results of the first test protocol in the most
critical tested conditions (d = 4), to better compare the efficiency of each
filtering setting. In figure 3, only the best performing setting is shown for
each filter among regular and inverted document filtering, and Table 1
presents the results obtained by each method for the parameter λ that
maximizes the quality of its results.
As can be observed in figure 1, complex filtering on non-inverted documents
produces better accuracy results than any other filter at original resolution,
with graph filtering closely behind.
One can note that 4-connected and 6-connected vertex filtering at scaled
resolution on inverted documents outperforms complex filtering. However, this
comes at the expense of higher computational time and memory. Indeed,
filtering at triple resolution requires handling 9 times as many pixels as at
the original resolution, whereas the size of a simplicial complex space is
roughly 6 times the number of pixels. With our implementation, filtering in
vertex spaces at triple resolution takes twice as long as filtering in the
complex space.
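The two size factors can be checked with a short count. The derivation below assumes that the simplicial complex is a triangulation of the pixel grid in which each interior vertex has 6 neighbours (consistent with the adjacency of 6 listed for complex filtering in Table 1) and ignores border effects; the symbols V, E and T for the sets of vertices, edges and triangles are introduced here for illustration.

```latex
\begin{align*}
  \text{triple resolution:}\quad & (3w)(3h) = 9\,wh, \\
  \text{simplicial complex:}\quad
    & |V| = wh, \qquad
      |E| \approx \tfrac{6}{2}\,|V| = 3\,wh, \qquad
      |T| \approx \tfrac{2}{3}\,|E| = 2\,wh, \\
    & |V| + |E| + |T| \approx 6\,wh .
\end{align*}
```

Each edge is shared by two vertices and each interior edge by two triangles with three edges each, which gives the factors 6/2 and 2/3.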
[Figure 1 plots, four panels. "Binary - Original resolution - Accuracy -
Degradation level 4" and "Binary - Original resolution - Mean Squared Error -
Degradation level 4", with curves F01, F04, F08, F12, F15 and F19. "Binary -
Scaled resolution - Accuracy - Degradation level 4" and "Binary - Scaled
resolution - Mean Squared Error - Degradation level 4", with curves F01, F06,
F10, F14, F18 and F19. Horizontal axis: Filter level (0 to 6). Vertical axes:
Accuracy (%) (0 to 80) and MSE (0 to 4000).]
Fig. 1: OCR accuracy and MSE measured in the first test after filtering of 100
binary documents at original and scaled resolution. Dashed lines represent word
accuracy while solid lines represent character accuracy.
Finally, MSE results of this first test protocol clearly show that complex
filtering, graph filtering and area filtering are very close at original resolution.
On the other hand, vertex filtering performs best at scaled resolution while graph
filtering at scaled resolution is significantly outperformed in this scenario.
A summary of OCR results obtained in this experiment is presented in Ta-
ble 1 (along with the filtering parameters).
In this section we present the results of the second test protocol in the most
critical tested conditions, to better compare the efficiency of each filtering setting.
As shown in figure 2, where only the best performing setting is shown for each
filter among regular and inverted document filtering, complex filtering of the
binary image shown in figure 4b produces better MSE results than any other
filter tested in these conditions.
Note also the good performance of graph filtering at original and scaled
resolution in the second test protocol, a result that contrasts with the first
test protocol, where graph filtering at scaled resolution was clearly
outperformed.
Additionally, one can notice that 6-connected graph filtering performs better
than 4-connected graph filtering at scaled resolution and that vertex filtering is
outperformed at original resolution but close at scaled resolution.
[Figure 2 plots, two panels. "Binary - Original resolution - Mean Squared
Error - Degradation level 4", with curves F01, F04, F08, F12, F15 and F19
(MSE axis from 400 to 1200). "Binary - Scaled resolution - Mean Squared Error -
Degradation level 4", with curves F01, F06, F10, F14, F17 and F19 (MSE axis
from 400 to 700). Horizontal axis: Filter level (0 to 10).]
Fig. 2: MSE measured in second test after filtering of the binary image shown in
Fig. 4b.
Fig. 3: First test protocol sample on binary documents. Original and degraded
images with binary degradation model, along with best filtering results obtained
on image 3b, under the form [ID : f ].
(a) Crop (b) d = 4 (c) F01 : 8 : 453 (d) F04 : 4 : 560
(e) F06 : 6 : 475 (f) F08 : 4 : 551 (g) F10 : 6 : 511 (h) F12 : 3 : 952
(i) F14 : 8 : 489 (j) F15 : 1 : 763 (k) F17 : 5 : 530 (l) F19 : 7 : 576
Fig. 4: Second test protocol sample on a binary document. Original and degraded
images with binary degradation model, along with best filtering results obtained
on image 4b (d = 4), under the form [ID : f : MSE].
References
[1] Baird, H.S.: Document image defect models. In: Structured Document Image
Analysis, pp. 546–556. Springer (1992)
[2] Baird, H.S.: Calibration of document image defect models. In: Annual
Symp. on Doc. Anal. and Inf. Retr. pp. 1–16 (1993)
[3] Baird, H.S.: The state of the art of document image degradation modelling.
In: Digital Document Processing, pp. 261–279. Springer (2007)
[4] Cousty, J., Najman, L., Dias, F., Serra, J.: Morphological filtering on graphs.
Computer Vision and Image Understanding 117(4), 370–385 (2013)
[5] Dias, F., Cousty, J., Najman, L.: Dimensional operators for mathematical
morphology on simplicial complexes. PRL 47, 111–119 (2014)
[6] Dias, F., Cousty, J., Najman, L.: Some morphological operators on simplicial
complex spaces. In: DGCI 2011. LNCS, vol. 6607, pp. 441–452 (2011)
[7] Heijmans, H., Nacken, P., Toet, A., Vincent, L.: Graph morphology. Journal
of Visual Communication and Image Representation 3(1), 24–38 (1992)
[8] Ho, T.K., Baird, H.S.: Evaluation of OCR accuracy using synthetic data. In:
Annual Symp. on Doc. Anal. and Inf. Retr. (1995)
[9] Kanungo, T., Haralick, R.M., Baird, H.S., Stuezle, W., Madigan, D.: A
statistical, nonparametric methodology for document degradation model
validation. PAMI 22(11), 1209–1223 (2000)
[10] Kanungo, T., Haralick, R.M., Phillips, I.: Global and local document
degradation models. In: Document Analysis and Recognition, 1993., Proceedings
of the Second International Conference on. pp. 730–734. IEEE (1993)
[11] Mennillo, L., Cousty, J., Najman, L.: Morphological filters for OCR:
a performance comparison. Tech. rep. (dec 2012),
http://hal.archives-ouvertes.fr/hal-00762631
[12] Meyer, F., Angulo, J.: Micro-viscous morphological operators. In: ISMM
2007. pp. 165–176. INPE (oct 2007)
[13] Nartker, T.A., Rice, S.V., Jenkins, F.R.: OCR accuracy: UNLV's fourth
annual test. Inform 9(7), 38–46 (jul 1995)
[14] Nartker, T.A., Rice, S.V., Lumos, S.E.: Software tools and test data for
research and testing of page-reading OCR systems. In: Document Recognition
and Retrieval XII. vol. 5676, pp. 37–47. SPIE (2005)
[15] Rice, S.V., Nagy, G., Nartker, T.A.: Optical character recognition: An
illustrated guide to the frontier. Springer (1999)
[16] Serra, J.: Image analysis and mathematical morphology. Academic Press
(1982)
[17] Smith, R.: An overview of the Tesseract OCR engine. In: ICDAR 2007. vol. 2,
pp. 629–633 (2007)
[18] Vincent, L.: Graphs and mathematical morphology. Signal Processing 16(4),
365–388 (1989)
[19] Vincent, L.: Morphological area openings and closings for greyscale images.
In: Shape in Picture. NATO ASI Series, vol. 126, pp. 197–208. Springer Berlin
/ Heidelberg, Driebergen, The Netherlands (sep 1992)