GrayLineExtraction PDF
Document Images
Abedelkadir Asi¹, Raid Saabni²,³, Jihad El-Sana¹
1
{abedas,saabni,el-sana}@cs.bgu.ac.il
ABSTRACT
In this paper we present a new approach for text line segmentation that works directly on gray-scale document images.
Our algorithm constructs a distance transform directly on the
gray-scale images, which is used to compute two types of
seams: medial seams and separating seams. A medial seam
is a chain of pixels that crosses the text area of a text line
and a separating seam is a path that passes between two consecutive rows. The medial seam determines a text line and
the separating seams define the upper and lower boundaries
of the text line. The medial and separating seams propagate
according to energy maps, which are defined based on the
constructed distance transform. We evaluated the approach on several datasets and obtained
encouraging results.
Keywords
Seam Carving, Line Extraction, Multilingual, Signed Distance Transform, Dynamic programming, Handwriting
1. INTRODUCTION
Historical handwritten documents are a valuable part of our cultural heritage, as they provide insights into both tangible and intangible cultural aspects of the past. The need to preserve
these documents calls for a growing global effort to analyze
and process them using techniques from various
scientific fields. Handwritten historical documents pose real
challenges for automatic processing tasks such as image binarization, writer identification, page segmentation, and keyword
searching and indexing. A considerable number of algorithms address these tasks; some provide acceptable results
and are already integrated into working systems.
Document image segmentation into text lines is a major
prerequisite procedure for various document image analysis
tasks, such as word spotting, key-word searching, and text
alignment [1, 2, 3, 4, 5, 6]. Extracting text lines from handwritten document images poses different challenges than those
Permission to make digital or hard copies of part or all of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting with
credit is permitted. To copy otherwise, to republish, to post on servers or to
redistribute to lists, requires prior specific permission and/or a fee.
HIP '11, September 16 - September 17 2011, Beijing, China
Copyright 2011 ACM 978-1-4503-0916-5/11/09...$10.00.
Figure 1: Algorithm flow: (a) the seam-map, (b)
medial seam (blue) and seam fragments (green and
red), (c) seam seeds (green and red), and (d) medial
seam and separating seams.
for the entire page image after the extraction of each text
line. The separating seams determine the text line boundaries, define the region to be updated, and overcome the
limitation of recomputing the energy map.
In the rest of this paper we briefly review related work, describe our approach in detail, and report experimental results. Finally, we conclude and discuss directions for future
work.
2. RELATED WORK
Determining the text lines of a document image is a basic procedure for various document processing applications
and has received tremendous attention over the last several
decades.
Image smearing was among the earliest approaches used to
determine text lines; Wong et al. [10] applied image smearing
to binarized printed document images and Bar-Yosef et al. [11]
used it for historical documents. In top-down approaches, projection profiles along
a predetermined direction are used to estimate the paths separating consecutive text lines [12,
13, 14, 11, 15, 16]. Adaptive local projection profiles are employed to handle multi-skew in document images [11, 17].
The Hough transform is used to compute the direction along which the projection profile is applied and to generate good text line hypotheses [18, 19]. Fuzzy run-length matrices and adaptive
local connectivity maps are applied directly to gray-scale
document images [20, 4, 21]. Tracking minima points to
follow the white-most and black-most pixels along horizontal paths is used to estimate the boundaries and baselines
of text lines [22]. A seam carving approach finds
seams that resemble the baselines of text rows using
a signed-distance-transform-based energy map [8]. Various
grouping techniques, such as heuristic rules, learning algorithms, nearest neighbor, and search trees, are applied to
3. OUR APPROACH
3.1 Energy Map
l(p) = \sum_{i=0}^{n-1} \dots    (1)
3.2 Seam Generation
Seams are computed using dynamic programming, which relies on generating an energy map that encodes the minimal
cost of the valid paths. We refer to this energy map as the
seam-map; it is computed similarly to [8], with a slight modification to generate a salient line structure. We replace the
equal weights for the horizontal and diagonal distances with
different weights that reflect the actual distance on the image (see Equation 2, where w0 = 1 and w1 = w1 = 12 ).
We found that this modification generates accurate energy
maps and produces robust seams. The algorithm determines
the minimal cost path by starting with the minimal cost on
the last column (right column) and traversing the seam-map
backward from right to left.
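
The dynamic-programming step described above can be illustrated with a minimal sketch. The recurrence used here, map[i][j] = energy[i][j] + min over l in {-1, 0, 1} of w_l * map[i + l][j - 1], is an assumed form and not necessarily the one in Equation 2. The function names `build_seam_map` and `trace_seam`, the parameters `w_straight` and `w_diag`, and the weight values in the example are all hypothetical.

```python
def build_seam_map(energy, w_straight, w_diag):
    """Accumulate path costs column by column, left to right.

    ASSUMPTION: each cell adds its own energy to the cheapest weighted
    predecessor in the previous column (same row or a diagonal neighbour).
    """
    rows, cols = len(energy), len(energy[0])
    seam_map = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        seam_map[i][0] = float(energy[i][0])
    for j in range(1, cols):
        for i in range(rows):
            best = w_straight * seam_map[i][j - 1]
            for r in (i - 1, i + 1):  # diagonal predecessors
                if 0 <= r < rows:
                    best = min(best, w_diag * seam_map[r][j - 1])
            seam_map[i][j] = energy[i][j] + best
    return seam_map


def trace_seam(seam_map, w_straight, w_diag):
    """Start at the cheapest cell of the last column and walk right to left."""
    rows, cols = len(seam_map), len(seam_map[0])
    last_column = [seam_map[i][cols - 1] for i in range(rows)]
    i = last_column.index(min(last_column))
    seam = [0] * cols
    seam[cols - 1] = i
    for j in range(cols - 1, 0, -1):
        best_row = i
        best_cost = w_straight * seam_map[i][j - 1]
        for r in (i - 1, i + 1):
            if 0 <= r < rows and w_diag * seam_map[r][j - 1] < best_cost:
                best_row, best_cost = r, w_diag * seam_map[r][j - 1]
        i = best_row
        seam[j - 1] = i
    return seam


# Toy energy map: row 2 is a zero-cost corridor, every other cell costs 1.
energy = [[1.0] * 6 for _ in range(5)]
energy[2] = [0.0] * 6
seam_map = build_seam_map(energy, w_straight=1.0, w_diag=0.7)  # illustrative weights
print(trace_seam(seam_map, w_straight=1.0, w_diag=0.7))  # [2, 2, 2, 2, 2, 2]
```

On this toy input the seam stays on the zero-cost row for any positive weights, so the example does not depend on the particular weight values chosen.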
map[i, j] = \dots    (2)

3.2.1 Medial Seam
certainty(f, ms) = \sum_{i=f_s}^{f_e} |ms(i) - f(i)|    (3)

the seam (see Figure 4). The spring factor k was determined
experimentally; we found that small
values of k, usually 1/d_r, are needed.
3.2.3 Seam Propagation
Growing a seam seed into a separating seam should maintain an appropriate distance from the corresponding medial
seam. The separating seam is guided by the GDT map,
which is computed based on the topography (gray levels) of
the image. Forks in the ridges give rise to separating seams with small differences in their weights, where
the maximal-weight seam may not be the sought seam (see
Figure 5). To overcome this limitation, we incorporate the
distance from the medial seam into the propagation scheme
of the separating seam by integrating a spring model within
the seam propagation scheme. The applied force of the
spring model is used as a weight in the propagation scheme;
i.e., F = k|d_r - d|, where d_r and d are the rest distance
and the distance from the medial seam, respectively, and k
is the spring constant. The rest distance is the average distance between the medial seam and the currently computed
separating seam. This scheme pushes the separating seam
away from the medial seam when it is too close and attracts
the seam toward the medial seam when it moves away from
Figure 5: A document image and its distance transform, where two fork examples are marked with red
rectangles.
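
As a small illustration of the spring model in Section 3.2.3, the sketch below evaluates F = k|d_r - d| and estimates the rest distance as the average gap between the two seams. The function names `rest_distance` and `spring_force` are hypothetical, and representing a seam as a list of row indices (one per column) is an assumption. How the force enters the propagation weights is not shown, since the text does not spell it out.

```python
def rest_distance(medial_rows, separating_rows):
    """Average vertical gap between the medial seam and the part of the
    separating seam computed so far (one row index per column)."""
    gaps = [abs(m - s) for m, s in zip(medial_rows, separating_rows)]
    return sum(gaps) / len(gaps)


def spring_force(d, d_rest, k=None):
    """Spring force magnitude F = k * |d_rest - d|.

    The default k = 1 / d_rest follows the remark that small values of k,
    usually 1/d_r, are used.
    """
    if d_rest <= 0:
        raise ValueError("rest distance must be positive")
    if k is None:
        k = 1.0 / d_rest
    return k * abs(d_rest - d)


d_r = rest_distance([10, 10, 10], [14, 16, 18])  # gaps 4, 6, 8
print(d_r)                      # 6.0
print(spring_force(6.0, d_r))   # 0.0: the seam sits at the rest distance
print(spring_force(3.0, d_r))   # 0.5: too close to the medial seam
print(spring_force(9.0, d_r))   # 0.5: too far from the medial seam
```

The force is zero at the rest distance and grows linearly on either side. This is what allows it to act as a penalty that both repels a separating seam that is too close and pulls back one that drifts away.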
4. EXPERIMENTAL RESULTS
correctness(l) = \frac{correctness(medial(l))}{2} + \frac{correctness(upper(l))}{4} + \frac{correctness(lower(l))}{4}    (4)

DataSet        Correctness (%)                 Stroke Crossing (%)
               Medial   Upper   Lower   Line
Wadod          99       97      97      98     9
Al-Majid       98       96      97      97     2
AUB            96       95      94      95     9
Congress L.    95       93      94      94.2   11

Table 1: The performance of our algorithm on various datasets written in different languages.
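
As a worked check of Equation 4, the short sketch below recomputes the per-line correctness from the medial, upper, and lower scores reported in Table 1. The function name `line_correctness` and the dictionary layout are illustrative choices; the numbers are those listed in the table.

```python
def line_correctness(medial, upper, lower):
    """Equation 4: the medial seam gets weight 1/2, each boundary seam 1/4."""
    return medial / 2 + upper / 4 + lower / 4


# (medial, upper, lower, reported line correctness) per dataset, from Table 1.
table_1 = {
    "Wadod": (99, 97, 97, 98),
    "Al-Majid": (98, 96, 97, 97),
    "AUB": (96, 95, 94, 95),
    "Congress L.": (95, 93, 94, 94.2),
}

for name, (medial, upper, lower, reported) in table_1.items():
    computed = line_correctness(medial, upper, lower)
    print(f"{name}: computed {computed:.2f}, reported {reported}")
```

The computed values (98.00, 97.25, 95.25, 94.25) agree with the reported Line column to within rounding.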
Our approach enables the separating seams to split touching components along the path passing between the lines
and separate them, though not necessarily at the optimal position.
This procedure may split off parts of bracket-shaped ascenders or descenders that besiege the propagating seam and
force it to pass through them, as shown in Figure 7. Nevertheless,
this is easy to fix in a post-processing procedure that examines the cases where the separating seam passes through
local minima. Propagating along the local-minima path usually
reveals whether the crossed shape was a touching component, an ascender, or a descender.
Figure 7: In the last word on the second line (right-to-left), a descender besieges the propagating seam and
forces it to pass through.
Figure 6: Random samples from the tested document images: Arabic, English and Spanish.
The absence of a publicly available database for evaluating line extraction algorithms on gray-scale images drove the
development of our own dataset, which consists of various
historical manuscripts in different languages. We have evaluated our system using 97 Arabic pages (900 lines) from the
Juma Al-Majid Center for Culture and Heritage [35], 70
pages (1050 lines) from the Wadod Center for Manuscripts [36],
40 pages (420 lines) from a 19th-century master's thesis collection at the American University of Beirut (AUB) [37], and
10 pages (150 lines) from the Thomas Jefferson manuscripts located at the Congress Library. Our dataset includes Arabic, English, and Spanish handwritten document images.
The images were selected to include multi-skew, touching/overlapping components, and both regular and irregular
spacing between lines.
Table 1 shows the average performance of our algorithm on datasets of varying quality. Figure 6 presents
samples from the tested datasets. As can be seen, the algorithm performs well regardless of the script used and
generates very good results for languages that include delayed
strokes, dots, and diacritics.
5.
6. ACKNOWLEDGMENT
7. REFERENCES
Europia, Paris, pp. 117-135,
1994.
[25] I. S. I. Abuhaiba, S. Datta, and M. J. J. Holt, "Line
extraction and stroke ordering of text pages," in
ICDAR, 1995, p. 390.
[26] A. Simon, J.-C. Pret, and A. P. Johnson, "A fast
algorithm for bottom-up document layout analysis,"
IEEE Trans. Pattern Anal. Mach. Intell., vol. 19,
no. 3, pp. 273-277, 1997.
[27] Y. Pu and Z. Shi, "A natural learning algorithm based
on Hough transform for text lines extraction in
handwritten documents," in Proceedings of the Sixth
International Workshop on Frontiers of Handwriting
Recognition, 1998, pp. 637-646.
[28] K. Koichi, S. Akinori, and I. Motoi, "Segmentation of
page images using the area Voronoi diagram," Comput.
Vis. Image Underst., vol. 70, no. 3, pp. 370-382, 1998.
[29] S. Nicolas, T. Paquet, and L. Heutte, "Text line
segmentation in handwritten document using a
production system," in Proceedings of the Ninth
International Workshop on Frontiers in Handwriting