
A Fast Algorithm for Bottom-Up Document Layout Analysis

Anikó Simon, Jean-Christophe Pret, and A. Peter Johnson

Abstract—This paper describes a new bottom-up method for document layout analysis. The algorithm was implemented in the CLiDE (Chemical Literature Data Extraction) system (http://chem.leeds.ac.uk/ICAMS/CLiDE.html), but the method described here is suitable for a broader range of documents. It is based on Kruskal's algorithm and uses a special distance metric between the components to construct the physical page structure. The method has all the major advantages of bottom-up systems: independence from different text spacings and independence from different block alignments. The algorithm's computational complexity is reduced to linear by using heuristics and path compression.

Index Terms—Document analysis, physical page layout, bottom-up layout analysis, Kruskal's algorithm, spanning tree, chemical documents.

————————————————
The authors are with the Institute for Computer Applications in Molecular Sciences, School of Chemistry, University of Leeds, Leeds, England. E-mail: {aniko, jcp, johnson}@mi.leeds.ac.uk.
Manuscript received Oct. 3, 1994; revised Dec. 13, 1996. Recommended for acceptance by R. Kasturi. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number P97011.

———————— ✦ ————————

1 INTRODUCTION

All documents have some structure, either perceived by the reader or defined by the basic publishing units such as letters, words, lines, and blocks. The first type of structure is referred to as the logical structure of the document, while the second is termed the physical structure of the page. The logical structure of a document can vary for both subjective reasons (different people can interpret the same document differently) and objective reasons (different types of document contain different logical units). The physical structure of a document is more clearly defined than the logical structure. It contains fewer elements, all of which are well-defined (characters, words, lines, etc.) and independent of their contents. The series of procedures that detect both the physical and the logical structure of a document, and map the two structures onto each other, is termed document layout analysis and understanding.

Document layout analysis finds the components of the page, i.e., the words, lines, and blocks, and forms one of the first steps of all the processing. Results from this stage are important for all the other processes performed after layout analysis: the word and line information is essential for optical character recognition (OCR), and the block information is essential for the extraction of the logical structure of the document, i.e., for understanding the document layout.

Chemical Literature Data Extraction (CLiDE) is a project concerned with chemical document processing that is under development at the University of Leeds [1], [2], [3]. The goals of CLiDE are to process whole pages of scanned chemical documents (e.g., journals or books) and to extract information from both the text and the graphics.

In this paper, we discuss the layout analysis method used in CLiDE. Its originality lies in a novel definition of the distance between the segments and in the detection of the layout hierarchy as a minimal-cost spanning tree. The advantageous features of this method are: no necessity for a priori knowledge about the character size or line spacing; detection of all the elements (words, lines, blocks, columns, stripes); and efficient processing of multicolumned images involving complicated text and graphic arrangements.

2 BACKGROUND

There are two basic approaches to document layout analysis: the top-down and the bottom-up methods. The top-down methods look for global information on the page, e.g., black and white stripes, and on this basis split the page into columns, the columns into blocks, the blocks into lines, and the lines into words. The bottom-up techniques start with local information (concerning the black pixels, or the connected components¹), determine the words first, then merge the words into lines, and the lines into blocks. The two methods can be combined in various ways. A survey was given by Tang et al. [4] in 1994, which is a thorough summary of the published methods up to 1991.

1. A connected component of a bitmap is a set of black pixels in which a route exists between any two pixels of the set through adjacent black pixels.

The global information in the top-down method given by Krishnamoorthy et al. [5] is the white gap between the segments, which is used in the proposed recursive X-Y cut algorithm. Pavlidis and Zhou [6] introduced a method that, even though it employs white streams and projection profile analysis, does not require preliminary skew correction, as the analysis is done on short-height segments. The implied assumption of this method is that the page does not have severe skew. A robust, multilingual top-down method was proposed by Ittner and Baird [7]. After correction of the skew of the image, this method detects the blocks on the basis of the white streams. It generates the minimal spanning tree for detection of the text line orientation, and finally uses the projection profiles of the blocks to find the text lines. All of these methods, as is typical for the top-down technique, work only for Manhattan layouts, i.e., for pages with clear horizontal and vertical white gaps between and within the blocks. Even with this constraint, it is acknowledged that the top-down techniques are important, as a significant proportion of documents have Manhattan layouts. In addition, top-down analysis methods are generally fast, as finding the white gaps is usually a linear algorithm.

In contrast to the above, bottom-up techniques use local information to build up higher information units. O'Gorman [8] has proposed the "docstrum" method, where the relationships of the objects are expressed with polar coordinates (distance and angle). The image is segmented by finding the k nearest-neighbor pairs between the components. Text orientation and the spacing parameters are estimated. With these features the algorithm works for most layouts. Tsujimoto and Asada [9] have proposed a segmentation method that performs the calculations on the run-length representation of the image, which is more efficient than a bitmap representation. Connected components are extracted, and segments are iteratively merged and classified. Saitoh et al. [10] have proposed a similar method, but the reduction of the image is done in both the x and y directions, i.e., the pixels in the reduced image correspond to a square in the original image. A method for detection of the text line orientation according to geometrical-property heuristics has been given as well. Application of a classical method, namely the run-length smearing algorithm (RLSA), introduced by Wong et al. [11], was used by Fan et al. [12]. After the RLSA has been performed, close text lines are merged and the resulting blocks are classified according to a feature-based classification scheme. Bottom-up methods are usually widely applicable to various layouts, but they are generally at least quadratic in time and space, as they calculate "distances" for all unit (connected component) pairs on the page, although reduction of the size of the image reduces the processing time significantly. A combination of the top-down and bottom-up techniques was applied to document segmentation by Taylor et al. [13], where documents are first segmented with the recursive X-Y cut algorithm and the results are then analysed using the RLSA.

Since the Manhattan constraint is not acceptable for chemical documents, a fast bottom-up algorithm that can process a wide variety of complicated text and graphic layouts is needed.

3 THE BOTTOM-UP METHOD IMPLEMENTED IN CLIDE

Units of the physical page structure can be viewed hierarchically and thus can be represented with an n-ary tree (Fig. 1). With this representation the reading order can be deduced from the depth level of the layer. Dashed arrows on the figure mark the reading order (either from top to bottom or from left to right). For example, strips are read from top to bottom on the page, while columns are read from left to right within the strips. If the nodes are ordered according to the arrows at each level, then the preorder traversal [14] of this tree gives the correct reading order of a page.

Fig. 1. Physical layout structure of a page.
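As a concrete illustration of this reading-order rule, the following minimal C++ sketch performs a preorder traversal of an assumed PageNode tree whose children are already stored in reading order; the type and all names are illustrative, not taken from the CLiDE source.

```cpp
// Minimal sketch: a preorder traversal of the physical layout tree yields
// the reading order, provided each node's children are stored in reading
// order (strips top-to-bottom, columns left-to-right, and so on).
#include <iostream>
#include <string>
#include <vector>

struct PageNode {
    std::string label;                // e.g., "page", "strip", "column", "block"
    std::vector<PageNode> children;   // already sorted into reading order
};

void preorder(const PageNode& node, int depth = 0) {
    std::cout << std::string(2 * depth, ' ') << node.label << '\n';
    for (const PageNode& child : node.children)
        preorder(child, depth + 1);   // visit subtrees left to right
}

int main() {
    PageNode page{"page", {
        {"strip 1", {{"column 1", {}}, {"column 2", {}}}},
        {"strip 2", {{"column 1", {}}}},
    }};
    preorder(page);   // prints the units in correct reading order
}
```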

The document layout analysis method implemented in CLiDE builds up the tree structure given in Fig. 1 in a bottom-up manner, i.e., it starts by processing the connected components of the image, and results in a list of words, text lines, text and graphic blocks, columns, and strips. After the image has been loaded, the connected components of the page are found and the noise-like connected components are removed. Document layout analysis starts with the calculation of the distances between the pairs of connected components. If one thinks of the connected components as the vertices of a graph, and of the distances between them as its weighted edges, then words, lines, blocks, etc., can be derived from the minimal-cost spanning tree, built with Kruskal's algorithm [14]. The word "distance" is emphasised because of its novel definition in the implemented algorithm.

The employed method does not need any preliminary separation of the components, and it does not assume that there are only characters on the page. However, it does assume that the text lines of the page are horizontal and that the page is scanned with very little skew, or that skew correction has been performed on the whole image. The method is not particularly sensitive to skew of the page and can correctly segment images with a tolerance of ±5° of skew.

3.1 Definition of the Distance

The distance between any two components (word, line, block, or connected components) of a page is expressed with the distance between their enclosing boxes, as shown in Fig. 2. Let us define this as the maximum distance. The formal definition of the maximum distance is given below.

Fig. 2. The maximum distance for any i and j boxes.

Let us suppose that ∀i ∈ [1, n], CC_i are the connected components of the page. The distance between any two of these, ∀i, ∀j : i, j ∈ [1, n], is defined as

$$D(i, j) = \max\bigl( dx(i, j),\ dy(i, j) \bigr) \tag{1}$$

where

$$dx(i, j) = \begin{cases} 0 & \text{if } L(i, j) < R(i, j) \\ L(i, j) - R(i, j) & \text{otherwise} \end{cases} \tag{2}$$

$$L(i, j) = \max\bigl( CC_i.\text{left}_x,\ CC_j.\text{left}_x \bigr) \tag{3}$$

$$R(i, j) = \min\bigl( CC_i.\text{right}_x,\ CC_j.\text{right}_x \bigr) \tag{4}$$

The definition of dy(i, j) is similar to that of dx(i, j), but takes the top and bottom instead of the leftmost and rightmost points of the enclosing box.
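Eqs. (1)-(4) translate directly into code. The following minimal C++ sketch is an illustration under assumed names; the Box type and its fields are not from the CLiDE source.

```cpp
// Minimal sketch of the maximum distance of Eqs. (1)-(4).
#include <algorithm>

struct Box {
    double leftX, rightX;   // horizontal extent of the enclosing box
    double topY, bottomY;   // vertical extent of the enclosing box
};

// dx(i, j) of Eq. (2): zero when the x-intervals overlap, otherwise the
// gap between the facing edges, via L (Eq. 3) and R (Eq. 4).
double dx(const Box& i, const Box& j) {
    double L = std::max(i.leftX, j.leftX);
    double R = std::min(i.rightX, j.rightX);
    return (L < R) ? 0.0 : L - R;
}

// dy(i, j) is analogous, using the top and bottom of the enclosing boxes.
double dy(const Box& i, const Box& j) {
    double T = std::max(i.topY, j.topY);
    double B = std::min(i.bottomY, j.bottomY);
    return (T < B) ? 0.0 : T - B;
}

// D(i, j) = max(dx(i, j), dy(i, j)) of Eq. (1). The distance is zero
// exactly when the two enclosing boxes intersect or touch.
double maxDistance(const Box& i, const Box& j) {
    return std::max(dx(i, j), dy(i, j));
}
```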

3.2 The Applied Kruskal's Algorithm

Let us consider the connected components of an image to be the vertices of an undirected graph, and the distances between any two of them as the edges. The complete graph is therefore G = (V, E), with |V| = n and |E| = n(n − 1)/2, where n is the number of connected components. Let us call this graph the page graph.

Each edge (i, j) (that is, between vertices i and j) in this graph has a cost attached to it: the distance between the i and j connected components. A spanning tree [14] of a connected graph is a tree that contains all the vertices of the graph, with all those vertices connected. A minimal-cost spanning tree of a graph is the spanning tree for which the sum of the edge costs is minimal compared to all other spanning trees of the same graph. There are several well-known methods for generating a minimal-cost spanning tree of a connected graph [14]. One of these is Kruskal's algorithm.

A minimal spanning tree of the page graph, generated with Kruskal's algorithm [14], maps exactly the closest connections between the physical units of the page. In addition, there is a one-to-one correspondence between the physical units of the page and the subtrees of a Kruskal minimal spanning tree of the page graph. The reason for this lies in the way Kruskal's method builds the minimal-cost spanning tree.

The minimal-cost spanning tree is built by always inserting the smallest of the remaining unused distances. Hence in each step of the algorithm, the actual state contains some number of components that have the smallest internal distances at the current level (initially all the vertices are in different components). These components therefore have the largest cohesion at the current level. At the state when the next smallest distance doubles or triples the last internal distance, there is a change of layer. For example, a component that had contained only the characters of a word comes, with the new distance (which is much larger than the maximum of the previous ones), to contain another word, the two building together a line (see Fig. 3; valid internal word distances are labelled in the order of insertion into the graph: d1, d2, etc.; invalid distances are marked with a dashed cross. The distances d_{12}^{1,2,3} are invalid in all three cases, because d_{12}^{1,2,3} ≫ d_{11}^{1,2,3}). Applying this reasoning generally to all the layers of the tree, with some additional heuristics about the coefficients of the layers, the components of the physical layout can be identified.

Fig. 3. Generation of word components of the image.
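To make the layer-change criterion concrete, here is a minimal C++ sketch of one Kruskal pass with the break rule, under stated assumptions: the Edge type, all names, and the fixed coefficient of 2.0 (standing in for "doubles or triples") are illustrative, since the paper describes the per-layer coefficients only qualitatively.

```cpp
// Illustrative sketch: one Kruskal pass that merges vertices until the next
// smallest distance jumps past coeff times the last accepted distance,
// which signals a change of layer (e.g., intra-word to inter-word gaps).
#include <algorithm>
#include <numeric>
#include <vector>

struct Edge { int u, v; double cost; };

struct DisjointSet {
    std::vector<int> parent;
    explicit DisjointSet(int n) : parent(n) {
        std::iota(parent.begin(), parent.end(), 0);
    }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    bool unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return false;
        parent[b] = a;
        return true;
    }
};

// Returns a component label per vertex; each component is one unit of the
// next layer (a word when the vertices are connected components, a line
// when the vertices are words, and so on).
std::vector<int> mergeOneLayer(int n, std::vector<Edge> edges, double coeff = 2.0) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.cost < b.cost; });
    DisjointSet ds(n);
    double lastAccepted = 0.0;
    for (const Edge& e : edges) {
        if (lastAccepted > 0.0 && e.cost > coeff * lastAccepted)
            break;                   // the distance "doubles": layer boundary
        if (ds.unite(e.u, e.v))
            lastAccepted = e.cost;   // costs are non-decreasing after the sort
    }
    std::vector<int> label(n);
    for (int i = 0; i < n; ++i) label[i] = ds.find(i);
    return label;
}
```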
The above explanation requires that the hierarchically ordered physical units have correspondingly ordered internal cohesion. This condition is not satisfied in all types of layouts. For example, in some documents there exist lines with zero distance between them, while there is always a non-zero distance between the words of a line (see Fig. 4), i.e., in these cases the internal distance of a block is smaller than the internal distance of a line. This clearly shows that the spatial (i.e., the distance) criterion alone is not enough to map the physical layout of the page. The case shown in Fig. 4 does not cause any problem to the human reader, because of the adopted perception that words belong to the same line only if they are written so. Therefore one of our heuristics for word detection requires that words which belong to the same line have a vertical intersection² of at least 70% of their heights. Another failure mode arises if, in the sequence of distances, the distances between large-font characters within a word fall in between the distances of small-font characters within and between words; the steepness of the sequence is then lost. If this case is typical for the processed document types, appropriate heuristics should be applied.

2. The vertical intersection of boxes i, j is defined as the intersection of the two y-intervals determined by the ymin and ymax coordinates of the boxes, i.e., [ymin_i, ymax_i] and [ymin_j, ymax_j].

Fig. 4. Overlapping lines are not merged together.
3.3 Using the Path-Compression Algorithm and Some Heuristics for Reducing Processing Time

The total time required by Kruskal's algorithm depends upon the method by which the insertion of the next minimal edge is performed. A common approach is to use a path-compression algorithm [14]. In following a path from some node to the root, this method makes each node encountered along the path a direct child of the root [14], thereby speeding up the finding of a root for a node in subsequent searches. This speeds up Kruskal's algorithm to O(eα(e)), where e = |E| is the number of edges of the graph, and α(x) is the inverse of Ackermann's function [14].
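A minimal sketch of find with path compression follows, written iteratively so the re-linking step is explicit; it is the textbook structure of [14] under assumed C++ names, not CLiDE's own code.

```cpp
// Minimal sketch of find with path compression: every node visited on the
// way to the root is made a direct child of the root, so later searches
// along the same path run in near-constant time.
#include <vector>

struct UnionFind {
    std::vector<int> parent;
    explicit UnionFind(int n) : parent(n) {
        for (int i = 0; i < n; ++i) parent[i] = i;  // each vertex starts as its own root
    }
    int find(int x) {
        int root = x;
        while (parent[root] != root)
            root = parent[root];       // first pass: locate the root
        while (parent[x] != root) {
            int next = parent[x];
            parent[x] = root;          // second pass: compress the path
            x = next;
        }
        return root;
    }
    void unite(int a, int b) { parent[find(a)] = find(b); }
};
```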
The next problem is to reduce the number of edges in the page graph, as calculation of all the edges still requires O(n²) work, where n = |V| is the number of connected components (i.e., vertices). We propose to calculate only the probably necessary edges instead of all of them. Calculating all of the distances is unnecessary, because it is known that components that have no vertical intersection are not in the same word (with the exception of 'i', 'j', etc., but these can be merged with the bulk of the word in an additional loop). Likewise, components that are placed horizontally at a distance larger than a certain proportion of their height are not adjacent letters of a word, or are not in the same word at all. If the connected components are initially sorted by their y-coordinates, the calculation of the necessary distances can be done with a linear algorithm. After a layer has been extracted (e.g., the words have been found), the graph is regenerated on the new elements, whose cardinality is much smaller than that of the previous layer. Hence for the detection of words the vertices of the graph are the connected components, for the detection of lines the vertices are the words, for the detection of blocks the vertices are the lines, etc.

It is especially important to compute only the probably necessary edges at the first two levels (words and lines), where the number of vertices of the graph is much larger than at the upper levels. Assuming that a document has only horizontal text lines, the unnecessary edges for the detection of the words (lines) are:

• edges between characters (words) that have no vertical intersection, or whose vertical intersection is smaller than some threshold (30% (50%) of their height for the words (lines));
• edges between characters (words) that are placed horizontally at a distance larger than some threshold (20% (150%) of their height for the words (lines));
• edges between characters (words) that are separated by a graphic line (a vertical or horizontal page separator).

If the components are initially sorted by the ymax values of their enclosing boxes and the above conditions, with the strict thresholds, are applied, then the distance calculation—i.e., the calculation of the edges of the graph—can be performed in O(n) time and space. Thus both phases are linear: the detection of the words with respect to the number of connected components, and the detection of the lines with respect to the number of words. A sketch of this candidate-edge generation is given below.
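The following C++ sketch illustrates the word-level filtering under stated assumptions. The 30% and 20% thresholds and the initial y-sort come from the text above; the types, names, and the early-exit rule (sorting on ymin so the inner scan stops once vertical intersection becomes impossible) are illustrative choices, and the graphic-separator test of the third bullet is omitted.

```cpp
// Illustrative sketch of "probably necessary" edge generation at the word
// level: each component only pairs with near neighbours in the y-sorted
// order, which keeps the number of computed distances close to linear.
#include <algorithm>
#include <vector>

struct Box { double leftX, rightX, yMin, yMax; };
struct Edge { int u, v; double cost; };

static double height(const Box& b) { return b.yMax - b.yMin; }

static double verticalIntersection(const Box& a, const Box& b) {
    return std::min(a.yMax, b.yMax) - std::max(a.yMin, b.yMin);  // negative if disjoint
}

static double horizontalGap(const Box& a, const Box& b) {       // dx of Eq. (2)
    double L = std::max(a.leftX, b.leftX);
    double R = std::min(a.rightX, b.rightX);
    return (L < R) ? 0.0 : L - R;
}

std::vector<Edge> candidateWordEdges(std::vector<Box> boxes) {
    std::sort(boxes.begin(), boxes.end(),
              [](const Box& a, const Box& b) { return a.yMin < b.yMin; });
    std::vector<Edge> edges;
    for (int i = 0; i < (int)boxes.size(); ++i) {
        for (int j = i + 1; j < (int)boxes.size(); ++j) {
            const Box& a = boxes[i];
            const Box& b = boxes[j];
            if (b.yMin > a.yMax) break;  // no later box can vertically intersect box i
            double h = std::min(height(a), height(b));
            if (verticalIntersection(a, b) < 0.3 * h) continue;  // 30% threshold
            double gap = horizontalGap(a, b);
            if (gap > 0.2 * h) continue;                         // 20% threshold
            // With vertical intersection present, dy = 0, so D(i, j) = dx(i, j).
            edges.push_back({i, j, gap});
        }
    }
    return edges;  // edge indices refer to positions in the sorted vector
}
```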
In contrast to the first two layers, the computational complexity of the other phases of layout detection (the detection of the blocks, the columns, and the strips) does not greatly influence the total processing time, as the number of vertices of the graph has by then been drastically reduced. The distance between every pair of components is calculated, and the pure minimal spanning tree generation using Kruskal's algorithm is carried out. The only condition at each layer is the one originating from the adopted way of representing the unit, i.e., the elements of blocks and columns have to have horizontal intersection, while the elements of strips have to have vertical intersection. As the computational complexity of Kruskal's algorithm with path compression is O(eα(e)), where e is the number of edges of the graph and is quadratic with respect to the number of vertices, this is a good upper limit for the complexity in this situation (as we have restrictive conditions).

To summarize, using the above heuristics and the path-compression method, the layout analysis can be performed in linear time with respect to the number of connected components.
3.4 Classification of the Segments

In the first three phases of the layout analysis, i.e., the generation of the words, the lines, and the blocks, the segments are classified into two groups: text and graphic, where graphic refers to line drawings (the classification could be extended to half-tone pictures, but these are not typical in the chemical literature). This classification is based on the classification features proposed by Tsujimoto and Asada [9], e.g., the height of the component, the aspect ratio of the enclosing box, and the percentage of black pixels per unit.

The class of a segment is inherited from the previous, lower layers, i.e., if a word has been classified as graphic, then on expanding it into a line it will automatically become a graphic line, and thereafter a graphic block. It is important to classify the segments already at the word level, as the necessary-edges heuristics are different for text and graphics: text words must have a significant vertical intersection to be considered as belonging to the same line, while graphic words are merged with any other word that is close enough to their enclosing boxes.
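As an illustration of such a feature-based test, the sketch below classifies a segment from the three features named above. All threshold values are invented for the sketch; the paper takes the features from Tsujimoto and Asada [9] without quoting numeric cut-offs.

```cpp
// Hypothetical sketch of a text/graphic test built on the three features
// named in the text; the thresholds are invented for illustration only.
enum class SegmentClass { Text, Graphic };

struct Features {
    double height;        // height of the enclosing box, in pixels
    double aspectRatio;   // width / height of the enclosing box
    double blackDensity;  // black pixels per unit of enclosing-box area
};

SegmentClass classify(const Features& f) {
    const double maxTextHeight   = 40.0;  // assumed value, e.g., for a 300 dpi scan
    const double maxTextAspect   = 20.0;  // assumed: characters are not extremely elongated
    const double minBlackDensity = 0.05;  // assumed: characters are not extremely sparse
    // Unusually tall, very elongated, or very sparse components are taken
    // to be line drawings rather than characters.
    if (f.height > maxTextHeight ||
        f.aspectRatio > maxTextAspect ||
        f.blackDensity < minBlackDensity)
        return SegmentClass::Graphic;
    return SegmentClass::Text;
}
```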

4 RESULTS

The described document layout analysis method has been implemented in C++ and compiled on a Sun 4/25 workstation (21 MHz) and on an SG Indy (R4600 PC, 100 MHz). It has been tested on 98 document pages, covering books (31 pages), reports (25 pages), and journals (37 pages, from eight different chemistry journals). The considered pages included chemical reaction drawings with text insets, title pages with blocks of variable font sizes, double-column pages with full-page-width figures, etc. The algorithm has performed at an average speed of 2,000 connected components/sec of CPU time on the Sun (8,000 CC/sec of CPU time on the SGI), which means a processing time of about two seconds for a dense journal page image with 4,000 connected components on the Sun (approx. 0.5 sec on the SGI). Detailed numerical results of the layout analysis, the cardinality of the found segments and the required processing time on the Sun, are given for some of the test images in Table 1.

The accuracy of the method has proved to be good. The 98 test images were analysed with an error rate of 1% at the block level (there were approx. 2,000 blocks in total, with only 20 segmentation and/or classification errors being made); however, these errors occurred in seven different images, meaning that 7.14% of the images required manual correction. The segmentation errors were mainly due to accidental proximity of different types of objects (e.g., a graphic region being too close to a label region), while the classification errors always occurred in cases of small graphics containing only small dashes and letters. Fig. 5 illustrates the results of the block detection phase for one of the test images (JOC3 in Table 1).

Fig. 5. The original image (JOC, vol. 59, no. 9, page 2,328) on the left; the found text and graphic (bold frames) blocks are shown on the right.

5 SUMMARY

A new bottom-up document layout analysis algorithm has been presented. It detects the layout hierarchy as a minimal-cost spanning tree. A new distance definition, the maximum distance of the components, corresponds to the cost of an edge of the spanning tree. With additional heuristics the computational complexity of the algorithm is less than that of previous bottom-up techniques, as it is linear with respect to the number of connected components. It enables both Manhattan and a wide variety of other documents to be processed as quickly as with the top-down methods. These two aspects, generality and speed, are the major advantages of the CLiDE document layout analysis method. Experimental results amply demonstrate the accuracy and effectiveness of the method.

TABLE 1
SOME STATISTICAL RESULTS OF PAGE LAYOUT ANALYSIS

Image   CCs    Calc. dist. pairs   Words   Lines   Blocks   Columns   Strips   Time (s)   Speed (CC/s)

Only text on page:
JACS1   5364   4104                1075    101     20       9         4        2.77       1936.79
JACS2   5678   4448                1080    111     30       9         4        2.92       1944.52

Graphic and text on page:
JACS3   4592   3681                787     96      25       9         4        2.35       1954.04
JACS4   2995   2413                562     195     18       7         5        1.53       1953.34
JOC1    3623   2887                603     114     33       8         5        1.73       2094.22
JOC2    3976   3157                706     117     35       5         2        2.15       1849.30
JOC3    3901   3188                649     105     37       2         1        1.92       2031.77

Mainly graphic, but some text present on page:
JMC1    1139   1037                171     59      16       3         3        0.54       2109.26

Processing time is given on the Sun 4/25 platform.

APPENDIX

Abbreviations of the journal names used in Fig. 5 and Table 1:

JACS  Journal of the American Chemical Society
JOC   Journal of Organic Chemistry
JMC   Journal of Medicinal Chemistry

ACKNOWLEDGMENTS

We are thankful to Valerie Gillet and Zsolt Zsoldos for helping us to structure this article, and to Chemical Abstract Services for providing us with the test images.

REFERENCES

[1] P. Ibison, F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel, and A.P. Johnson, "Chemical Structure Recognition and Generic Text Interpretation in the CLiDE Project," Proc. Online Information 92, London, England, 1992.
[2] F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel, and A.P. Johnson, "Chemical Literature Data Extraction: Bond Crossing in Single and Multiple Structures," Proc. Int'l Chemical Information Conf., Annecy, France, 1992.
[3] P. Ibison, M. Jacquot, F. Kam, A.G. Neville, R.W. Simpson, C. Tonnelier, T. Venczel, and A.P. Johnson, "Chemical Literature Data Extraction: The CLiDE Project," J. Chem. Inf. Comput. Sci., vol. 33, no. 3, pp. 338-344, 1993.
[4] Y.Y. Tang, C.D. Yan, and C.Y. Suen, "Document Processing for Automatic Knowledge Acquisition," IEEE Trans. Knowledge and Data Engineering, vol. 6, no. 1, pp. 3-21, 1994.
[5] M. Krishnamoorthy, G. Nagy, S. Seth, and M. Viswanathan, "Syntactic Segmentation and Labeling of Digitized Pages from Technical Journals," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 7, pp. 737-747, 1993.
[6] T. Pavlidis and J. Zhou, "Page Segmentation and Classification," CVGIP: Graphical Models and Image Processing, vol. 54, no. 6, pp. 484-496, 1992.
[7] D.J. Ittner and H.S. Baird, "Language-Free Layout Analysis," Proc. Second Int'l Conf. Document Analysis and Recognition (ICDAR '93), pp. 336-340, Japan, Oct. 1993.
[8] L. O'Gorman, "The Document Spectrum for Page Layout Analysis," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15, no. 11, pp. 1,162-1,173, 1993.
[9] S. Tsujimoto and H. Asada, "Major Components of a Complete Text Reading System," Proc. IEEE, vol. 80, no. 7, pp. 1,133-1,149, 1992.
[10] T. Saitoh, T. Yamaai, and M. Tachikawa, "Document Image Segmentation and Layout Analysis," IEICE Trans. Information and Systems, vol. 77, no. 7, pp. 778-784, 1994.
[11] K.Y. Wong, R.G. Casey, and F.M. Wahl, "Document Analysis System," IBM J. Research and Development, vol. 26, no. 6, pp. 647-656, 1982.
[12] K.-C. Fan, C.-H. Liu, and Y.-K. Wang, "Segmentation and Classification of Mixed Text/Graphics/Image Documents," Pattern Recognition Letters, vol. 15, no. 12, pp. 1,201-1,209, 1994.
[13] S.L. Taylor, D.A. Dahl, M. Lipshutz, C. Weir, L.M. Norton, R.W. Nilson, and M.C. Linebarger, "Integrating Natural Language Understanding with Document Structure Analysis," Artificial Intelligence Rev., vol. 8, no. 2, pp. 255-276, 1994.
[14] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, Data Structures and Algorithms. Reading, Mass.: Addison-Wesley, 1983.
