A Fast Algorithm for Bottom-Up Document Layout Analysis

Anikó Simon, Jean-Christophe Pret, and A. Peter Johnson

Abstract—This paper describes a new bottom-up method for document layout analysis. The algorithm was implemented in the CLiDE (Chemical Literature Data Extraction) system (http://chem.leeds.ac.uk/ICAMS/CLiDE.html), but the method described here is suitable for a broader range of documents. It is based on Kruskal's algorithm and uses a special distance metric between the components to construct the physical page structure. The method has all the major advantages of bottom-up systems: independence from different text spacings and independence from different block alignments. The algorithm's computational complexity is reduced to linear by using heuristics and path compression.

Index Terms—Document analysis, physical page layout, bottom-up layout analysis, Kruskal's algorithm, spanning tree, chemical documents.

1 INTRODUCTION

All documents have some structure, either perceived by the reader or defined by the basic publishing units such as letters, words, lines, and blocks. The first type of structure is referred to as the logical structure of the document, while the second is termed the physical structure of the page. The logical structure of a document can vary for both subjective reasons (since different people can interpret the same document differently) and objective reasons (different types of document contain different logical units). The physical structure of a document is more clearly defined than the logical structure. It contains fewer elements, all of which are well defined (characters, words, lines, etc.) and independent of their contents. The series of procedures that detect both the physical and the logical structure of a document, and map the two structures onto each other, is termed document layout analysis and understanding.

Document layout analysis finds the components of the page, i.e., the words, lines, and blocks, and forms one of the first steps of all the processing. Results from this stage are important for all the other processes performed after layout analysis: the word and line information is essential for optical character recognition (OCR), and the block information is essential for the extraction of the logical structure of the document, i.e., for understanding the document layout.

Chemical Literature Data Extraction (CLiDE) is a project concerned with chemical document processing that is under development at the University of Leeds [1], [2], [3]. The goals of CLiDE are to process whole pages of scanned chemical documents (e.g., journals or books) and to extract information from both the text and the graphics.

In this paper, we discuss the layout analysis method used in CLiDE. Its originality lies in a novel definition of the distance between the segments and in the detection of the layout hierarchy as a minimal-cost spanning tree. The advantageous features of this method are: no need for a priori knowledge of the character size or line spacing; detection of all the elements (words, lines, blocks, columns, stripes); and efficient processing of multicolumn images involving complicated text and graphic arrangements.

————————————————
• The authors are with the Institute for Computer Applications in Molecular Sciences, School of Chemistry, University of Leeds, Leeds, England. E-mail: {aniko, jcp, johnson}@mi.leeds.ac.uk.
Manuscript received Oct. 3, 1994; revised Dec. 13, 1996. Recommended for acceptance by R. Kasturi.

2 BACKGROUND

There are two basic approaches to document layout analysis: the top-down and the bottom-up methods. The top-down methods look for global information on the page, e.g., black and white stripes, and on this basis split the page into columns, the columns into blocks, the blocks into lines, and the lines into words. The bottom-up techniques start with local information (concerning the black pixels, or the connected components¹), determine first the words, then merge the words into lines, and the lines into blocks. The two methods can be combined in various ways. A survey was given by Tang et al. [4] in 1994; it is a thorough summary of the methods published up to 1991.

1. A connected component of a bit map is a set of black pixels in which a route exists between any two pixels of the set through adjacent black pixels.

The global information in the case of the top-down method given by Krishnamoorthy et al. [5] is the white gap between the segments, which is used in the proposed recursive X-Y cut algorithm. Pavlidis and Zhou [6] introduced a method that, even though it employs white streams and projection-profile analysis, does not require preliminary skew correction, as the analysis is done on short-height segments. The implied assumption of this method is that the page does not have severe skew. A robust, multilingual top-down method was proposed by Ittner and Baird [7]. After correcting the skew of the image, this method detects the blocks on the basis of the white streams. It generates the minimal spanning tree to detect the text-line orientation, and finally uses the projection profiles of the blocks to find the text lines. All of these methods, as is typical for the top-down technique, work only for Manhattan layouts, i.e., for pages with clear horizontal and vertical white gaps between and within the blocks. Even with this constraint, top-down techniques are acknowledged to be important, as a significant proportion of documents have Manhattan layouts. In addition, top-down analysis methods are generally fast, as finding the white gaps is usually a linear algorithm.

In contrast to the above, bottom-up techniques use local information to build up higher information units. O'Gorman [8] has proposed the "docstrum" method, where the relationships of the objects are expressed with polar coordinates (distance and angle). The image is segmented by finding the k nearest-neighbor pairs between the components. Text orientation and the spacing parameters are estimated, and with these features the algorithm works for most layouts. Tsujimoto and Asada [9] have proposed a segmentation method that performs the calculations on the run-length representation of the image, which is more efficient than a bitmap representation. Connected components are extracted, and segments are iteratively merged and classified. Saitoh et al. [10] have proposed a similar method, but the reduction of the image is done in both the x and y directions, i.e., a pixel in the reduced image corresponds to a square in the original image. A method for detecting the text-line orientation according to geometrical-property heuristics has been given as well. Application of a classical method, namely the run-length smearing algorithm (RLSA) introduced by Wong et al. [11], was used by Fan et al. [12]. After the RLSA has been performed, close text lines are merged and the resulting blocks are classified according to a feature-based classification scheme. Bottom-up methods are usually widely applicable to various layouts, but they are generally at least quadratic in time and space, as they calculate "distances" for all unit
(connected component) pairs in the page, although reduction of the size of the image reduces the processing time significantly. A combination of the top-down and bottom-up techniques was applied to document segmentation by Taylor et al. [13], where documents are first segmented with the recursive X-Y cut algorithm and the results are then analysed using the RLSA.

Since the Manhattan constraint is not acceptable for chemical documents, a fast bottom-up algorithm that can process a wide variety of complicated text and graphic layouts is needed.

3.1 Definition of the Distance

The distance between any two components (word, line, block, or connected components) of a page is expressed as the distance between their enclosing boxes, as shown in Fig. 2. Let us define this as the maximum distance. The formal definition of the maximum distance is:

    D(i, j) = max{dx(i, j), dy(i, j)}    (1)

where

    dx(i, j) = L(i, j) − R(i, j)    (2)

    L(i, j) = max{CC_i.left_x, CC_j.left_x}    (3)

and

    R(i, j) = min{CC_i.right_x, CC_j.right_x}.    (4)

The definition of dy(i, j) is similar to that of dx(i, j), but takes the top and bottom instead of the leftmost and rightmost points of the enclosing box.
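As an illustration of (1)-(4), the following C++ sketch computes the maximum distance of two enclosing boxes. The Box structure and its field names are our own illustrative choices, not the CLiDE data structures.

    #include <algorithm>

    // Enclosing box of a component; the field names are illustrative only.
    struct Box {
        int leftx, rightx;   // horizontal extent
        int top, bottom;     // vertical extent (top < bottom in image coordinates)
    };

    // Gap between the intervals [a1, a2] and [b1, b2]; negative if they overlap.
    static int gap(int a1, int a2, int b1, int b2) {
        return std::max(a1, b1) - std::min(a2, b2);   // L(i, j) - R(i, j)
    }

    // Maximum distance D(i, j) = max{dx(i, j), dy(i, j)} of (1).
    int maxDistance(const Box& i, const Box& j) {
        int dx = gap(i.leftx, i.rightx, j.leftx, j.rightx);   // (2)-(4)
        int dy = gap(i.top, i.bottom, j.top, j.bottom);       // same, with top/bottom
        return std::max(dx, dy);
    }

For two boxes that overlap in one direction, the corresponding gap is negative, so the maximum distance is governed by the direction in which the boxes are actually separated.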
the algorithm, the actual state contains some number of components that have the smallest internal distances at the current level (initially all the vertices are in different components); these components therefore have the largest cohesion at the current level. At the state when the next smallest distance is two or three times the last internal distance, there is a change of layer. For example, a component that had previously contained only the characters of one word comes, with the new distance, which is much larger than the maximum of the previous ones, to contain another word as well, the two building a line together (see Fig. 3: valid internal word distances are labelled in the order of their insertion into the graph as d1, d2, etc., and invalid distances are marked with a dashed cross; the distances d12^{1,2,3} are invalid in all three cases because d12^{1,2,3} >> d11^{1,2,3}). Applying this reasoning generally to all the layers of the tree, with some additional heuristics about the coefficients of the layers, the components of the physical layout can be identified.
Fig. 3. Generation of word components of the image.
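As a sketch of this layer-change test, the following C++ fragment consumes edges in Kruskal (increasing-distance) order and cuts the sequence where the next distance is a given multiple of the largest distance accepted so far. The Edge record, the function name, and the layerFactor parameter (about 2-3, following the observation above) are our own illustrative assumptions, not the CLiDE implementation; the union step of Kruskal's algorithm is indicated only by a comment.

    #include <algorithm>
    #include <vector>

    struct Edge { int u, v; double dist; };   // u, v index components of the current layer

    // Accept edges in increasing order of distance and stop at the first jump
    // in the distance sequence, which marks a change of layer (e.g., the step
    // from intra-word spacing to inter-word spacing).
    std::vector<Edge> edgesWithinLayer(std::vector<Edge> edges, double layerFactor) {
        std::sort(edges.begin(), edges.end(),
                  [](const Edge& a, const Edge& b) { return a.dist < b.dist; });
        std::vector<Edge> accepted;
        double maxInternal = 0.0;
        for (const Edge& e : edges) {
            if (maxInternal > 0.0 && e.dist > layerFactor * maxInternal)
                break;                   // d_new >> d_previous: a new layer starts here
            accepted.push_back(e);       // a full Kruskal run would also unite(u, v) here
            maxInternal = std::max(maxInternal, e.dist);
        }
        return accepted;
    }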
The above explanation requires that the hierarchically ordered physical units have ordered internal cohesion. This condition is not satisfied in all types of layout. For example, in some documents there exist lines with zero distance between them, while there is always a nonzero distance between the words of a line (see Fig. 4); i.e., in these cases the internal distance of a block is smaller than the internal distance of a line. This clearly shows that the spatial (i.e., distance) criterion alone is not enough to map the physical layout of the page. The case shown in Fig. 4 does not cause any problem to the human reader, because of the adopted perception that words belong to the same line only if they are written so. Therefore, one of our heuristics for word detection requires that words which belong to the same line have a vertical intersection² of at least 70% of their heights. Another failure mode occurs if, in the distance sequence, the distances of large-font characters within a word fall between the distances of small-font characters within and between words; in this case the steepness of the sequence is lost. If this case is typical for the processed document types, appropriate heuristics should be applied.

2. The vertical intersection of boxes i and j is defined as the intersection of the two y-intervals determined by the y_min and y_max coordinates of the boxes, i.e., [y_min^i, y_max^i] and [y_min^j, y_max^j].

3.3 Using the Path-Compression Algorithm and Some Heuristics for Reducing Processing Time

The total time required for the Kruskal algorithm depends upon the method by which the insertion of the next minimal edge is performed. A common approach is to use a path-compression algorithm [14]. In following a path from some node to the root, this method makes each node encountered along the path a direct child of the root [14], thereby speeding up the finding of a root for a node in subsequent searches. This speeds up the Kruskal algorithm to O(eα(e)), where e = |E|, i.e., e is the number of edges of the graph, and α(x) is the inverse of Ackermann's function [14].
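A minimal C++ disjoint-set forest with path compression, in the textbook style of [14], is sketched below; the class and its method names are ours, not CLiDE's.

    #include <numeric>
    #include <vector>

    // Disjoint-set forest with path compression [14]: find() re-parents every
    // node on the searched path directly under the root, so later searches for
    // the same nodes are almost constant time.
    class UnionFind {
        std::vector<int> parent;
    public:
        explicit UnionFind(int n) : parent(n) {
            std::iota(parent.begin(), parent.end(), 0);   // every vertex is its own root
        }
        int find(int x) {
            if (parent[x] != x)
                parent[x] = find(parent[x]);   // path compression
            return parent[x];
        }
        bool unite(int a, int b) {             // returns false if already in one component
            a = find(a); b = find(b);
            if (a == b) return false;
            parent[a] = b;
            return true;
        }
    };

Inside Kruskal's algorithm, an edge (u, v) is accepted exactly when unite(u, v) returns true, i.e., when it joins two previously separate components.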
Fig. 4. Overlapping lines are not merged together.

The next problem is to reduce the number of edges in the graph of the page, as the calculation of all the edges still requires O(n²) operations, where n = |V|, i.e., n is the number of connected components (i.e., vertices). We propose to calculate only the edges that are probably necessary instead of all of them. Calculating all of the distances is unnecessary, because it is known that components that have no vertical intersection are not in the same word (with the exception of 'i', 'j', etc., but these can be merged with the bulk of the word in an additional loop). Also, components that are placed horizontally at a greater distance from each other than a certain proportion of their height are not adjacent letters of a word, or are not in the same word at all. If the connected components are initially sorted by their y-coordinates, the calculation of the necessary distances can be done with a linear algorithm. After a layer has been extracted (e.g., the words have been found), the graph may be regenerated on the new elements, whose cardinality is much smaller than that of the previous layer. Hence, for the detection of words the vertices of the graph are the connected components, for the detection of lines the vertices are the words, for the detection of blocks the vertices are the lines, etc.

It is especially important to compute only the probably necessary edges in the case of the first two levels (words and lines), where the number of vertices of the graph is much larger than in the upper levels. Assuming that a document has only horizontal text lines, the unnecessary edges for the detection of the words (lines) are the following (a code sketch follows this list):

• edges between characters (words) that have no vertical intersection, or whose vertical intersection is smaller than some threshold (30% (50%) of their height for the words (lines));
• edges between characters (words) that are placed horizontally at a greater distance from each other than some threshold (20% (150%) of their height for the words (lines));
• edges between characters (words) that are separated by a graphic line (a vertical or horizontal page separator).
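The pruning rules above can be sketched in C++ as follows. The Comp record, the function name, and the wired-in word-detection thresholds are our own illustration; for simplicity the sketch sorts the components by their y_min coordinate, which makes the early exit of the inner loop exact, and it omits the separator test of the third rule.

    #include <algorithm>
    #include <utility>
    #include <vector>

    struct Comp { int leftx, rightx, ymin, ymax; };   // enclosing box of a connected component

    // Length of the overlap of the two y-intervals (footnote 2); negative if disjoint.
    static int vertIntersection(const Comp& a, const Comp& b) {
        return std::min(a.ymax, b.ymax) - std::max(a.ymin, b.ymin);
    }

    // Generate only the probably necessary edges for word detection: rule 1 drops
    // pairs whose vertical intersection is below 30% of the height, and rule 2
    // drops pairs that lie horizontally further apart than 20% of the height.
    std::vector<std::pair<int, int>> candidateEdges(std::vector<Comp> cc) {
        std::sort(cc.begin(), cc.end(),
                  [](const Comp& a, const Comp& b) { return a.ymin < b.ymin; });
        std::vector<std::pair<int, int>> edges;   // index pairs into the sorted order
        for (std::size_t i = 0; i < cc.size(); ++i) {
            int h = cc[i].ymax - cc[i].ymin;
            for (std::size_t j = i + 1; j < cc.size(); ++j) {
                if (cc[j].ymin > cc[i].ymax) break;   // sorted by ymin: no later j overlaps i
                if (vertIntersection(cc[i], cc[j]) < 0.3 * h) continue;    // rule 1
                int hgap = std::max(cc[i].leftx, cc[j].leftx)
                         - std::min(cc[i].rightx, cc[j].rightx);
                if (hgap > 0.2 * h) continue;                              // rule 2
                edges.emplace_back(static_cast<int>(i), static_cast<int>(j));
            }
        }
        return edges;
    }

Because each component only ever pairs with the few neighbours that pass the strict vertical test, the number of generated edges grows linearly with the number of components in practice.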
If the components are initially sorted by the y_max value of their enclosing boxes and the above conditions, with the strict thresholds, are applied, then the distance calculation, i.e., the calculation of the edges of the graph, can be performed in O(n) time and space. Thus both phases are linear: the detection of the words with respect to the number of connected components, and the detection of the lines with respect to the number of words.

In contrast to the case of the first two layers, the computational complexity of the other phases of layout detection (the detection of the blocks, the columns, and the stripes) does not greatly influence the total processing time, as the number of vertices of the graph has been drastically reduced. The distance between every pair of components is calculated, and the pure
4 RESULTS

The described document layout analysis method has been implemented in C++ and compiled on a Sun 4/25 workstation (21 MHz) and on an SGI Indy (R4600 PC, 100 MHz). It has been tested on 98 document pages, covering books (31 pages), reports (25 pages), and journals (37 pages, from eight different chemistry journals). The pages considered included chemical reaction drawings with text insets, title pages with blocks of variable font sizes, double-column pages with full-page-width figures, etc. The algorithm has performed at an average speed of 2,000 connected components/sec of CPU time on the Sun (8,000 CC/sec of CPU time on the SGI), which means a processing time of about two seconds for a dense journal page image with 4,000 connected components on the Sun (approx. 0.5 sec on the SGI). Detailed numerical results of the layout analysis, the cardinality of the found segments, and the required processing time on the Sun are given for some of the test images in Table 1.

The accuracy of the method has proved to be good. The 98 test images were analysed with an error rate of 1% at the block level (there were approx. 2,000 blocks in total, with only 20 segmentation and/or classification errors being made); however, these errors occurred in seven different images, meaning that 7.14% of the images required manual correction. The segmentation errors were mainly due to the accidental proximity of different types of objects (e.g., a graphic region being too close to a label region), while the classification errors always occurred on small graphics containing only small dashes and letters. Fig. 5 illustrates the results of the block detection phase for one of the test images (JOC3 in Table 1).

Fig. 5. (a) The original image (JOC, vol. 59, no. 9, page 2,328); (b) the found text and graphic (bold frames) blocks.

APPENDIX

Abbreviations of the journal names used in Fig. 5 and Table 1:

JACS  Journal of the American Chemical Society
JOC   Journal of Organic Chemistry
JMC   Journal of Medicinal Chemistry

ACKNOWLEDGMENTS

We are thankful to Valerie Gillet and Zsolt Zsoldos for helping us to structure this article, and to Chemical Abstracts Service for providing us with the test images.
REFERENCES
[1] P. Ibison, F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel, and
A.P. Johnson, “Chemical Structure Recognition and Generic Text
Interpretation in the CLiDE Project,” Proc. Online Information 92,
London, England, 1992.
[2] F. Kam, R.W. Simpson, C. Tonnelier, T. Venczel, and A.P. John-
son, “Chemical Literature Data Extraction: Bond Crossing in Sin-
gle and Multiple Structures,” Proc. Int’l Chemical Information Conf.,
Annecy, France, 1992.
[3] P. Ibison, M. Jacquot, F. Kam, A.G. Neville, R.W. Simpson, C.
Tonnelier, T. Venczel, and A.P. Johnson, “Chemical Literature
Data Extraction: The CLiDE Project,” J. Chem. Inf. Comput. Sci.,
vol. 33, no. 3, pp. 338-344, 1993.
[4] Y.Y. Tang, C.D. Yan, and C.Y. Suen, “Document Processing for
Automatic Knowledge Acquisition,” IEEE Trans. Knowledge and
Data Engineering, vol. 6, no. 1, pp. 3-21, 1994.
[5] M. Krishnamoorthy, G. Nagy, S. Seth, and M. Viswanathan,
“Syntactic Segmentation and Labeling of Digitized Pages from
Technical Journals,” IEEE Trans. Pattern Analysis and Machine In-
telligence, vol. 15, no. 7, pp. 737-747, 1993.
[6] T. Pavlidis and J. Zhou, “Page Segmentation and Classification,”
CVGIP: Graphical Models and Image Processing, vol. 54, no. 6, pp.
484-496, 1992.
[7] D.J. Ittner and H.S. Baird, “Language-Free Layout Analysis,” Proc. Second Int’l Conf. Document Analysis and Recognition (ICDAR ’93), pp. 336-340, Japan, Oct. 1993.
[8] L. O’Gorman, “The Document Spectrum for Page Layout Analy-
sis,” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 15,
no. 11, pp. 1,162-1,173, 1993.
[9] S. Tsujimoto and H. Asada, “Major Components of a Complete
Text Reading System,” Proc. IEEE, vol. 80, no. 7, pp. 1,133-1,149,
1992.
[10] T. Saitoh, T. Yamaai, and M. Tachikawa, “Document Image Seg-
mentation and Layout Analysis,” IEICE Trans. Information and Sys-
tems, vol. 77, no. 7, pp. 778-784, 1994.
[11] K.Y. Wong, R.G. Casey, and F.M. Wahl, “Document Analysis System,” IBM J. Research and Development, vol. 26, no. 6, pp. 647-656, 1982.
[12] K.-C. Fan, C.-H. Liu, and Y.-K. Wang, “Segmentation and Classification of Mixed Text/Graphics/Image Documents,” Pattern Recognition Letters, vol. 15, no. 12, pp. 1,201-1,209, 1994.
[13] S.L. Taylor, D.A. Dahl, M. Lipshutz, C. Weir, L.M. Norton, R.W. Nilson, and M.C. Linebarger, “Integrating Natural Language Understanding with Document Structure Analysis,” Artificial Intelligence Rev., vol. 8, no. 2, pp. 255-276, 1994.
[14] A.V. Aho, J.E. Hopcroft, and J.D. Ullman, Data Structures and
Algorithms. Reading, Mass.: Addison-Wesley Publishing Com-
pany, 1983.