A Machine-Learning Approach For Analyzing Document
Pattern Recognition
journal homepage: www.elsevier.com/locate/pr
ARTICLE INFO

Article history:
Received 16 April 2007
Received in revised form 3 March 2008
Accepted 12 March 2008

Keywords: Binary decision; Document layout analysis; Reading order; Support vector machine; Taboo box; Textline; Text region

ABSTRACT

The purpose of document layout analysis is to locate textlines and text regions in document images, mostly via a series of split-or-merge operations. Before applying such an operation, however, it is necessary to examine the context to decide whether the place chosen for the operation is appropriate. We thus view document layout analysis as a matter of solving a series of binary decision problems, such as whether to apply, or not to apply, a split-or-merge operation to a chosen place. To solve these problems, we use support vector machines to learn whether or not to apply the previously mentioned operations from training documents in which all textlines and text regions have been located and their identities labeled. The proposed approach is very effective for analyzing documents that allow both horizontal and vertical reading orders. When applied to a test data set composed of eight types of layout structure, the approach's accuracy rates for identifying textlines and text regions are 98.83% and 96.72%, respectively.

© 2008 Elsevier Ltd. All rights reserved.
1. Introduction top-down approach, one example is the recursive X--Y cut method [1]
that relies on projection profiles to cut a textual region into several
Document layout analysis involves operations that divide a docu- sub-regions; however, it may fail in text regions that lack a fully
ment into textlines composed of homogeneous characters, and into extended horizontal or vertical cut. Similarly, methods that exploit
text regions composed of homogeneous textlines. Analyzing the lay- maximal white rectangles [2] or white streams [3] may also fail to
out structure of some documents, such as Chinese and Japanese doc- find large enough white margins. For this reason, Lee and Ryu [4]
uments, is particularly challenging because they allow two reading proposed a multi-scale analysis method that examines a document in
orders. The reading order is defined by the order of characters in a various scales. All of the above methods were designed for Western
textline. In Western documents, the reading order is always hori- documents only.
zontal, while in Chinese and Japanese documents, it can be horizon- In the bottom-up approach, examples are the document spectrum
tal or vertical. Sometimes, such documents may contain both types method [5], the minimal-cost spanning tree method [6], and the
of textlines. In these cases, the complexity of layout analysis is of- component-based algorithm [7]. These methods construct textlines
ten increased, since it can be conducted in two possible directions, based on the distances between connected components (referred to
whereas in documents with a single reading order there is no such as components hereafter). Many bottom-up methods were also de-
freedom. signed for documents with a single reading order. Xi et al. [8], on the
other hand, applied the spanning tree method to documents with
two reading orders. They used the spanning tree as a pre-classifier
1.1. Background to gather components into sub-graphs. In this approach, one cru-
cial step involves cutting away a vertical sub-graph that has been
The various layout analysis methods proposed thus far can be wrongly merged into a horizontal textline, or vice versa. Relying on
categorized as top-down, bottom-up, or hybrid approaches. In the some heuristics to solve this problem, the authors achieved an ac-
curacy rate of 87.2% in the analysis of 25 documents. Chen et al. [9]
developed a rule-based bottom-up method for documents with two
∗ Corresponding author. Tel.: +886 02 2788 3799; fax: +886 02 2782 4814. reading orders. The rules for merging components into textlines are
E-mail addresses: [email protected] (C.-C. Wu), based on the concept of "nearest-neighbor connect-strength'', which
[email protected] (C.-H. Chou), [email protected] (F. Chang). varies according to the size similarity, distance, and offset of the
components. The accuracy rate reported for this approach was 83.2% step removes textlines grown under a false reading-order assump-
using an un-specified number of documents. tion, and the second eliminates overlap between the remaining
Okamoto and Takahashi [10] proposed a hybrid segmentation textlines.
method for documents with two reading orders. It partitions a doc- The advantage of dividing the problem into several sub-problems
ument into blocks based on field separators and white streams, and is that they are simple and can be solved one at a time. If we tried to
then merges the components of each block into textlines. Unfortu- solve all the sub-problems simultaneously, we would have to deal
nately, the authors did not specify the method's parameters or detail with mutually conflicting examples, which would complicate the
its performance. learning process. Thus, when we grow textlines initially, we do not
The methods of Liu et al. [11] and Chang et al. [12] are adaptive in worry whether they are under-extended, over-extended, or whether
the sense that split-or-merge operations are performed using esti- they should be there at all. These issues can be dealt with at a later
mated parameter values. Liu et al.'s approach applies the operations stage when we have more information.
in both top-down and bottom-up directions, whereas Chang et al.'s All our sub-problems involve a decision about whether to per-
method only conducts split operations when low-level objects have form a certain split-or-merge operation. For this reason, we employ
been merged into high-level ones. Although the latter method did support vector machines (SVM) [18,19] to construct the decision
not have a unified approach for estimating its parameters, it achieved function. In the learning phase, we prepare both positive and nega-
a rather good performance on its data set. A performance report was tive examples. The former are examples to which we apply a certain
not provided for the former method. operation P, and the latter are examples to which we do not apply
Since all methods split or merge objects based on certain param- P. The learning procedure produces a decision function, or classifier,
eters, parameter estimation is crucial in layout analysis. Machine- which is used in the layout analysis to decide whether or not to
learning techniques can be especially useful in this regard. Even so, conduct P.
very few of the methods proposed thus far have adopted a learn-
ing approach. Learning methods have been used to separate tex- 1.3. The steps in our procedure
tual areas from graphical areas [13], and to classify text regions
as headline, main text, etc. [14,15]. In surveys of learning methods Our procedure for layout analysis involves four steps, which we
used in document image analysis, Marinai et al. [16] consider neural describe briefly below.
network methods for document analysis and recognition, while Liu In Step I, we form both horizontal and vertical textlines out of
et al. [17] consider learning methods for handwritten digit recog- certain components. To avoid forming many textlines that are es-
nition. To the best of our knowledge, the only work that applies a sentially the same, we impose restrictions on the growth of new
machine-learning technique to layout analysis is that of Laven et textlines. Following this operation, we apply another operation to
al. [15]. They derived a logistic regression classifier from a train- find similar textlines and merge them.
ing data set and used it to analyze journal formats. The error rate In Step II, we remove textlines with false reading orders. We do
was 56% for one test data set and 25.7% for another test data set so by first grouping those textlines that are compatible in terms of
(cf. [15, Table 2]). size, but differ in reading order. Then, based on certain features that
are extracted from each group, we identify and remove the textlines
1.2. Our contribution in each group that were grown according to a wrong reading-order
assumption.
In this paper, our objective is to provide a method for analyz- In Step III, we remove areas of overlap between the remaining
ing documents composed of two reading orders. To resolve such a textlines. Two types of textline overlap are possible: orthogonal and
hard problem, we feel that we need to go beyond the rule-based parallel. For both types, we have to assign the overlapping zone to
approach and take advantage of a more structured procedure that one textline only.
can learn from examples. Of course, learning by itself does not In Step IV, we form text regions out of related textlines. Then,
guarantee a good solution. The success of learning depends on in another step, we identify the multiple column structure in the
a good mechanism and also on the problem it has to solve. We related textlines.
divide our problem into several sub-problems, which arise natu- The flowchart for the above procedure is shown in Fig. 1.
rally when we employ a bottom-up procedure. At the beginning The remainder of this paper is organized as follows. Sections 2,
of the procedure, we grow components into textlines; and at the 3, 4, and 5 are devoted the four steps, respectively. Section 6 con-
end, we grow textlines into text regions. Two intermediate steps tains the experimental results. Finally, in Section 7, we present our
are also necessary to resolve conflicts between textlines. The first conclusions.
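The binary-decision formulation above can be sketched in a few lines. The sketch below uses a simple perceptron as a stand-in for the SVM learner (only the interface matters here: a signed decision value whose non-negative side means "apply operation P"); the one-dimensional "similarity" features and the training values are synthetic, not the paper's actual features:

```python
def train_decision_function(X, y, lr=0.1, epochs=200):
    """Train a linear decision function f(x) = w.x + b from positive (+1) and
    negative (-1) examples. A perceptron stands in for the SVM used in the
    paper; the sign convention of the decision value is the same."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        updated = False
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified example: nudge the boundary
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
                updated = True
        if not updated:          # converged on separable data
            break
    return lambda x: sum(wj * xj for wj, xj in zip(w, x)) + b

# Synthetic 1-D "similarity" features: apply operation P (+1) when the value
# is high, do not apply it (-1) when the value is low.
X = [[0.95], [0.90], [0.88], [0.92], [0.15], [0.20], [0.10], [0.25]]
y = [1, 1, 1, 1, -1, -1, -1, -1]
decide = train_decision_function(X, y)
print(decide([0.93]) >= 0)  # a clearly "apply" case
print(decide([0.12]) >= 0)  # a clearly "do not apply" case
```

In the paper's setting, each sub-problem gets its own decision function of this shape, trained on feature vectors extracted from the labeled documents.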
3202 C.-C. Wu et al. / Pattern Recognition 41 (2008) 3200 -- 3213
Our experimental results show that when we set this parameter to 0.8, the final accuracy rate of the layout analysis (expressed as an F1 score, which we define in Section 6) is the same as when we set it to 1; however, we only spend 1/4 of the time forming textlines. For this reason, we set its value to 0.8. Note that the restriction on forming textlines is designed to save computing time; it does not eliminate all duplicated textlines.

Our second operation merges two textlines if they are judged as essentially the same; that is, if they overlap, have the same reading order, and they are similar. To decide whether two textlines are similar, we again employ a learning procedure. We examine any two labeled textlines, A and B, in the training data set that belong to the same labeled text region, and extract the following feature from them:

4. min(A_height, B_height) / max(A_height, B_height).

We then use an SVM learning procedure to train a binary decision function, whose non-negative values indicate that two textlines are similar, and whose negative values indicate that they are dissimilar.

3. Step II: removal of textlines with false reading orders

Since we are dealing with documents that have two reading orders, we must form textlines in both the horizontal and the vertical directions. Some of these textlines may be spurious due to the false assumption about their reading order; however, we can only identify them after we have completed the formation process. In Section 3.1, we explain how to detect and remove such textlines. After this step, we may find that some of the remaining textlines overlap. We discuss the solution to this problem in Section 3.2.

3.1. Finding compatible textlines

Two textlines are said to have conflicting reading orders if they intersect and are compatible in terms of size, but their reading orders are different. Recall that, in Section 2, we constructed a decision function to decide whether two textlines are similar. The function can also be used here to decide whether two textlines are compatible. However, since we are now dealing with textlines composed of opposite reading orders, we have to modify the involved feature accordingly. Suppose we are given a horizontal textline A and a vertical textline B; the feature we extract from them is

5. min(A_height, B_width) / max(A_height, B_width).

Next, we need to put all textlines into disjoint groups. We start by adding an arbitrary textline to a group G, and then recursively add textlines whose reading orders conflict with some members of G. When this is done, we form another group, and so on until all textlines have been partitioned into disjoint groups. Note that the above operation will always result in the same disjoint groups, irrespective of the choice of the first textline. Then, we decide the valid reading order for each group. The following feature is very useful for this purpose.

Given a group of textlines obtained in the above manner, we examine the components that comprise each textline in the group. We want each component to "guess", from its own viewpoint, whether it belongs to a horizontal or a vertical textline. If it believes that it belongs to a horizontal textline, we mark taboo boxes in the vertical direction (i.e., above and below the component); otherwise, we mark taboo boxes in the horizontal direction (i.e., on the left and right of the component). In the special case where a component starts or ends a textline, we mark one more taboo box before or after the textline, respectively.

3.1.1. Marking taboo boxes

Figs. 4 and 5 show how we mark taboo boxes. The left-hand panel of each figure shows a document with an area highlighted in green, while the right-hand panel shows the document with the green area expanded. For a component C in the right-hand panel, let S_L, S_R, S_A, and S_B be the distances between C and the boxes located in the areas to the left, right, above and below C, respectively. If no box is located on one side of C, say the left side, then S_L is the distance between C and the left margin of the document.

The rules for marking taboo boxes are as follows:

- If min(S_L, S_R) > max(S_A, S_B), then C probably belongs to a vertical textline, and we draw taboo boxes on the left and the right of C (Fig. 4).
- If min(S_A, S_B) > max(S_L, S_R), then C probably belongs to a horizontal textline, and we draw taboo boxes above and below C.
- If S_L > max(S_A, S_B) and min(S_A, S_B) > S_R, then C probably starts a horizontal textline, and we draw taboo boxes above, below, and on the left of C.
- If S_R > max(S_A, S_B) and min(S_A, S_B) > S_L, then C probably ends a horizontal textline, and we draw taboo boxes above, below, and on the right of C.
- If S_A > max(S_L, S_R) and min(S_L, S_R) > S_B, then C probably starts a vertical textline, and we draw taboo boxes above, on the left, and on the right of C (Fig. 5).
- If S_B > max(S_L, S_R) and min(S_L, S_R) > S_A, then C probably ends a vertical textline, and we draw taboo boxes below, on the left, and on the right of C.

Fig. 6a shows all the horizontal taboo boxes (i.e., the boxes on the right and left of components) in the document in Fig. 5, while Fig. 6b shows all the vertical taboo boxes (i.e., the boxes above and below the components) in the same document. Recall that we partition the set of textlines into disjoint groups. Let GP be one such group. We determine the majority reading order of GP as follows. Let B_GP be the box that encloses GP, V_taboo be the total area of the vertical taboo boxes that overlap with B_GP, and H_taboo be the total area of horizontal taboo boxes that overlap with B_GP. If V_taboo > H_taboo, the majority reading order of GP is horizontal; otherwise, it is vertical. We call textlines with the majority reading order major textlines; other textlines are called minor textlines.

3.2. Removing false reading orders

In most cases, the majority reading order of GP is the reading order of the textlines deemed to be genuine. Thus, we should only retain the major textlines in GP and remove the minor ones. However, this solution may cause a problem if two text regions are very close together, as shown in Fig. 7a. When textlines are formed and grouped, we end up with a single group because a textline, T, over-extends from one region to the other so that all textlines are pulled into the same group. In this case, the majority reading order is vertical. After removing all horizontal textlines, we find a few components that do not belong to any textlines (Fig. 7b). We call these components orphans.

To resolve the problem caused by orphans, we adopt the following remedy. Whenever the removal of minor textlines causes the occurrence of orphans, we retain those textlines (Fig. 7c), and look for possible points to break all the textlines into two groups. The cutting points must lie on the retained textlines or on the textlines that intersect with them. Since these points are located in gaps between consecutive components, we examine all the gaps in the textlines.

After performing the cut operation, we examine all the textlines whose reading orders still conflict with those of other textlines. Let U denote such a textline. If the removal of U would not create any orphans, we remove it; otherwise, we retain U.
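The six marking rules can be written down directly. The sketch below takes the four distances S_L, S_R, S_A, S_B as inputs and returns the sides of C on which taboo boxes would be drawn; the set-valued return and the fall-through case for ambiguous components are our own illustrative choices, not specified in the paper:

```python
def taboo_sides(SL, SR, SA, SB):
    """Return the sides of component C on which taboo boxes are drawn,
    following the six marking rules (L=left, R=right, A=above, B=below)."""
    if min(SL, SR) > max(SA, SB):
        return {"L", "R"}        # C probably lies in a vertical textline
    if min(SA, SB) > max(SL, SR):
        return {"A", "B"}        # C probably lies in a horizontal textline
    if SL > max(SA, SB) and min(SA, SB) > SR:
        return {"A", "B", "L"}   # C probably starts a horizontal textline
    if SR > max(SA, SB) and min(SA, SB) > SL:
        return {"A", "B", "R"}   # C probably ends a horizontal textline
    if SA > max(SL, SR) and min(SL, SR) > SB:
        return {"A", "L", "R"}   # C probably starts a vertical textline
    if SB > max(SL, SR) and min(SL, SR) > SA:
        return {"B", "L", "R"}   # C probably ends a vertical textline
    return set()                 # no rule fires: component stays unmarked

# A component whose nearest neighbors are above and below it is judged to
# belong to a vertical textline, so taboo boxes go on its left and right:
print(taboo_sides(SL=40, SR=35, SA=5, SB=6))
```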
Fig. 4. The green area in (a) is expanded in (b). Taboo boxes (red) are drawn on the left and right of the component (yellow).
Fig. 5. The green area in (a) is expanded in (b). Taboo boxes are drawn above, on the left, and on the right of the component (yellow).
Fig. 6. (a) All horizontal taboo boxes are highlighted in yellow; (b) all vertical taboo boxes are highlighted in brown.
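The majority-reading-order test reduces to comparing two accumulated overlap areas. A minimal sketch, assuming boxes are represented as (x0, y0, x1, y1) tuples (a representation we choose for illustration):

```python
def overlap_area(box1, box2):
    """Area of intersection of two axis-aligned boxes (x0, y0, x1, y1)."""
    w = min(box1[2], box2[2]) - max(box1[0], box2[0])
    h = min(box1[3], box2[3]) - max(box1[1], box2[1])
    return max(0, w) * max(0, h)

def majority_reading_order(b_gp, v_taboo_boxes, h_taboo_boxes):
    """Decide the majority reading order of a group GP from the total areas
    of vertical (V_taboo) and horizontal (H_taboo) taboo boxes overlapping
    its enclosing box B_GP. Vertical taboo boxes are marked by components
    that believe they lie in horizontal textlines, hence the comparison."""
    v_taboo = sum(overlap_area(b_gp, t) for t in v_taboo_boxes)
    h_taboo = sum(overlap_area(b_gp, t) for t in h_taboo_boxes)
    return "horizontal" if v_taboo > h_taboo else "vertical"

# More vertical taboo area than horizontal -> the group reads horizontally.
print(majority_reading_order((0, 0, 10, 10), [(0, 0, 4, 4)], [(0, 0, 2, 2)]))
```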
In the above example, making a cut at an appropriate gap in a textline T (the red rectangle in Fig. 7c) separates the upper and lower text regions. All the vertical textlines in the region above the cut can now be removed, since doing so does not create any orphans. However, removal of the horizontal textlines in that region would create orphans; therefore, we retain those textlines, as shown in Fig. 7d, which is the solution we desire.

To ensure that cuts are made at the appropriate gaps, we use an SVM learning procedure to train a decision function, which decides whether or not to perform a cut at a particular gap. To prepare
Fig. 7. (a) Textlines with conflicting reading orders. (b) When all minor textlines are removed, certain components become orphans. (c) The result of retaining the minor textlines whose removal would create orphans. The textline T is over-extended and should be cut at the red rectangle. (d) The result of the cut operation and removal of the textlines that do not contain orphans.
Fig. 8. T is an over-extended textline; T_width is the width of T, and G is the height of the gap marked by the red rectangle. MWR is a rectangle whose height is G and whose width is five times T_width; its central line coincides with that of the gap. LWR is a similar rectangle whose right margin coincides with the central line of the gap; and RWR is another similar rectangle whose left margin coincides with the central line of the gap.
positive and negative examples, we first form textlines for the documents in the training data set by using the method described in Section 2. Note that these textlines are not the same as the labeled textlines in the data set, since some of them are over-extended. We then examine all the gaps between consecutive components in the textlines thus formed. If a gap does not fall within any labeled text
6. G / T_width,
7. R_l = |LWR ∩ (white pixels)| / |LWR|,
8. R_m = |MWR ∩ (white pixels)| / |MWR|,
9. R_r = |RWR ∩ (white pixels)| / |RWR|,
10. max(R_l, R_m, R_r),
11. R_tl = |LWR ∩ (taboo boxes)| / |LWR|,
12. R_tm = |MWR ∩ (taboo boxes)| / |MWR|,
13. R_tr = |RWR ∩ (taboo boxes)| / |RWR|,
14. max(R_tl, R_tm, R_tr),
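Features 7–10 are ratios of white pixels within the three windows LWR, MWR, and RWR; features 11–14 are computed the same way over a taboo-box mask instead of the white-pixel mask. A minimal sketch, assuming the image is given as a list of rows in which 1 marks a white (background) pixel:

```python
def white_ratio(image, box):
    """R = |box ∩ (white pixels)| / |box| for a binary image given as a list
    of rows, where 1 marks a white pixel. The box is (x0, y0, x1, y1) with
    exclusive upper bounds; pixel-grid coordinates are assumed."""
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    white = sum(image[y][x] for y in range(y0, y1) for x in range(x0, x1))
    return white / area

# A toy 4x4 image: the top two rows are white, the bottom two are black.
img = [[1, 1, 1, 1],
       [1, 1, 1, 1],
       [0, 0, 0, 0],
       [0, 0, 0, 0]]
Rl = white_ratio(img, (0, 0, 2, 4))  # left window: half white
Rm = white_ratio(img, (1, 0, 3, 4))  # middle window: half white
Rr = white_ratio(img, (2, 0, 4, 4))  # right window: half white
print(max(Rl, Rm, Rr))               # feature 10
```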
4. Step III: elimination of textline overlaps

Before forming text regions out of textlines, we must ensure that no overlaps exist between the textlines. The overlaps caused by textlines with false reading orders are removed by the method described in the previous section. Here, we deal with the remaining overlaps, which can be divided into two types: orthogonal overlaps (i.e., overlaps between textlines with opposite reading orders), and parallel overlaps (i.e., overlaps between textlines with the same reading order). In both cases, the overlapping zone must be assigned to one textline only.

Fig. 9. The intersection of V and H results in the overlapping zone OVLP; GAP1, GAP2, GAP3, and GAP4 are the gaps between OVLP and the components above it, on the right of it, below it, and on the left of it, respectively.

54. max(A_width, B_width) / min(A_width, B_width),
55–63. Nine features extracted from the GAP and its neighborhood. These features are also similar to features 6–14.
5. Step IV: formation of text regions

Fig. 12. (a) VD_AB is the vertical distance between textlines A and B.

Fig. 13. (a) D_AB is the distance between column A and column B, and RG denotes the region. (b) The vertical projection profile of RG.
B be the right-hand part, and D_AB be the distance between A and B. We then compute the following three features:

72. D_AB / STD_height,
73. min(A_width, B_width) / STD_height,
74. RG_height / STD_height,

where STD_height is the standard height of textlines in RG, defined as follows: let S = {T_height : T is a textline in RG}. We let each mem-

Table 1
The eight types of layout structure and the number of documents in the training data set for each type of structure, where 'H' stands for horizontal and 'V' for vertical

Type of layout structure                                        Number of samples
1. H-headlines, H-textlines, and rectangle-shaped contents      192
2. H-headlines, H-textlines, and L-shaped contents               67
3. H-headlines, V-textlines, and rectangle-shaped contents       62
4. H-headlines, V-textlines, and L-shaped contents              188
5. V-headlines, V-textlines, and rectangle-shaped contents      182
6. V-headlines, V-textlines, and L-shaped contents              115
7. Mixture of text and pictures                                 196
8. Official documents of the R.O.C.                              50
Total number of training documents                             1052
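Features 72–74 are straightforward ratios once STD_height is known. A minimal sketch that takes STD_height as an input, since its full definition is cut off in this excerpt:

```python
def column_features(d_ab, a_width, b_width, rg_height, std_height):
    """Features 72-74 for the multiple-column decision. std_height is the
    'standard height' of textlines in RG; the paper derives it from the set
    of textline heights in RG, so it is passed in here as a given value."""
    return (d_ab / std_height,                    # feature 72
            min(a_width, b_width) / std_height,   # feature 73
            rg_height / std_height)               # feature 74

# Hypothetical measurements (in pixels) for two candidate columns A and B:
print(column_features(d_ab=12.0, a_width=200.0, b_width=180.0,
                      rg_height=480.0, std_height=24.0))  # -> (0.5, 7.5, 20.0)
```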
Fig. 14. The eight types of document layout listed in Table 1 are shown in (a)--(h), respectively.
(III) elimination of textline overlaps (Section 4); and (IV) formation of text regions (Section 5).

Table 2
The performance of SVM decision functions on all sub-problems

Step   Sub-problem                                          Involved features   Cross-validation accuracy rate (%)
I      Forming textlines                                    1–3                 99.99
       Finding similar textlines                            4                   99.98
II     Finding compatible textlines                         5                   Not applied^a
       Removing false reading orders                        6–14                98.20
III    Eliminating orthogonal textline overlaps             15–53               98.30
       Eliminating parallel textline overlaps               54–63               96.40
IV     Finding related textlines in orthogonal directions   64–68               99.76
       Finding related textlines in parallel directions     69–71               99.83
       Formation of multiple columns                        72–74               99.99

^a This sub-problem uses the decision function that was trained for the previous sub-problem.

To evaluate the end-to-end performance of our layout-analysis solution, we prepared 25 additional documents and their ground-truth files for each of the eight types of documents in Table 1; a total of 200 test documents. Before testing, we trained SVM decision functions on all the feature vectors extracted from the training data set, instead of four-fifths of them as before. We then applied the decision functions to the test documents and registered, in the output files, the textlines and text regions they found. To evaluate the test results, we compared the output files with the ground-truth files as follows.

A bounding box B_output in an output file is said to match a box B_ground-truth in the corresponding ground-truth file if both the x- and y-coordinates of their corner points differ by less than 5 pixels (cf. Section 3.3 in [15]). We define I(B_ground-truth, B_output) to be B_ground-truth ∩ B_output if the two boxes match, or a null set if they do not. Next, we define the following measures:

Recall rate = I(B_ground-truth, B_output) / B_ground-truth

Precision rate = I(B_ground-truth, B_output) / B_output
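The matching test and the measure I(B_ground-truth, B_output) can be sketched as follows, with boxes as (x0, y0, x1, y1) tuples and areas in pixels. The per-document rates accumulate these quantities over all boxes; the sketch shows a single pair:

```python
def boxes_match(b_truth, b_out, tol=5):
    """Two boxes (x0, y0, x1, y1) match if every corner coordinate differs
    by less than `tol` pixels, as in the evaluation protocol."""
    return all(abs(t - o) < tol for t, o in zip(b_truth, b_out))

def area(box):
    return max(0, box[2] - box[0]) * max(0, box[3] - box[1])

def intersection_area(b1, b2):
    w = min(b1[2], b2[2]) - max(b1[0], b2[0])
    h = min(b1[3], b2[3]) - max(b1[1], b2[1])
    return max(0, w) * max(0, h)

def match_area(b_truth, b_out):
    """I(B_ground-truth, B_output): the intersection area if the boxes
    match, otherwise zero (the 'null set')."""
    return intersection_area(b_truth, b_out) if boxes_match(b_truth, b_out) else 0

# One matched pair: recall = I / area(truth), precision = I / area(output).
truth = (10, 10, 110, 40)
out = (12, 9, 108, 41)
i = match_area(truth, out)
print(i / area(truth), i / area(out))
```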
Table 3
Recall rates, precision rates, and F1 scores for textlines and text regions

Type of object   Recall rate (%)   Precision rate (%)   F1 (%)
Textline         98.96             98.71                98.83
Text region      96.91             96.54                96.72

Table 6
The results of the compared layout analysis approaches

Approach                Type of object   Recall rate (%)   Precision rate (%)   F1 (%)
The learning approach   Textline         98.96             98.71                98.83
                        Text region      96.91             96.54                96.72

Step   Recall rate (%)   Precision rate (%)   F1 (%)
I      34.41              7.13                11.8
II     38.45             36.09                37.23
III    92.27             90.01                91.12
IV     98.96             98.71                98.83

200 test documents. The results show that even though the learning approach does not lead by much in textlines, it makes significant headway in text regions. The two examples shown in Figs. 15 and 16 provide the reason: layout errors may only affect a few textlines, but they can damage very large text regions, thereby decreasing the accuracy rates significantly in text regions.
Fig. 15. Left panel: an error (located in the green area) produced by the rule-based approach; right panel: the correct result obtained by the learning approach.
Fig. 16. Upper panel: an error (located in the green area) produced by the rule-based approach; lower panel: the correct result obtained by the learning approach.
less important the feature is deemed to be. This idea was adopted in the context of SVM by Guyon et al. [24] and Rakotomamonjy [25]. In Table 7, we present the merit of each feature to the corresponding SVM problem. The features are ordered according to their merits.

6.4. Unsolved problems

We conclude this section by considering some examples that our method could not analyze correctly. In Fig. 17, we have two horizontal headlines. Our mistake is caused by the lower headline, denoted as L. The green area contains three Arabic digits that are much higher than the other characters. We face an ambiguous situation here: L is not an ordinary textline because its characters do not have the same height. Furthermore, it cannot be divided into two textlines, since no clues (e.g., sufficiently large gaps and/or taboo boxes) exist between the digits and their neighboring characters. Our method combines L with the upper headline to form a single textline.

Fig. 18 shows another ambiguous situation, where a horizontal textline, referred to as K, incorporates two smaller characters, located in the green area. These characters must be read vertically; thus, K is not an ordinary textline, since it contains two types of characters. It cannot be divided into two textlines either, because no clues exist between the two parts. Our method again merges the two textlines to form a single textline.

In Fig. 19, the green area contains a watermark, which interferes with our textline-formation process. However, our method cannot remove a graphical object that overlaps with textual objects.

In the last example (Fig. 20a), the text region consists of only two vertical textlines. Unfortunately, the total area of the vertical taboo boxes exceeds that of the horizontal taboo boxes (Figs. 20b and c), because some gaps within the vertical textlines are larger than the gaps between the textlines. In this instance, our method judges that the majority reading order of the region is horizontal. Because of this wrong decision, all subsequent operations are inappropriate,
Table 7
The 74 features, referred to by their indices, are ordered according to their merits with respect to the SVM problems in which they are involved as variables

Step I. Sub-problem: forming textlines
Index   3    2    1
Merit   613  432  321

Step I. Sub-problem: finding similar textlines
Index   4
Merit   Not applied^a

Step II. Sub-problem: finding compatible textlines
Index   5
Merit   Not applied^a

Step II. Sub-problem: removing false reading orders
Index   8    9    10   7    12   13   14   11   6
Merit   613  540  362  312  276  265  264  252  150

Step III. Sub-problem: eliminating orthogonal textline overlaps
Index   25   36   60   18   16   43   15   42   44   32   49   39   52
Merit   122  105  99   89   87   59   58   51   42   41   38   38   28
Index   45   53   21   34   23   24   26   33   41   35   17   51   27
Merit   26   25   22   22   22   22   22   22   18   15   12   7    7
Index   28   19   40   37   31   38   29   30   22   48   47   20   46
Merit   7    6    5    4    4    3    2    2    2    2    1    1    1

Step III. Sub-problem: eliminating parallel textline overlaps
Index   54   57   63   56   60   59   58   55   62   61
Merit   85   55   38   36   34   29   27   22   19   3

Step IV. Sub-problem: finding related textlines in orthogonal direction
Index   64   68   67   65   66
Merit   6    5    4    2    1

Step IV. Sub-problem: finding related textlines in parallel direction
Index   71   70   69
Merit   535  486  329

Step IV. Sub-problem: formation of multiple columns
Index   74   72   73
Merit   3031 1723 934

Note that the decimal parts of the merit scores have been omitted.
^a Since only one feature is involved in this problem, there is no need to compute the merit of that feature.

Fig. 19. A watermark overlaps one of the text regions.

Fig. 20. (a) A text region composed of only two vertical textlines. (b) The horizontal taboo boxes are highlighted in yellow. (c) The vertical taboo boxes are highlighted in brown. (d) The incorrect result of our analysis. (e) The correct result.
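The merit scores come from an SVM-based feature-ranking idea [24,25]: the smaller a feature's influence on the trained decision function, the less important it is deemed to be. A hedged sketch of the ranking step, using squared weights of a linear decision function (the weight values shown are hypothetical):

```python
def rank_features_by_merit(weights):
    """Rank feature indices by an SVM-style merit score: the squared weight
    each feature receives in a trained linear decision function, in the
    spirit of SVM-based feature ranking. Larger merit = more important."""
    merits = {i + 1: w * w for i, w in enumerate(weights)}  # 1-based indices
    return sorted(merits, key=merits.get, reverse=True)

# Hypothetical weights for three features of one sub-problem:
print(rank_features_by_merit([0.4, -1.3, 0.9]))  # -> [2, 3, 1]
```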
After completing one step with the help of an SVM decision function,
we train another decision function, based on the information ob-
tained in the previous steps. This "learning-while-doing'' strategy is
very effective in terms of the time spent constructing solutions and
Fig. 17. Two horizontal headlines; the lower one contains two types ofcharacters. the accuracy of the solutions. We believe the same strategy could be
used for many types of documents, not just those we experimented
with in this paper.
Acknowledgment
resulting in a messy outcome (Fig. 20d). The correct result is shown [1] M. Krishnamoorthy, G. Nagy, S. Seth, M. Viswanathan, Syntactic segmentation
and labeling of digitized pages from technical journals, IEEE Trans. Pattern Anal.
in Fig. 20e.
Mach. Intell. 15 (7) (1993) 737--747.
[2] D.J. Ittner, H.S. Baird, Language-free layout analysis, in: Second International
7. Conclusion Conference on Document Analysis and Recognition, 1993, pp. 336--340.
[3] T. Pavlidis, J. Zhou, Page segmentation and classification, Graphical Models
Image Process. 54 (6) (1992) 484--496.
Since we view document layout analysis as a series of split-or- [4] S.-W. Lee, D.-S. Ryu, Parameter-free geometric document layout analysis, IEEE
merge operations, we have to learn how to proceed step-by-step Trans. Pattern Anal. Mach. Intell. 23 (11) (2001) 1240--1256.
when we start to construct our solutions. Our approach involves [5] L. O'Gorman, The document spectrum for page layout analysis, IEEE Trans.
Pattern Anal. Mach. Intell. 15 (11) (1993) 1162--1173.
four steps, namely, textline formation, removal of false reading or- [6] A. Simon, J.-C. Pret, A.P. Johnson, A fast algorithm for bottom-up document
ders, elimination of textline overlaps, and text-region formation. layout analysis, IEEE Trans. Pattern Anal. Mach. Intell. 19 (3) (1997) 273--277.
C.-C. Wu et al. / Pattern Recognition 41 (2008) 3200 -- 3213 3213
[7] F. Liu, Y. Luo, M. Yoshikawa, D. Hu, A new component based algorithm for newspaper layout analysis, in: Sixth International Conference on Document Analysis and Recognition, Seattle, USA, 2001, pp. 1176--1180.
[8] J. Xi, J. Hu, L. Wu, Page segmentation of Chinese newspapers, Pattern Recognition 35 (12) (2002) 2695--2704.
[9] M. Chen, X. Ding, J. Liang, Analysis, understanding and representation of Chinese newspaper with complex layout, in: International Conference on Image Processing, Vancouver, Canada, 2000, pp. 590--593.
[10] M. Okamoto, M. Takahashi, A hybrid page segmentation method, in: Second International Conference on Document Analysis and Recognition, Tsukuba, Japan, 1993, pp. 743--748.
[11] J. Liu, Y.Y. Tang, C.Y. Suen, Chinese document layout analysis based on adaptive split-and-merge and qualitative spatial reasoning, Pattern Recognition 30 (8) (1997) 1265--1278.
[12] F. Chang, S.-Y. Chu, C.-Y. Chen, Chinese document layout analysis using adaptive regrouping strategy, Pattern Recognition 38 (2) (2005) 261--271.
[13] K. Etemad, D. Doermann, R. Chellappa, Multiscale segmentation of unstructured document pages using soft decision integration, IEEE Trans. Pattern Anal. Mach. Intell. 19 (1) (1997) 92--96.
[14] A. Dengel, F. Dubiel, Computer understanding of document structure, Int. J. Imaging Syst. Technol. 7 (1996) 271--278.
[15] K. Laven, S. Leishman, S. Roweis, A statistical learning approach to document image analysis, in: Eighth International Conference on Document Analysis and Recognition, Seoul, Korea, 2005, pp. 357--361.
[16] S. Marinai, M. Gori, G. Soda, Artificial neural networks for document analysis and recognition, IEEE Trans. Pattern Anal. Mach. Intell. 27 (1) (2005) 23--35.
[17] C.L. Liu, K. Nakashima, H. Sako, H. Fujisawa, Handwritten digit recognition: benchmarking of state-of-the-art techniques, Pattern Recognition 36 (10) (2003) 2271--2285.
[18] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (1995) 273--297.
[19] V. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[20] F. Chang, C.-J. Chen, C.-J. Lu, A linear-time component-labeling algorithm using contour tracing technique, Comput. Vision Image Understanding 93 (2) (2004) 206--220.
[21] C.J. van Rijsbergen, Information Retrieval, Butterworths, London, 1979.
[22] D.D. Lewis, Evaluating and optimizing autonomous text classification systems, in: Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 95), 1995, pp. 246--254.
[23] R. Kohavi, G. John, Wrappers for feature subset selection, Artif. Intell. 97 (1997) 273--324.
[24] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Machine Learning 46 (2002) 389--422.
[25] A. Rakotomamonjy, Variable selection using SVM-based criteria, J. Mach. Learn. Res. 3 (2003) 1357--1370.
About the Author---CHUNG-CHIH WU received the B.S. degree from the Department of Information Engineering and Computer Science, Feng Chia University, Taiwan, in
2001, and the M.S. degree from the Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan, in 2003. He joined the Institute
of Information Science, Academia Sinica as a research assistant in 2004. His research interests include document layout analysis and image processing.
About the Author---CHIEN-HSING CHOU received the B.S. and M.S. degrees from the Department of Electrical Engineering, Tamkang University, Taiwan, in 1997 and
1999, respectively, and the Ph.D. degree in Electrical Engineering from Tamkang University, Taiwan, in 2003. He is currently a postdoctoral fellow at
the Institute of Information Science, Academia Sinica, Taiwan. His research interests include pattern recognition, neural networks, and image processing.
About the Author---FU CHANG received the B.A. degree in Philosophy from National Taiwan University in 1973, the M.S. degree in Mathematics from North Carolina
State University in 1978, and the Ph.D. degree in Mathematical Statistics from Columbia University in 1983. He worked as an assistant professor in the Department of Applied Mathematics, Operations Research and Statistics, State University of New York at Stony Brook (1983--1984), and as a member of technical staff at Bell Communications Research, Inc.
(1984--1986) and at AT&T Bell Laboratories (1986--1990). He joined the Institute of Information Science, Academia Sinica, as an associate research fellow in 1990. His current research activities
are focused on machine learning, document analysis and recognition, and cognitive science.