
Document Image Layout Analysis via Explicit Edge Embedding Network

Xingjiao Wu1*, Yingbin Zheng2*, Tianlong Ma1†, Hao Ye2, Liang He1†
1 East China Normal University, Shanghai, China   2 Videt Lab, Shanghai, China
* These authors contributed equally to this work.
† Co-corresponding authors.

Abstract

Layout analysis from a document image plays an important role in document content understanding and information extraction systems. While many existing methods focus on learning knowledge with convolutional networks directly from color channels, we argue the importance of high-frequency structures in document images, especially edge information. In this paper, we present a novel document layout analysis framework with the Explicit Edge Embedding Network. Specifically, the proposed network contains the edge embedding block and the dynamic skip connection block to produce detailed features, as well as a lightweight fully convolutional subnet as the backbone for the effectiveness of the framework. The edge embedding block is designed to explicitly incorporate the edge information from the document images. The dynamic skip connection block aims to learn both color and edge representations with learnable weights. In contrast to previous methods, we harness the model by using a synthetic document approach to overcome data scarcity. The combination of data augmentation and edge embedding yields a more compact representation than directly using the training images with only color channels. We conduct experiments with the proposed framework on three document layout analysis benchmarks and demonstrate its superiority in terms of effectiveness and efficiency over previous approaches.

Figure 1: Left: the original document images. Middle: ground truth of the layouts (segmentation label colors denote figure, table, and text). Right: edges extracted by the Laplacian edge detector [40].
1. Introduction

Document layout analysis (DLA) aims to divide a document image into different regions, such as text, figures, and tables. Analysis of the layout from the document image plays an important role in document content understanding and information extraction applications, such as document understanding [48], knowledge extraction [7, 39], handwriting recognition [6], and biomedical event extraction [50]. A modern DLA system usually consists of page segmentation and logical structure analysis steps, and great progress has been achieved in recent years [5].

Accurately estimating the content categories in a document is still a challenging task due to the gap between the high-level semantics and the low-level visual contents of the documents. While many existing methods learn directly from color channels with convolutional networks, we argue the importance of high-frequency structures in document images, especially edge information. The edges can provide skeleton information that is useful for understanding the document structure. Specifically, the edges usually carry classification attributes of image regions and make the characteristics of document layouts more prominent. An example is shown in Fig. 1, where the edge of text regions has a dense texture, the edge of figures is relatively smooth, and the edge of tables contains more straight lines. Inspired by these observations, this paper works toward an effective layout analysis framework by incorporating explicit edge knowledge. We design the Explicit Edge Embedding Network (E3Net), which superimposes the edge information onto the image channels to generate a more efficient image input block.
We employ the Fully Convolutional Network (FCN, [24]) as the backbone. FCN is composed of layers that represent high-level and low-level information through the encoding part and then superimposes the features from these response maps onto the decoder. The low-level feature maps tend to contain detailed information, while the high-level feature maps carry more semantic information. Exploring a network structure that can effectively connect high-level features and low-level features is also crucial. Inspired by the adaption branches strategy [44], we consider dynamically learning the connection structure from the edge representation, and thus we propose the dynamic skip connection block. Its core idea is to calculate the information gain of the encoding features and add them to the decoding layers with differently learned weights. We report the impact of the individual components as well as comparisons with state-of-the-art approaches on three challenging DLA benchmarks, i.e., DSSE-200 [47], CS-150 [9], and ICDAR2015 [2].

In addition, we augment the data to better ensure the universality of the models. LaTeX is used as a composition engine with contents (images, tables, and text) that we prepared for data synthesis. However, LaTeX cannot generate text at unusual printing scales. Different from previous work, we use images that include unconventional text as replacements to overcome this limitation of LaTeX. To generate more image styles, we use the MS COCO dataset [22] as the image material. Since the MS COCO images are annotated, we can easily obtain the corresponding image descriptions to generate the title and description of each figure. We use this data synthesis method to generate many samples and provide a closed loop that iteratively updates the sample library to realize automatic learning of the model.

Our contributions are summarized as follows.

• For the layout task, we propose explicitly embedding edge information onto the image channels to generate a more efficient image input module. To focus on feature learning, we utilize the edges by generating three edge channels.

• To obtain a universal and effective layout analysis model, we employ the dynamic skip connection on the FCN backbone for the learning of edge representations and improve the data synthesis method for training data generation.

• Extensive evaluations demonstrate the superior performance of the proposed E3Net. Notably, we achieve state-of-the-art results on three document layout analysis datasets compared with existing methods. We also conduct an ablation study to evaluate the effect of the edge embedding block and the dynamic skip connection block. Our whole system can process approximately 8 document images per second.

The rest of this paper is organized as follows. Section 2 introduces the background of document layout analysis. Section 3 discusses the model design and network architecture in detail. In Section 4, we demonstrate the qualitative and quantitative study of the framework. Finally, we conclude our work in Section 5.

2. Related Work

Early DLA work can be divided into two categories, i.e., top-down and bottom-up strategies [5].

The top-down strategy iteratively divides pages into columns, blocks, lines of text, and words. Representative top-down works include texture-based analysis [3], run-length smearing [35], projection-profile analysis [30], and white space analysis [31]. The bottom-up strategy [37, 27, 25, 38] dynamically obtains document analysis results from a small, granular data level. It first uses some local features inherent to the text (such as black-and-white pixel spacing or connection spacing) to detect individual words and then groups the words into lines of text and paragraphs. Both families handle common rectangular layouts successfully; however, for complex layouts, these methods are less effective.

With recent advances in deep convolutional neural networks, several methods based on neural networks have been proposed [14, 46, 20, 41, 52, 51, 34, 19, 45, 43]. For example, He et al. [14] used a multiscale network for semantic page segmentation and element contour detection based on three types of document elements (text blocks, tables, and figures). Recently, the DLA task can also be considered a semantic segmentation task, which performs a pixel-level understanding of the segmented objects [46, 51, 34, 19]. Xu et al. [46] trained a multitask FCN to segment the document image into different regions, and Soullard et al. [34] used the FCN for historical newspaper images. Zheng et al. [51] further included a deep generative model for graphic design layouts to synthesize layout designs. Li et al. [19] proposed the cross-domain DOD model to learn a model for the target domain using labeled data from the source domain and unlabeled data from the target domain. Many of these papers employ the FCN [24] for semantic segmentation of the document pages. With the help of a fully convolutional structure, FCN can adapt to any image size with the pooling operation, which balances speed and accuracy. However, it also causes the spatial information from the image to be weakened during propagation. To compensate for this problem, we use a skip connection structure to enhance spatial information.

Data augmentation. In addition to improving the network structure, many researchers focus on the expansion of data. Some large-scale datasets with additional tools have been proposed, and good results have been achieved through the migration of these datasets.
Figure 2: Architecture of the Explicit Edge Embedding Network (E3Net).

Yang et al. [47] proposed a synthetic dataset and an end-to-end multimodal FCN with text embedding for extracting semantic structures from documents. Kölsch et al. [16] introduced a very challenging dataset of historic German documents for the task of recognizing handwritten documents. Li et al. [18] used LayoutGAN to augment the data by generating different layouts. Haurilet et al. [13] introduced the SPaSe (slide page segmentation) dataset, which contains dense, pixelwise annotations of 25 classes for 2000 slides. Siegel et al. [32] proposed a method to induce high-quality labels by leveraging auxiliary data from arXiv and PubMed with no human intervention.

At present, data enhancement methods mainly fall into the following types: generation from existing auxiliary data, which mainly comes from scientific documents such as arXiv and other network resources; generation through Generative Adversarial Networks (GANs); and generation by LaTeX. LaTeX generation is a simple and effective method, but the text generated in this way is relatively simple due to LaTeX's limitations. The older font generation method ignores relatively large fonts, and the fonts of the text are limited. We use some novels as our text source; in this way, we can enforce the constraint that our basic unit of text is word-level, and we can control the position of the label box more precisely during text synthesis.

3. Method

As shown in Fig. 2, E3Net is composed of four parts: the edge embedding block (EEB), an encoder structure, a decoder structure, and the dynamic skip connection block (DSC). We show the encoder structure, the decoder structure, and the dynamic skip connection block in Table 1. In this section, we first introduce the edge embedding block and then the dynamic skip connection block. Finally, we introduce the strategy of data synthesis.

Figure 3: The edge information from the edge extraction methods used in E3Net. From left to right: original image, edge maps extracted by Sobel, Canny, and Laplacian.

3.1. Edge Embedding Block

The edges are direct characterizations of the image and include some categorical properties. The use of edge information for improved image processing and analysis has attracted many researchers due to its excellent performance [23, 42, 1, 36, 12, 26]. These edge extraction methods can suppress noise and ringing artifacts and smooth staircase effects.

To embed edge knowledge, we propose the edge embedding block (EEB), which superimposes edge information on the image channels to build a more effective image input block. Different edge extraction algorithms focus on different edge information and edge strengths. As shown in Fig. 3, the edges extracted by a single operator cannot represent the overall edge information. In addition, to balance the number of channels, we propose using three different edge extraction operators. In this paper, we use the Sobel edge detector [15], the Laplacian edge detector [40], and the Canny edge detector [11] to locate sharp intensity changes and to find object boundaries in an image. Previous work, such as [23], achieved good edge detection effects. However, the ablation study in our experiments shows that our combination is more suitable for this architecture.
Algorithm 1: Augment input channels with edges

Input: I (w × h × 3): original image channels
Output: E (w × h × 6): augmented channels
1. Obtain the image RGB channels I_R, I_G, I_B;
2. Obtain the grayscale image I_g = Gray(I_R, I_G, I_B);
3. Compute the edge map by Sobel: I_S = Sobel(I_g);
4. Compute the edge map by Laplacian: I_L = Laplacian(I_g);
5. Compute the edge map by Canny: I_C = Canny(I_g);
6. Concatenate the augmented channels E = cat{I_R, I_G, I_B, I_S, I_L, I_C};
7. return E.

Adding edge information can reduce the image's dependence on color and let the model focus on the learning of features. The input channels are enhanced by explicitly appending edge information to the original image. Algorithm 1 shows the steps for generating the augmented channels. We first obtain the RGB channels of the image and convert the image into grayscale. Then, we use the Sobel, Laplacian, and Canny edge detectors to locate sharp intensity changes and to find object boundaries. Finally, we superimpose the RGB channels and the three edge maps into a 6-channel input. The EEB output thus consists of 6 channels: three RGB channels and three channels of edge information.
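For illustration, a minimal Python sketch of Algorithm 1 using OpenCV might look as follows. The operator parameters (kernel sizes and Canny thresholds) are not specified in the paper, so the values below are assumptions, and the function name is ours:

```python
import cv2
import numpy as np

def augment_with_edges(image_bgr: np.ndarray) -> np.ndarray:
    """Build the 6-channel EEB input: RGB plus Sobel, Laplacian, and Canny maps.

    Kernel sizes and Canny thresholds are illustrative assumptions;
    the paper does not state the operator parameters.
    """
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)

    # Sobel gradient magnitude, scaled back to 8-bit.
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    sobel = cv2.convertScaleAbs(np.sqrt(gx ** 2 + gy ** 2))

    # Laplacian (second-derivative) edge response.
    laplacian = cv2.convertScaleAbs(cv2.Laplacian(gray, cv2.CV_32F, ksize=3))

    # Canny binary edge map; thresholds are assumed values.
    canny = cv2.Canny(gray, 50, 150)

    # Stack RGB and the three edge maps into a w x h x 6 input.
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB)
    return np.dstack([rgb, sobel, laplacian, canny])
```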
3.2. FCN with Dynamic Skip Connection

Learning edge representation knowledge effectively is an important task. From the perspective of feature learning, low-level feature maps contain more detailed information, while high-level feature maps contain more semantic information. A traditional FCN cannot explicitly represent features because it only connects the encoder and the decoder by superimposing the feature information [21]. We propose the dynamic skip connection block (DSC) to tackle this problem. The core idea of DSC is to calculate the information gain of the encoder features and add it to the decoder layers with different learned weights. Our method focuses on pixelwise segmentation with a fully convolutional network that uses an edge embedding block and a dynamic skip connection block. We use a lightweight model as the backbone to maintain the processing speed; the backbone parameter count amounts to only 1/6 of VGG16 [33]. The backbone is divided into two parts, the encoder and the decoder. The structure of the encoder is shown in the first column of Table 1. It uses 3 × 3 convolution kernels and four max-pooling layers, and the encoder reduces the image to 1/16 of its original size. The decoder structure is shown in the third column of Table 1. The decoder order is 256-128-64-32-16 from bottom to top, and each decoding layer is composed of a deconvolution, the ReLU activation function, and batch normalization.

The dynamic skip connection block is a learnable connection operation added on top of U-Net [29]. High-level features and low-level features carry different information intensities, but the traditional U-Net directly connects the encoder and decoder. This connection method cannot distinguish high-level information from low-level information well. For more effective use of information in different dimensions, we connect the high-level features and the low-level features using a learnable unit. The dynamic skip connection block is structured with three parallel pathways. The first pathway is designed for low-level feature fusion, with the structure GAP(1)-FC(32, 4)-RELU-FC(4, 32)-Sigmoid. The second pathway structure is GAP(1)-FC(64, 8)-RELU-FC(8, 64)-Sigmoid. The third pathway is designed for high-level feature fusion, and its structure is GAP(1)-FC(128, 16)-RELU-FC(16, 128)-Sigmoid. Here, 'GAP' represents a global average pooling layer, 'RELU' a rectified linear unit, 'FC' a fully connected layer, and 'Sigmoid' the sigmoid activation function. The numbers in the parentheses of FC are the input and output dimensions. Each pathway outputs a regularized weight, and the sum of the weights of the three pathways is 1. After obtaining the weight coefficients, we multiply the feature layers produced by the encoder by the corresponding weight coefficients and superimpose them on the corresponding feature layers of the decoder.
this problem. The core idea of DSC is calculating the in- to be classified, and w ∈ RC is the label weight.
formation gain of the encoder feature and adding the infor-
3.3. Synthetic Document Data
mation gain to the decoder layer by using different weights.
Our method focuses on pixelwise segmentation with a fully The prerequisite for training a universal model is to pro-
convolutional network that uses an edge embedding block vide enough data. At present, the annotation data of the
and a dynamic skip connection block. We use a lightweight document layout analysis task are limited, so we improved
model as the backbone to maintain the model processing the data synthesis method proposed in [47]. Compared with
speed, and the backbone parameter amounts to only 1/6 of previous work, our data synthesis method introduces more
VGG16 [33]. The backbone is divided into two parts, the text elements. We add some special text images to make the
encoder and the decoder. The structure of the encoder is generated samples more natural and realistic. In addition,
shown in the first column of Table 1. It uses the 3 × 3 con- we propose a semiautomatic man-machine hybrid labeling
volution kernel and uses four max-pooling, and the encoder mode to provide more diverse data sources.
will reduce the image to 1/16 of the original. The decoder The document synthetic can be seen as a simple jigsaw
structure is shown in the third column of Table 1. The de- puzzle, and we will add table, figure, and text to an A4 for-
coder order is 256-128-64-32-16 from bottom to top, and mat document. We use LaTex to generate pdf by combining
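Equation (1) is exactly the weighted cross-entropy that PyTorch's nn.CrossEntropyLoss computes on raw (pre-softmax) activations. A minimal sketch for the four-class setting follows; the weight values are placeholders, not values from the paper:

```python
import torch
import torch.nn as nn

# C = 4 layout classes: text, figure, table, background.
# The class-weight values are illustrative placeholders.
class_weights = torch.tensor([1.0, 1.0, 1.0, 0.5])
criterion = nn.CrossEntropyLoss(weight=class_weights)

# logits: (batch, C, H, W) raw scores without softmax; labels: (batch, H, W).
logits = torch.randn(2, 4, 64, 64)
labels = torch.randint(0, 4, (2, 64, 64))
loss = criterion(logits, labels)
```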
Table 1: Configuration of the backbone. All convolutional layers use padding to maintain the previous size. The convolutional layer parameters are denoted as conv-(kernel size)-(number of filters)-(dilation rate), and max-pooling layers are conducted over a 2-pixel window with stride 2. Here, deconv, conv, pool, and FC represent the deconvolution layer, convolution layer, max-pooling layer, and fully connected layer, respectively.

encoder ⇓        dynamic skip connection    decoder ⇑

conv3-6-1        GAP                        deconv3-32-1
conv3-32-1       FC-32-4                    RELU
conv3-32-1       RELU                       BN(64)
max-pooling      FC-4-32                    deconv3-16-1
                 Sigmoid                    conv1-4-1

conv3-64-1       GAP                        deconv3-64-1
conv3-64-1       FC-64-8                    RELU
max-pooling      RELU                       BN(32)
                 FC-8-64
                 Sigmoid

conv3-128-1      GAP                        deconv3-128-1
conv3-128-1      FC-128-16                  RELU
conv3-128-1      RELU                       BN(64)
max-pooling      FC-16-128
                 Sigmoid

conv3-256-1                                 deconv3-256-1
conv3-256-1                                 RELU
conv3-256-1                                 BN(128)
max-pooling

3.3. Synthetic Document Data

The prerequisite for training a universal model is to provide enough data. At present, the annotated data for the document layout analysis task are limited, so we improved the data synthesis method proposed in [47]. Compared with previous work, our data synthesis method introduces more text elements. We add some special text images to make the generated samples more natural and realistic. In addition, we propose a semiautomatic man-machine hybrid labeling mode to provide more diverse data sources.

Document synthesis can be seen as a simple jigsaw puzzle: we add tables, figures, and text to an A4-format document. We use LaTeX to generate PDFs by combining images, tables, and texts. We prepare the necessary material by collecting figures and tables from web resources. Moreover, to enrich the image information, we use some images from MS COCO [22] and randomly add the corresponding image titles. Due to the limitations on the type of text generated by LaTeX, we not only directly use novels as text sources to generate PDFs but also include unconventional text images as text sources. Using these text sources can overcome the limitation of LaTeX. Using the novel material, we constrain the minimum text unit to be word-level. Constraining the minimum unit avoids the format problems that appear when using some network resources (for example, a paragraph without spaces or some meaningless text). Using LaTeX to generate the PDFs, we can easily obtain the corresponding labels. We show synthesized images in Fig. 4, and we can see that our synthetic data are very similar to real document images.

Figure 4: Sample synthetic documents.

E3Net has been able to obtain good results in practical problems. Furthermore, we want to provide a better user experience. Therefore, we propose a semiautomatic hybrid data annotation strategy using a human-machine hybrid. Our training process can be regarded as a closed loop. We use synthetic data to train E3Net and test it after a random epoch. The test data are unlabeled; therefore, we cannot obtain specific indicators, but we can distinguish tables using edge detection. We use edge detection to obtain a table area, and we use the table area as the masking label. We compare it with the classified prediction map, choose the high-error-rate images, and annotate them. We split these images into different elements (table and figure) and put them into the data source to generate new data samples for retraining. Specifically, we first input unlabeled data into the layout model and obtain the predicted result. In addition, the unlabeled data are input to a non-learning algorithm (a rule-based table detection algorithm) to obtain the table area. We compare the table information predicted by the two algorithms, choose the inconsistent data (i.e., data with a degree of difference of more than 60%), and manually label those data into the data pool. We split out the elements (table and figure) in these data and put them into the data generation model to generate more new data. The selection rule can be implemented by comparing the two table masks, as sketched below.
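A simplified sketch of this selection rule, assuming binary table masks from the network prediction and from the rule-based detector are already available; the 60% threshold comes from the text, while the IoU-style measure of "degree of difference" is our interpretation:

```python
import numpy as np

def needs_manual_label(pred_mask: np.ndarray, rule_mask: np.ndarray,
                       threshold: float = 0.6) -> bool:
    """Flag a page when the model's table mask and the rule-based table
    mask disagree on more than `threshold` of their union."""
    union = np.logical_or(pred_mask, rule_mask).sum()
    if union == 0:
        return False  # neither method found a table
    agreement = np.logical_and(pred_mask, rule_mask).sum() / union
    return (1.0 - agreement) > threshold
```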
4. Evaluation and Discussion

We evaluate the proposed E3Net on three document layout analysis benchmarks: DSSE-200 [47], CS-150 [9], and ICDAR2015 [2]. We first introduce the experimental configuration. Then we show the qualitative results and compare E3Net with prior works. Finally, we present an ablation study on DSSE-200 to evaluate the effect of the dynamic skip connection block and the edge embedding block.

4.1. Configurations

Categories and model training. There is no unified standard for layout classification at present. Many previous works divided the layout into three categories: figures, tables, and others. Some works divide it into the following seven categories: figure, table, paragraph, background, caption, list, and section. However, if only figures and tables are classified, we cannot effectively use the text and background information, while too many categories make the layout work more cumbersome. This paper makes a trade-off and considers the following four categories: text, figure, table, and background. We fine-tune the model by randomly selecting 10% of the target dataset as the training data, and then we reduce the learning rate to 1/10 of the original learning rate. To prevent overfitting due to too little data, we split the elements (tables and figures) of the data and then put these elements into the LaTeX document synthesis engine for data expansion. We use these data to fine-tune the model.
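A hedged sketch of this fine-tuning setup follows; the 10% split and the 1/10 learning-rate reduction come from the text, while the optimizer choice and the base learning rate are assumptions:

```python
import torch
from torch.utils.data import Dataset, Subset

def finetune_split(dataset: Dataset, model: torch.nn.Module,
                   base_lr: float = 1e-3):
    """Hold out a random 10% of the target dataset and reduce the LR to 1/10.

    Adam and base_lr are assumptions; the paper only states the 10% split
    and the 1/10 learning-rate reduction.
    """
    n = len(dataset)
    train_ids = torch.randperm(n)[: max(1, n // 10)].tolist()
    finetune_set = Subset(dataset, train_ids)
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr / 10)
    return finetune_set, optimizer
```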
Metric. Several metrics are used to evaluate the performance. We first define M as the n × n confusion matrix with n categories. Accuracy (Acc) is the ratio of the pixels that are correctly predicted in a given image, i.e.,

\mathrm{Acc} = \frac{\sum_i M_{ii}}{\sum_{ij} M_{ij}}.    (2)

Precision (P) is the ratio of true positives among the examples that are classified as positive, i.e.,

P = \frac{1}{n} \sum_{i=1}^{n} P_i, \quad P_i = \frac{M_{ii}}{\sum_j M_{ji}}.    (3)

Recall (R) measures the coverage, i.e., the fraction of actual positive examples that are classified as positive:

R = \frac{1}{n} \sum_{i=1}^{n} R_i, \quad R_i = \frac{M_{ii}}{\sum_j M_{ij}}.    (4)

F1 is an indicator used to measure the accuracy of a classification model. It takes both the precision and the recall into account; the F1 score is the harmonic mean of precision and recall:

F_1 = \frac{2 \cdot P \cdot R}{P + R}.    (5)

MIoU is the mean intersection-over-union of each foreground category, i.e.,

\mathrm{MIoU} = \frac{1}{n+1} \sum_{i=0}^{n} \frac{M_{ii}}{\sum_{j=0}^{n} M_{ij} + \sum_{j=0}^{n} M_{ji} - M_{ii}}.    (6)
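For concreteness, a NumPy sketch that computes Equations (2)-(6) from a confusion matrix might look as follows; it assumes every class occurs at least once (no zero denominators) and simply averages over all classes in the matrix:

```python
import numpy as np

def layout_metrics(conf: np.ndarray):
    """Acc, P, R, F1, and MIoU from an n x n confusion matrix.

    Rows index ground-truth classes, columns index predictions;
    assumes every class appears so no denominator is zero.
    """
    diag = np.diag(conf).astype(float)
    acc = diag.sum() / conf.sum()                        # Eq. (2)
    precision = (diag / conf.sum(axis=0)).mean()         # Eq. (3), column sums
    recall = (diag / conf.sum(axis=1)).mean()            # Eq. (4), row sums
    f1 = 2 * precision * recall / (precision + recall)   # Eq. (5)
    iou = diag / (conf.sum(axis=0) + conf.sum(axis=1) - diag)
    return acc, precision, recall, f1, iou.mean()        # Eq. (6)
```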

Datasets. We employ three benchmarks for evaluation. DSSE-200 [47] is a comprehensive dataset that includes various document styles. It contains 200 images, including pictures, PPT slides, brochure documents, old newspapers, and scanned files with lighting changes. CS-150 [9] is a dataset consisting of 150 papers; it is divided into three categories (images, tables, and others) and consists of 1175 samples. ICDAR2015 focuses on appearance-based regions [2]. It consists of magazine and journal material containing 7 training sets and 70 tests. ICDAR2015 is not a simple rectangular segmentation task, as figures are directly embedded in the paragraphs. Samples from the datasets are illustrated in the top rows of Fig. 5.

Figure 5: Example real documents (top) and their corresponding segmentation predictions (bottom) on three datasets. Segmentation label colors are: figure, table, text, background (for DSSE-200 and ICDAR2015), and non-text (for CS-150).

4.2. Qualitative Results

DSSE-200. We use the synthetic dataset to train the network and make predictions on the DSSE-200 dataset. The overall performance of E3Net is accuracy 0.82, precision 0.79, recall 0.73, F1 0.76, and MIoU 0.57; the confusion matrix is illustrated in Fig. 6-Left. We can observe that E3Net has excellent recognition rates for backgrounds, figures, and tables, as the edge information is used to improve background discrimination. However, the ability to recognize text is slightly lower than for the other categories; e.g., some text pixels are recognized as table, probably because the contents of text and table regions are quite similar. The sample documents and their corresponding predictions on DSSE-200 are shown in Fig. 5(a). In general, the document layouts are correctly extracted, and the borders of the regions can be refined with postprocessing steps such as connected component analysis.
Table 2: Per-category comparison based on IoU scores (%) on the DSSE-200. FT indicates the model with fine-tuning.

Method       background  figure  table  section  caption  list  paragraph  mean
MFCN [47]    83.9        83.7    79.7   59.4     61.1     68.4  79.3       73.3
E3Net        95.9        88.8    90.7   89.8     41.6     71.2  56.7       76.3
E3Net (FT)   96.5        96.1    93.0   77.0     50.4     60.6  68.3       77.4

Table 3: Comparing E3Net with previous network structures on the DSSE-200 and CS-150 datasets.

Method       #Parameters  DSSE-200 (Acc / P / R / F1 / MIoU)   CS-150 (Acc / P / R / F1 / MIoU)
SegNet [4]   29M          0.76 / 0.71 / 0.72 / 0.71 / 0.49     0.76 / 0.71 / 0.72 / 0.71 / 0.49
PANet [17]   168M         0.79 / 0.74 / 0.72 / 0.73 / 0.53     0.96 / 0.82 / 0.91 / 0.87 / 0.52
PSPNet [49]  46M          0.72 / 0.69 / 0.79 / 0.74 / 0.51     0.96 / 0.84 / 0.97 / 0.90 / 0.63
DV3+ [8]     53M          0.78 / 0.72 / 0.75 / 0.73 / 0.64     0.96 / 0.81 / 0.97 / 0.88 / 0.63
E3Net        3M           0.82 / 0.79 / 0.73 / 0.76 / 0.57     0.96 / 0.85 / 0.97 / 0.91 / 0.64

Table 5: Per-category comparison based on IoU scores (%) on ICDAR2015.

Method       non-text  text  figure  mean
MFCN [47]    94.5      91.0  77.1    87.53
E3Net        81.6      79.1  85.0    81.87
E3Net (FT)   90.1      88.3  93.5    90.59

Figure 6: Confusion matrices for DSSE-200 (left) and CS-150 (right).
CS-150. We follow the same steps as in DSSE-200 to conduct an experiment on CS-150, and the results are accuracy 0.96, precision 0.85, recall 0.97, F1 0.91, and MIoU 0.64. The performance on CS-150 is good for both the overall metrics and the per-category results (as shown in the confusion matrix of Fig. 6-Right). The CS-150 dataset is entirely composed of scientific papers, and the layout is relatively simple. We demonstrate some document images and the corresponding predictions for CS-150 in Fig. 5(b).

Table 4: Per-category comparison based on CS-150.

Method               figure (P / R / F1)      table (P / R / F1)
Praczyk et al. [28]  0.624 / 0.500 / 0.555    0.429 / 0.363 / 0.393
Clark et al. [10]    0.961 / 0.911 / 0.935    0.962 / 0.921 / 0.941
Clark et al. [9]     0.980 / 0.961 / 0.970    0.979 / 0.963 / 0.971
E3Net                0.938 / 0.972 / 0.956    0.834 / 0.988 / 0.905
E3Net (FT)           0.986 / 0.970 / 0.978    0.971 / 0.977 / 0.973

ICDAR2015. We also show qualitative results for ICDAR2015, as illustrated in Fig. 5(c). With the help of the edge embedding network, our method can successfully classify most of the pixels into layout categories across different backgrounds and visual contents. In addition, we can see that E3Net successfully handles figures that are directly embedded in text paragraphs.

4.3. Comparison with Prior Arts

To evaluate our model, we compare it with state-of-the-art document layout analysis methods, which also use image content as input. We follow the settings and evaluation protocols of [47] (for DSSE-200 and ICDAR2015) and [9] (for CS-150).

For the DSSE-200 dataset, we can see that our results are more effective than those in [47] (Table 2). It is worth noting that the edge contains more discriminative information for background regions; therefore, E3Net can obtain good results for background recognition even without fine-tuning. E3Net has a better recognition effect for figures, tables, and sections, and the mean score improves by 4% compared with [47]. For the ICDAR2015 dataset, as listed in Table 5, E3Net with fine-tuning also achieves a 3% mean score improvement over [47]. In the comparison on CS-150, we use precision, recall, and F1 as the metrics, and the results in Table 4 show that ours outperforms previous approaches for both the figure and table categories. Specifically, when we use E3Net trained from synthetic documents without fine-tuning, the overall performance is comparable, and the recall rate is high. Fine-tuning brings the model closer to the distribution of the CS-150 data and improves the precision of the whole network.

As mentioned in the previous sections, E3Net is designed based on an FCN-like backbone. Here, we also compare our approach with state-of-the-art networks for the general semantic segmentation task. Table 3 reports the performance on the DSSE-200 and CS-150 datasets with different metrics. We can observe that the proposed E3Net achieves better results, while its parameter size is much smaller than that of the others.
Figure 7: Network for two-stream fusion.

Table 6: Comparison with two-stream fusion.

Method             Acc   P     R     F1    MIoU
E3Net w/o Edge     0.70  0.75  0.62  0.68  0.46
Two-stream fusion  0.74  0.72  0.66  0.69  0.53
E3Net              0.82  0.79  0.73  0.76  0.57

Table 7: Evaluation of different edge embedding settings. LSB indicates the edge embedding block with Laplacian, Sobel, and bilateral filter.

Method           Acc   P     R     F1    MIoU
E3Net            0.82  0.79  0.73  0.76  0.57
E3Net w/o Edge   0.70  0.75  0.62  0.68  0.46
E3Net (Sobel)    0.78  0.73  0.77  0.75  0.58
E3Net (LSB)      0.78  0.73  0.76  0.75  0.51

4.4. Ablation Study

In this section, we perform an ablation study on the DSSE-200 dataset. We start by exploring variations of the network architecture to find the optimal fusion strategy. Then the components of our framework, such as the edge embedding block and the skip connection, are evaluated. Throughout the experiments, we use E3Net w/o X to denote the network of E3Net without component X for presentation simplicity.

Model architecture. The design of an effective network is of great importance for model learning. In this paper, we fuse the color channels and the edges explicitly into the augmented input; another potential approach is to treat them as two independent streams and fuse them in the last few layers. Fig. 7 demonstrates the architecture of two-stream fusion, which consists of two branches: two independent encoders and one decoder for the fusion. We list the comparison in Table 6. Adding edge information under a two-stream framework improves the performance (for accuracy, F1, and MIoU). However, compared with E3Net, the effect is limited. This demonstrates that fusion at an early stage can take full advantage of the complementarity of color and edge clues.
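For reference, a minimal sketch of such a two-stream baseline is shown below; the paper's Fig. 7 does not list exact channel widths for this variant, so the layer sizes here are illustrative:

```python
import torch
import torch.nn as nn

def conv_block(cin: int, cout: int) -> nn.Sequential:
    """3x3 conv + ReLU + 2x down-sampling, as a stand-in encoder stage."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(inplace=True), nn.MaxPool2d(2)
    )

class TwoStreamFusion(nn.Module):
    """Two independent encoders (RGB and edge maps) fused before one decoder."""

    def __init__(self):
        super().__init__()
        self.rgb_encoder = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.edge_encoder = nn.Sequential(conv_block(3, 32), conv_block(32, 64))
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, 4, 1),  # 4 layout classes
        )

    def forward(self, rgb: torch.Tensor, edges: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([self.rgb_encoder(rgb), self.edge_encoder(edges)], dim=1)
        return self.decoder(fused)
```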
Edge embedding. In this section, we verify the effect of the EEB block. As shown in Table 7, removing the edge embedding causes a drop in all metrics, e.g., 12% for accuracy and 11% for MIoU. We also compare different edge embedding settings. The first is to use single-channel Sobel edges (E3Net (Sobel) in Table 7). The results show that E3Net outperforms E3Net (Sobel) by using more edge representations, probably because of the complementarity of different edge detectors. The second group of experiments involves changing the edge detectors in the EEB. Here, we replace Canny with the bilateral filter and build the model E3Net (LSB). From the table, we can see that this combination is not as good as the original E3Net. We compute the confusion matrices for both networks (Fig. 8) and find that the E3Net with Laplacian, Sobel, and Canny edge detectors has a significantly better representation of texts and figures.

Figure 8: Confusion matrices for DSSE-200. Left: E3Net. Right: E3Net (LSB).

Table 8: Evaluation of the dynamic skip connection.

Method               Acc   P     R     F1    MIoU
E3Net                0.82  0.79  0.73  0.76  0.57
E3Net w/o DSC        0.73  0.69  0.66  0.67  0.50
E3Net w/o Edge       0.70  0.75  0.62  0.68  0.46
E3Net w/o Edge&DSC   0.67  0.63  0.64  0.64  0.48

Dynamic skip connection. Incorporating the skip connection has been proven to be useful for many computer vision tasks, and we wonder whether it can promote a document layout analysis system. As shown in Table 8, the substantial performance gains over E3Net w/o DSC confirm the effectiveness of using the dynamic skip connection for the DLA task. Although adding the DSC into a traditional FCN without edges also improves the performance (Table 8, rows 3 and 4), the network combining DSC and edge embedding improves on a larger scale and shows a powerful descriptive ability for document layouts.

Speed. The proposed framework is trained and evaluated on a GPU. Inference on an image with a size of 512 × 384 pixels takes 0.12 seconds with a single Nvidia Titan Xp, meaning that our whole system can generally process approximately 8 document images per second.

5. Conclusions

In this paper, we presented a novel solution for constructing a universal document layout analysis model. Our approach explored the use of the dynamic skip connection block and edge information to improve the model structure, along with the construction of a complete synthetic data scheme. We presented a dynamic skip connection block that can be dynamically provisioned based on specific instances, and we use the edge embedding block to let the model focus more on text content. In addition, we discussed the feasibility of fusion strategies with the edge. Experimental comparisons with state-of-the-art approaches on DSSE-200, CS-150, and ICDAR2015 showed the effectiveness and efficiency of our proposed E3Net for the document layout analysis task.
References

[1] D. Acuna, A. Kar, and S. Fidler. Devil is in the edges: Learning semantic boundaries from noisy annotations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 11075–11083, 2019.
[2] A. Antonacopoulos, C. Clausner, C. Papadopoulos, and S. Pletschacher. ICDAR2015 competition on recognition of documents with complex layouts - RDCL2015. In IAPR International Conference on Document Analysis and Recognition, pages 1151–1155, 2015.
[3] A. Asi, R. Cohen, K. Kedem, and J. El-Sana. Simplifying the reading of historical manuscripts. In IAPR International Conference on Document Analysis and Recognition, pages 826–830, 2015.
[4] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[5] G. M. BinMakhashen and S. A. Mahmoud. Document layout analysis: A comprehensive survey. ACM Computing Surveys, 52(6):109, 2019.
[6] G. M. BinMakhashen and S. A. Mahmoud. Historical document layout analysis using anisotropic diffusion and geometric features. International Journal on Digital Libraries, pages 1–14, 2020.
[7] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt. YAKE! Keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289, 2020.
[8] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision, pages 801–818, 2018.
[9] C. Clark and S. Divvala. PDFFigures 2.0: Mining figures from research papers. In ACM/IEEE Joint Conference on Digital Libraries, pages 143–152, 2016.
[10] C. A. Clark and S. Divvala. Looking beyond text: Extracting figures, tables and captions from computer science papers. In Workshops at AAAI Conference on Artificial Intelligence, 2015.
[11] L. Ding and A. Goshtasby. On the Canny edge detector. Pattern Recognition, 34(3):721–725, 2001.
[12] Z. Fu, T. Ma, Y. Zheng, H. Ye, J. Yang, and L. He. Edge-aware deep image deblurring. arXiv:1907.02282, 2019.
[13] M. Haurilet, Z. Al-Halah, and R. Stiefelhagen. SPaSe: Multi-label page segmentation for presentation slides. In IEEE Winter Conference on Applications of Computer Vision, pages 726–734, 2019.
[14] D. He, S. Cohen, B. Price, D. Kifer, and C. L. Giles. Multi-scale multi-task FCN for semantic page segmentation and table detection. In IAPR International Conference on Document Analysis and Recognition, 2017.
[15] J. Kittler. On the accuracy of the Sobel edge detector. Image and Vision Computing, 1(1):37–42, 1983.
[16] A. Kölsch, A. Mishra, S. Varshneya, M. Z. Afzal, and M. Liwicki. Recognizing challenging handwritten annotations with fully convolutional networks. In International Conference on Frontiers in Handwriting Recognition, pages 25–31, 2018.
[17] H. Li, P. Xiong, J. An, and L. Wang. Pyramid attention network for semantic segmentation. In British Machine Vision Conference, 2018.
[18] J. Li, J. Yang, A. Hertzmann, J. Zhang, and T. Xu. LayoutGAN: Generating graphic layouts with wireframe discriminators. In International Conference on Learning Representations, 2019.
[19] K. Li, C. Wigington, C. Tensmeyer, H. Zhao, N. Barmpalios, V. I. Morariu, V. Manjunatha, T. Sun, and Y. Fu. Cross-domain document object detection: Benchmark suite and method. In IEEE Conference on Computer Vision and Pattern Recognition, pages 12915–12924, 2020.
[20] Y. Li, Y. Zou, and J. Ma. DeepLayout: A semantic segmentation approach to page layout analysis. In International Conference on Intelligent Computing (ICIC), pages 266–277, 2018.
[21] C. Lin, S. Zhuang, S. You, X. Liu, and Z. Zhu. Real-time foreground object segmentation networks using long and short skip connections. Information Sciences, 2021.
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
[23] H. Liu, R. Xiong, Q. Song, F. Wu, and W. Gao. Image super-resolution based on adaptive joint distribution modeling. In IEEE Visual Communications and Image Processing, 2017.
[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[25] Y. Lu and C. L. Tan. Constructing area Voronoi diagram in document images. In IAPR International Conference on Document Analysis and Recognition, pages 342–346, 2005.
[26] G. Mandal and D. Bhattacharjee. Learning-based single image super-resolution with improved edge information. Pattern Recognition and Image Analysis, 30(3):391–400, 2020.
[27] M. Mehri, P. Héroux, P. Gomez-Krämer, and R. Mullot. Texture feature benchmarking and evaluation for historical document image analysis. International Journal on Document Analysis and Recognition, 20(1):1–35, 2017.
[28] P. A. Praczyk and J. Nogueras-Iso. Automatic extraction of figures from scientific publications in high-energy physics. Information Technology and Libraries, 32(4):25–52, 2013.
[29] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.
[30] F. Shafait and T. M. Breuel. The effect of border noise on the performance of projection-based page segmentation methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):846–851, 2010.
[31] F. Shafait, J. Van Beusekom, D. Keysers, and T. M. Breuel. Background variability modeling for statistical layout analysis. In International Conference on Pattern Recognition, pages 1–4, 2008.
[32] N. Siegel, N. Lourie, R. Power, and W. Ammar. Extracting scientific figures with distantly supervised neural networks. In ACM/IEEE Joint Conference on Digital Libraries, pages 223–232, 2018.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[34] Y. Soullard, P. Tranouez, C. Chatelain, S. Nicolas, and T. Paquet. Multi-scale gated fully convolutional DenseNets for semantic labeling of historical newspaper images. Pattern Recognition Letters, 131:435–441, 2020.
[35] W. Swaileh, K. A. Mohand, and T. Paquet. Multi-script iterative steerable directional filtering for handwritten text line extraction. In IAPR International Conference on Document Analysis and Recognition, 2015.
[36] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler. Gated-SCNN: Gated shape CNNs for semantic segmentation. In International Conference on Computer Vision, pages 5229–5238, 2019.
[37] T. A. Tran, I.-S. Na, and S.-H. Kim. Hybrid page segmentation using multilevel homogeneity structure. In International Conference on Ubiquitous Information Management and Communication (IMCOM), 2015.
[38] N. Vasilopoulos and E. Kavallieratou. Complex layout analysis based on contour classification and morphological operations. Engineering Applications of Artificial Intelligence, 65:220–229, 2017.
[39] K. Vyas and F. Frasincar. Determining the most representative image on a web page. Information Sciences, 512:1234–1248, 2020.
[40] X. Wang. Laplacian operator-based edge detectors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):886–890, 2007.
[41] C. Wick and F. Puppe. Fully convolutional neural networks for page segmentation of historical document images. In IAPR International Workshop on Document Analysis Systems, pages 287–292, 2018.
[42] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou. Look at boundary: A boundary-aware face alignment algorithm. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2129–2138, 2018.
[43] X. Wu, Z. Hu, X. Du, J. Yang, and L. He. Document layout analysis via dynamic residual feature fusion. In IEEE International Conference on Multimedia & Expo (ICME), 2021.
[44] X. Wu, Y. Zheng, H. Ye, W. Hu, T. Ma, J. Yang, and L. He. Counting crowds with varying densities via adaptive scenario discovery framework. Neurocomputing, 397:127–138, 2020.
[45] Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou. LayoutLM: Pre-training of text and layout for document image understanding. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200, 2020.
[46] Y. Xu, F. Yin, Z. Zhang, and C.-L. Liu. Multi-task layout analysis for historical handwritten documents using fully convolutional networks. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1057–1063, 2018.
[47] X. Yang, E. Yumer, P. Asente, M. Kraley, D. Kifer, and C. Lee Giles. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5315–5324, 2017.
[48] C. Yuan, H. Huang, C. Feng, G. Shi, and X. Wei. Document-level relation extraction with entity-selection attention. Information Sciences, 568:163–174, 2021.
[49] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
[50] W. Zhao, J. Zhang, J. Yang, T. He, H. Ma, and Z. Li. A novel joint biomedical event extraction framework via two-level modeling of documents. Information Sciences, 550:27–40, 2021.
[51] X. Zheng, X. Qiao, Y. Cao, and R. W. Lau. Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG), 38(4):1–15, 2019.
[52] Y. Zheng, S. Kong, W. Zhu, and H. Ye. Scalable document image information extraction with application to domain-specific analysis. In IEEE International Conference on Big Data, 2019.
