Document Image Layout Analysis Via Explicit Edge Embedding Network
Xingjiao Wu¹*, Yingbin Zheng²*, Tianlong Ma¹†, Hao Ye², Liang He¹†
¹East China Normal University, Shanghai, China   ²Videt Lab, Shanghai, China

Abstract

1. Introduction
ing a network structure that can effectively connect high-level features and low-level features is also crucial. Inspired by the adaption branches strategy [44], we consider dynamically learning the connection structure from the edge representation, and thus we propose the dynamic skip connection block. The core idea of the dynamic skip connection block is to calculate the information gain of the encoding features and add them to the decoding layer using differently weighted overlaps. We report the impact of the different components as well as the comparison with state-of-the-art approaches on three challenging DLA benchmarks, i.e., DSSE-200 [47], CS-150 [9], and ICDAR2015 [2].

In addition, we augment the data to better ensure the universality of the models. LaTeX is used as a composition engine with the contents (images, tables, and text) we prepared for data synthesis. However, the LaTeX style cannot generate texts with unusual printing scales. Different from previous work, we use images containing unconventional texts in place of plain text to overcome this limitation of LaTeX. To generate more image styles, we use the MS COCO dataset [22] as the image material. Since the MS COCO images are annotated, we can easily obtain the corresponding image description and generate the title and description of the image. We use this data synthesis method to generate many samples and provide a closed loop that iteratively updates the sample library to realize automatic learning of the model.

Our contributions are summarized as follows.

• For the layout task, we propose explicitly embedding edge information onto the image channels to generate a more efficient image input module. To focus on feature learning, we utilize the edge by generating three edge channels.

• To obtain a universal and effective layout analysis model, we employ the dynamic skip connection on the FCN backbone for learning the edge representation, and we improve the data synthesis method for training data generation.

• Extensive evaluations demonstrate the superior performance of the proposed E³Net. Notably, we achieve state-of-the-art results on three document layout analysis datasets compared with existing methods. We also conduct an ablation study to evaluate the effect of the edge embedding block and the dynamic skip connection block. Our whole system can process approximately 8 document images per second.

The rest of this paper is organized as follows. Section 2 introduces the background of document layout analysis. Section 3 discusses the model design and network architecture in detail. In Section 4, we demonstrate the qualitative and quantitative study of the framework. Finally, we conclude our work in Section 5.

2. Related Work

Early DLA work can be divided into two categories, i.e., top-down and bottom-up strategies [5].

The top-down strategy iteratively divides pages into columns, blocks, lines of text, and words. Representative works belonging to the top-down strategy include texture-based analysis [3], run-length smearing [35], projection-profile analysis [30], and white space analysis [31]. The bottom-up strategy [37, 27, 25, 38] dynamically obtains document analysis results from a small, granular data level. It first uses some local features inherent in the text (such as black and white pixel spacing or connection spacing) to detect individual words and then groups the words into lines of text and paragraphs. Both the top-down and the bottom-up methods deal successfully with common rectangular image layouts. However, for complex layouts, these methods do not seem to be as effective.

With recent advances in deep convolutional neural networks, several methods based on neural networks have been proposed [14, 46, 20, 41, 52, 51, 34, 19, 45, 43]. For example, He et al. [14] used a multiscale network for semantic page segmentation and element contour detection based on three types of document elements (text blocks, tables, and figures). Recently, the DLA task can also be considered a semantic segmentation task, which is to perform a pixel-level understanding of the segmentation object [46, 51, 34, 19]. Xu et al. [46] trained a multitask FCN to segment the document image into different regions, and Soullard et al. [34] used the FCN for historical newspaper images. Zheng et al. [51] further included a deep generative model for graphic design layouts to synthesize layout designs. Li et al. [19] proposed the cross-domain DOD model to learn the model for the target domain using labeled data from the source domain and unlabeled data from the target domain. Many of these papers employ the FCN [24] for semantic segmentation of the document pages. With the help of the fully convolutional structure, the FCN can adapt to any size of image through the pooling operation, which balances speed and accuracy. However, it also causes the spatial information of the image to be weakened during the propagation process. To compensate for this problem, we use a skip connection structure to enhance spatial information.

Data augmentation. In addition to improving the network structure, many researchers focus on the expansion of data. Some large-scale datasets with additional tools have been proposed, and good results have been achieved through the migration of these datasets. Yang et al. [47] proposed a
[Figure residue: the overall network architecture diagram, showing a 6-channel input formed from the RGB and edge-embedding (EE) channels, encoder stages conv1-conv4 with feature widths 32-256 at resolutions I/2 to I/16, a mirrored decoder, and skip connections weighted by W1, W2, and W3.]
images, tables, and texts. We prepare the necessary material by collecting figures and tables from web resources. Moreover, to enrich the image information, we use some images from MS COCO [22] and randomly add the corresponding image title. Due to the limitations of the type of text generated by LaTeX, we not only directly use novels as text sources to generate PDFs but also include unconventional text images as text sources. Using these text sources can overcome the limitation of LaTeX. When using the novel material, the minimum unit is constrained to the word level. Constraining the minimum unit can avoid the format problems that appear when using some network resources (for example, a paragraph without spaces or some meaningless text resources). Using LaTeX to generate the PDF, we can easily obtain the corresponding label. We synthesized images, as shown in Fig. 4, and we can see that our synthetic data are very similar to real document images.

The E³Net has been able to obtain good results in practical problems. Furthermore, we want to provide a better user experience. Therefore, we propose a semiautomatic hybrid data annotation strategy using a human-machine hybrid. Our training process can be regarded as a closed loop. We

4. Evaluation and Discussion

We evaluate the proposed E³Net on three document layout analysis benchmarks: DSSE-200 [47], CS-150 [9], and ICDAR2015 [2]. We first introduce the experimental configuration. Then we show the qualitative results and compare E³Net with prior works. Finally, we consider the ablation study on DSSE-200 to evaluate the effect of the dynamic skip connection block and the edge embedding block.

4.1. Configurations

Categories and model training. There is no unified standard on the classification for layouts at present. Many previous works divided the layout into three categories: figures, tables, and others. Some works divide it into the following seven categories: figure, table, paragraph, background, caption, list, and section. However, if only figures and tables are classified, we cannot effectively use the text and background information. If there are too many categories, the layout work becomes more cumbersome. This paper makes a trade-off and considers the following four categories: text, figure, table, and background. We fine-tune the
mance. We first define M as the n × n confusion matrix with n categories. Accuracy (Acc) is the ratio of the pixels that are correctly predicted in a given image, i.e.,

Acc = \frac{\sum_{i} M_{ii}}{\sum_{ij} M_{ij}}    (2)

F_1 = \frac{2 \cdot P \cdot R}{P + R}    (5)

MIoU is the mean intersection-over-union of each foreground category, i.e.,

MIoU = \frac{1}{n+1} \sum_{i=0}^{n} \frac{M_{ii}}{\sum_{j=0}^{n} M_{ij} + \sum_{j=0}^{n} M_{ji} - M_{ii}}    (6)
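To make the metric computation concrete, the following minimal sketch (Python with NumPy; the helper names and the way precision, recall, and F1 are averaged over classes are our assumptions for illustration, not the authors' evaluation code) builds a pixel-level confusion matrix from predicted and ground-truth label maps and derives Acc, P, R, F1, and MIoU from it in the spirit of Eqs. (2), (5), and (6).

import numpy as np

def confusion_matrix(pred, gt, num_classes):
    # M[i, j] counts pixels whose ground-truth class is i and predicted class is j.
    mask = (gt >= 0) & (gt < num_classes)
    idx = num_classes * gt[mask].astype(np.int64) + pred[mask].astype(np.int64)
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def layout_metrics(M):
    # Accuracy, Eq. (2): correctly predicted pixels over all pixels.
    acc = np.diag(M).sum() / M.sum()
    # Per-class precision and recall from the confusion matrix.
    precision = np.diag(M) / np.maximum(M.sum(axis=0), 1)
    recall = np.diag(M) / np.maximum(M.sum(axis=1), 1)
    # F1, Eq. (5), computed per class and then averaged (an illustrative choice).
    f1 = 2 * precision * recall / np.maximum(precision + recall, 1e-12)
    # Mean IoU, Eq. (6): intersection over union per class, then the mean.
    iou = np.diag(M) / np.maximum(M.sum(axis=1) + M.sum(axis=0) - np.diag(M), 1)
    return acc, precision.mean(), recall.mean(), f1.mean(), iou.mean()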
Table 3: Comparing E³Net with previous network structures on the DSSE-200 and CS-150 datasets.

                                  DSSE-200                        CS-150
Method        #Parameters   Acc    P     R     F1    MIoU    Acc    P     R     F1    MIoU
SegNet [4]         29M      0.76  0.71  0.72  0.71  0.49     0.76  0.71  0.72  0.71  0.49
PANet [17]        168M      0.79  0.74  0.72  0.73  0.53     0.96  0.82  0.91  0.87  0.52
PSPNet [49]        46M      0.72  0.69  0.79  0.74  0.51     0.96  0.84  0.97  0.90  0.63
DV3+ [8]           53M      0.78  0.72  0.75  0.73  0.64     0.96  0.81  0.97  0.88  0.63
E³Net               3M      0.82  0.79  0.73  0.76  0.57     0.96  0.85  0.97  0.91  0.64
[Figure residue: a two-branch network diagram (RGB input, conv1-conv4 stages, feature widths 32-256, resolutions I/2 to I/16), likely the two-stream fusion architecture referred to as Fig. 7 in Section 4.4.]
[Table residue: an edge-embedding comparison with the columns Method, Acc, P, R, F1, and MIoU, including an E³Net (Sobel) row; only the caption fragment "...bel, and bilateral filter." is recoverable.]
noting that the edge contains more discriminative information for background regions; therefore, E³Net can obtain good results for background recognition even without fine-tuning. E³Net has a better recognition effect for figures, tables, and sections, and the mean score improves by 4% compared with [47]. For the ICDAR2015 dataset, as listed in Table 5, E³Net with fine-tuning also achieves a 3% mean score improvement over [47]. In the comparison on CS-150, we use precision, recall, and F1 as the metrics, and the results in Table 4 show that ours outperform previous approaches for both the figure and table categories. Specifically, when we use E³Net trained from synthetic documents without fine-tuning, the overall performance is comparable, and the recall rate is high. Fine-tuning brings the model closer to the distribution of the CS-150 data and improves the precision of the whole network.

As mentioned in the previous sections, E³Net is designed based on an FCN-like backbone. Here, we also compare our approach with state-of-the-art networks for the general semantic segmentation task. Table 3 reports the performance on the DSSE-200 and CS-150 datasets with different metrics. We can observe that the proposed E³Net achieves better results, while its parameter size is much smaller than the others.

4.4. Ablation Study

In this section, we perform an ablation study on the DSSE-200 dataset. We start by exploring the variations of the network architecture to find the optimal set of fusion strategies. Then the components in our framework, such as the edge embedding block and the skip connection, are evaluated. Throughout the experiments, we use E³Net w/o X to represent the network of E³Net without component X for presentation simplicity.

Figure 8: Confusion matrix for DSSE-200. Left: E³Net, Right: E³Net (LSB).

Table 8: Evaluation of the dynamic skip connection.

Method                 Acc    P     R     F1    MIoU
E³Net                  0.82  0.79  0.73  0.76  0.57
E³Net w/o DSC          0.73  0.69  0.66  0.67  0.50
E³Net w/o Edge         0.70  0.75  0.62  0.68  0.46
E³Net w/o Edge & DSC   0.67  0.63  0.64  0.64  0.48

Model architecture. The design of an effective network is of great importance for model learning. In this paper, we fuse the color channels and the edge explicitly into the augmented input; another potential approach is to treat them as two independent streams and fuse them in the last few layers. Fig. 7 demonstrates the architecture of two-stream fusion, which consists of two branches, two independent encoders, and one decoder for the fusion. We list the comparison in Table 6. Adding edge information under a two-stream framework improves the performance (for accuracy, F1, and MIoU). However, compared with E³Net, the effect is limited. This demonstrates that fusion at an early stage can take full advantage of the complementarity of color and edge clues.

Edge embedding. In this section, we verify the effect of the EEB block. As shown in Table 7, removing the edge embedding causes a drop in all metrics, e.g., 12% for accuracy and 11% for MIoU. We also compare different edge embedding settings. The first is to use the single-channel Sobel edges (E³Net (Sobel) in Table 7). The results show that E³Net outperforms E³Net (Sobel) by using more edge
representation, probably because of the complementarity of different edge detectors. The second group of experiments involves changing the edge detector in the EEB to other detectors. Here, we replace Canny with the bilateral filter and build the model E³Net (LSB). From the table, we can see that this combination is not as good as the original E³Net. We make the confusion matrix for both networks (Fig. 8) and find that the E³Net with Laplacian, Sobel, and Canny edge detectors has a significantly better representation for the texts and figures.
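To illustrate the edge embedding block described above, here is a minimal sketch (Python with OpenCV; the kernel sizes, Canny thresholds, and normalization are illustrative assumptions, since the exact parameters are not given in this excerpt) that builds the three edge channels with the Laplacian, Sobel, and Canny detectors and stacks them with the RGB channels into a 6-channel input.

import cv2
import numpy as np

def edge_embedded_input(bgr):
    # Build the three edge channels (Laplacian, Sobel, Canny) and stack them
    # with RGB into an H x W x 6 input; parameters are illustrative only.
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)

    lap = np.abs(cv2.Laplacian(gray, cv2.CV_32F, ksize=3))      # Laplacian response
    gx = cv2.Sobel(gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_32F, 0, 1, ksize=3)
    sob = cv2.magnitude(gx, gy)                                  # Sobel gradient magnitude
    can = cv2.Canny(gray, 50, 150).astype(np.float32)            # Canny edge map

    def to_unit(x):  # scale each edge map to [0, 1]
        return (x - x.min()) / (x.max() - x.min() + 1e-6)

    rgb = cv2.cvtColor(bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    edges = np.stack([to_unit(lap), to_unit(sob), to_unit(can)], axis=-1)
    return np.concatenate([rgb, edges], axis=-1)

For the E³Net (LSB) variant discussed above, the Canny channel would instead be derived from a bilateral-filtered image.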
Dynamic skip connection. Incorporating the skip connection has been proven to be useful for many computer vision tasks, and we wonder whether it can promote a document layout analysis system. As shown in Table 8, the substantial performance gains over E³Net w/o DSC confirm the effectiveness of using the dynamic skip connection for the DLA task. Although adding the DSC into a traditional FCN without edges also improves the performance (Table 8, rows 3 and 4), the network combining DSC and edge embedding is improved on a larger scale and is able to show the powerful descriptive ability of document layouts.
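For intuition only, a minimal sketch of what such a dynamic skip connection block could look like is shown below (PyTorch-style Python). The gating design, where a small squeeze-style branch predicts a per-instance, per-channel weight that scales the encoder feature before it is added to the decoder feature, is our assumption for illustration; it is not the paper's exact formulation of the weights W1-W3.

import torch
import torch.nn as nn

class DynamicSkipConnection(nn.Module):
    # Illustrative dynamic skip: predict a per-instance, per-channel weight
    # from the encoder feature and use it to scale the skip before fusion.
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # per-instance statistics
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),                                   # dynamic weight in (0, 1)
        )

    def forward(self, enc_feat, dec_feat):
        w = self.gate(enc_feat)            # shape: N x C x 1 x 1
        return dec_feat + w * enc_feat     # weighted overlap added to the decoder

In an encoder-decoder like E³Net, one such block would presumably sit on each skip path between an encoder stage and the decoder stage of the same resolution.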
Speed. The proposed framework is trained and evaluated on a GPU. Inference using an image with a size of 512 × 384 pixels takes 0.12 seconds with a single Nvidia Titan Xp, meaning that our whole system can generally process approximately 8 document images per second.
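For reference, the reported rate follows directly from the measured per-image latency:

\text{throughput} \approx \frac{1}{0.12\ \text{s/image}} \approx 8.3\ \text{images/s},

which the text rounds to approximately 8 document images per second.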
5. Conclusions

In this paper, we presented a novel solution for constructing a universal document layout analysis model. Our approach explored the use of the dynamic skip connection block and edge information to improve the model structure, together with the construction of a complete synthetic data scheme. We presented a dynamic skip connection block that can be dynamically provisioned based on specific instances. We use the edge embedding block to make the model focus more on the text content. In addition, we discussed the feasibility of the fusion strategy with the edge. Experimental comparisons with the state-of-the-art approaches on DSSE-200, CS-150, and ICDAR2015 showed the effectiveness and efficiency of our proposed E³Net for the document layout analysis task.
References

[1] D. Acuna, A. Kar, and S. Fidler. Devil is in the edges: Learning semantic boundaries from noisy annotations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 11075–11083, 2019.
[2] A. Antonacopoulos, C. Clausner, C. Papadopoulos, and S. Pletschacher. ICDAR2015 competition on recognition of documents with complex layouts - RDCL2015. In IAPR International Conference on Document Analysis and Recognition, pages 1151–1155, 2015.
[3] A. Asi, R. Cohen, K. Kedem, and J. El-Sana. Simplifying the reading of historical manuscripts. In IAPR International Conference on Document Analysis and Recognition, pages 826–830, 2015.
[4] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
[5] G. M. Binmakhashen and S. A. Mahmoud. Document layout analysis: A comprehensive survey. ACM Computing Surveys, 52(6):109, 2019.
[6] G. M. BinMakhashen and S. A. Mahmoud. Historical document layout analysis using anisotropic diffusion and geometric features. International Journal on Digital Libraries, pages 1–14, 2020.
[7] R. Campos, V. Mangaravite, A. Pasquali, A. Jorge, C. Nunes, and A. Jatowt. Yake! keyword extraction from single documents using multiple local features. Information Sciences, 509:257–289, 2020.
[8] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In European Conference on Computer Vision, pages 801–818, 2018.
[9] C. Clark and S. Divvala. Pdffigures 2.0: Mining figures from research papers. In ACM/IEEE on Joint Conference on Digital Libraries, pages 143–152, 2016.
[10] C. A. Clark and S. Divvala. Looking beyond text: Extracting figures, tables and captions from computer science papers. In Workshops at AAAI Conference on Artificial Intelligence, 2015.
[11] L. Ding and A. Goshtasby. On the canny edge detector. Pattern Recognition, 34(3):721–725, 2001.
[12] Z. Fu, T. Ma, Y. Zheng, H. Ye, J. Yang, and L. He. Edge-aware deep image deblurring. arXiv:1907.02282, 2019.
[13] M. Haurilet, Z. Al-Halah, and R. Stiefelhagen. Spase - multi-label page segmentation for presentation slides. In IEEE Winter Conference on Applications of Computer Vision, pages 726–734, 2019.
[14] D. He, S. Cohen, B. Price, D. Kifer, and C. L. Giles. Multi-scale multi-task fcn for semantic page segmentation and table detection. In IAPR International Conference on Document Analysis and Recognition, 2017.
[15] J. Kittler. On the accuracy of the sobel edge detector. Image and Vision Computing, 1(1):37–42, 1983.
[16] A. Kölsch, A. Mishra, S. Varshneya, M. Z. Afzal, and M. Liwicki. Recognizing challenging handwritten annotations with fully convolutional networks. In International Conference on Frontiers in Handwriting Recognition, pages 25–31, 2018.
[17] H. Li, P. Xiong, J. An, and L. Wang. Pyramid attention network for semantic segmentation. In British Machine Vision Conference, 2018.
[18] J. Li, J. Yang, A. Hertzmann, J. Zhang, and T. Xu. Layoutgan: Generating graphic layouts with wireframe discriminators. In International Conference on Learning Representations, 2019.
[19] K. Li, C. Wigington, C. Tensmeyer, H. Zhao, N. Barmpalios, V. I. Morariu, V. Manjunatha, T. Sun, and Y. Fu. Cross-domain document object detection: Benchmark suite and method. In IEEE Conference on Computer Vision and Pattern Recognition, pages 12915–12924, 2020.
[20] Y. Li, Y. Zou, and J. Ma. Deeplayout: A semantic segmentation approach to page layout analysis. In International Conference on Intelligent Computing (ICIC), pages 266–277, 2018.
[21] C. Lin, S. Zhuang, S. You, X. Liu, and Z. Zhu. Real-time foreground object segmentation networks using long and short skip connections. Information Sciences, 2021.
[22] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755, 2014.
[23] H. Liu, R. Xiong, Q. Song, F. Wu, and W. Gao. Image super-resolution based on adaptive joint distribution modeling. In IEEE Visual Communications and Image Processing, 2017.
[24] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[25] Y. Lu and C. L. Tan. Constructing area voronoi diagram in document images. In IAPR International Conference on Document Analysis and Recognition, pages 342–346, 2005.
[26] G. Mandal and D. Bhattacharjee. Learning-based single image super-resolution with improved edge information. Pattern Recognition and Image Analysis, 30(3):391–400, 2020.
[27] M. Mehri, P. Héroux, P. Gomez-Krämer, and R. Mullot. Texture feature benchmarking and evaluation for historical document image analysis. International Journal on Document Analysis and Recognition, 20(1):1–35, 2017.
[28] P. A. Praczyk and J. Nogueras-Iso. Automatic extraction of figures from scientific publications in high-energy physics. Information Technology and Libraries, 32(4):25–52, 2013.
[29] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241, 2015.
[30] F. Shafait and T. M. Breuel. The effect of border noise on the performance of projection-based page segmentation methods. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):846–851, 2010.
[31] F. Shafait, J. Van Beusekom, D. Keysers, and T. M. Breuel. Background variability modeling for statistical layout analysis. In International Conference on Pattern Recognition, pages 1–4, 2008.
[32] N. Siegel, N. Lourie, R. Power, and W. Ammar. Extracting scientific figures with distantly supervised neural networks. In ACM/IEEE on Joint Conference on Digital Libraries, pages 223–232, 2018.
[33] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations, 2015.
[34] Y. Soullard, P. Tranouez, C. Chatelain, S. Nicolas, and T. Paquet. Multi-scale gated fully convolutional densenets for semantic labeling of historical newspaper images. Pattern Recognition Letters, 131:435–441, 2020.
[35] W. Swaileh, K. A. Mohand, and T. Paquet. Multi-script iterative steerable directional filtering for handwritten text line extraction. In IAPR International Conference on Document Analysis and Recognition, 2015.
[36] T. Takikawa, D. Acuna, V. Jampani, and S. Fidler. Gated-scnn: Gated shape cnns for semantic segmentation. In International Conference on Computer Vision, pages 5229–5238, 2019.
[37] T. A. Tran, I.-S. Na, and S.-H. Kim. Hybrid page segmentation using multilevel homogeneity structure. In International Conference on Ubiquitous Information Management and Communication (IMCOM), 2015.
[38] N. Vasilopoulos and E. Kavallieratou. Complex layout analysis based on contour classification and morphological operations. Engineering Applications of Artificial Intelligence, 65:220–229, 2017.
[39] K. Vyas and F. Frasincar. Determining the most representative image on a web page. Information Sciences, 512:1234–1248, 2020.
[40] X. Wang. Laplacian operator-based edge detectors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5):886–890, 2007.
[41] C. Wick and F. Puppe. Fully convolutional neural networks for page segmentation of historical document images. In IAPR International Workshop on Document Analysis Systems, pages 287–292, 2018.
[42] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou. Look at boundary: A boundary-aware face alignment algorithm. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2129–2138, 2018.
[43] X. Wu, Z. Hu, X. Du, J. Yang, and L. He. Document layout analysis via dynamic residual feature fusion. In IEEE International Conference on Multimedia & Expo (ICME), 2021.
[44] X. Wu, Y. Zheng, H. Ye, W. Hu, T. Ma, J. Yang, and L. He. Counting crowds with varying densities via adaptive scenario discovery framework. Neurocomputing, 397:127–138, 2020.
[45] Y. Xu, M. Li, L. Cui, S. Huang, F. Wei, and M. Zhou. Layoutlm: Pre-training of text and layout for document image understanding. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1192–1200, 2020.
[46] Y. Xu, F. Yin, Z. Zhang, and C.-L. Liu. Multi-task layout analysis for historical handwritten documents using fully convolutional networks. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1057–1063, 2018.
[47] X. Yang, E. Yumer, P. Asente, M. Kraley, D. Kifer, and C. Lee Giles. Learning to extract semantic structure from documents using multimodal fully convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 5315–5324, 2017.
[48] C. Yuan, H. Huang, C. Feng, G. Shi, and X. Wei. Document-level relation extraction with entity-selection attention. Information Sciences, 568:163–174, 2021.
[49] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2881–2890, 2017.
[50] W. Zhao, J. Zhang, J. Yang, T. He, H. Ma, and Z. Li. A novel joint biomedical event extraction framework via two-level modeling of documents. Information Sciences, 550:27–40, 2021.
[51] X. Zheng, X. Qiao, Y. Cao, and R. W. Lau. Content-aware generative modeling of graphic design layouts. ACM Transactions on Graphics (TOG), 38(4):1–15, 2019.
[52] Y. Zheng, S. Kong, W. Zhu, and H. Ye. Scalable document image information extraction with application to domain-specific analysis. In IEEE International Conference on Big Data, 2019.