Convolutional Neural Networks For Page Segmentation of Historical Document Images

Abstract—This paper presents a page segmentation method for handwritten historical document images based on a Convolutional Neural Network (CNN). We consider page segmentation as a pixel labeling problem, i.e., each pixel is classified as one of the predefined classes. Traditional methods in this area rely on hand-crafted features carefully tuned using prior knowledge. In contrast, we propose to learn features from raw image pixels using a CNN. While many researchers focus on developing deep CNN architectures to solve different problems, we train a simple CNN with only one convolution layer. We show that this simple architecture achieves competitive results against other deep architectures on different public datasets. Experiments also demonstrate the effectiveness and superiority of the proposed method compared to previous methods.

Keywords-convolutional neural network; page segmentation; layout analysis; historical document images; deep learning

I. INTRODUCTION

Page segmentation is an important prerequisite step of document image analysis and understanding. The goal is to split a document image into regions of interest. Compared to the segmentation of machine-printed document images, page segmentation of historical document images is more challenging due to many variations such as layout structure, decoration, writing style, and degradation. Our goal is to develop a generic segmentation method for handwritten historical documents. In this method, we consider the segmentation problem as a pixel-labeling problem, i.e., for a given document image, each pixel is labeled as one of the predefined classes.

Some page segmentation methods have been developed recently. These methods rely on hand-crafted features [1], [2], [3], [4], on prior knowledge [5], [6], [7], or on models that combine hand-crafted features with domain knowledge [8], [9]. In contrast, in this paper our goal is to develop a more general method which automatically learns features from the pixels of document images. Elements such as strokes of words, words in sentences, and sentences in paragraphs form a hierarchical structure from low to high levels, and these patterns are repeated in different parts of the documents. Based on these properties, feature learning algorithms can be applied to learn the layout information of document images.

A Convolutional Neural Network (CNN) is a feed-forward artificial neural network which shares weights among neurons in the same layer. By enforcing a local connectivity pattern between neurons of adjacent layers, a CNN can discover spatial correlations at different granularities of local context [10]. With multiple convolutional and pooling layers, CNNs have achieved many successes in various fields, e.g., handwriting recognition [11], image classification [12], and text recognition in natural images [13].

In [14], the authors show that an autoencoder can be used to learn features automatically from the training images. An autoencoder is a feed-forward neural network trained to reconstruct its input; the hidden layer outputs are then used as features to feed an off-the-shelf classifier. In [15], the authors show that using superpixels as the units of labeling increases the speed of the method. In [16], a Conditional Random Field (CRF) [17] is applied in order to model the local and contextual information jointly and thus refine the segmentation results achieved in [15]. Following the same idea as [16], we consider the segmentation problem as an image patch labeling problem, where the image patches are generated with a superpixel algorithm. In contrast to [14], [15], [16], in this work we focus on developing an end-to-end method: feature learning and classifier training are combined into one step. Image patches are used as input to train a CNN for the labeling task, and during training the features used to predict the labels of the image patches are learned in the convolution layers of the CNN.

While many researchers focus on developing very deep CNNs to solve various problems [12], [18], [19], we train a simple CNN with one convolution layer. Experiments on public historical document image datasets show that, despite the simple structure and little tuning of hyperparameters, the proposed method achieves results comparable to other CNN architectures.

II. METHODOLOGY

In order to create a general page segmentation method that does not use any prior knowledge of the layout structure of the documents, we consider the page segmentation problem as a pixel labeling problem and propose to use a CNN for the pixel labeling task. The main idea is to learn a set of feature detectors and to train a nonlinear classifier on the features extracted by these detectors. With the set of feature detectors and the classifier, pixels on unseen document images can be classified into the different classes.
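As a rough illustration of this patch-labeling setup, the sketch below builds a one-convolution-layer classifier in PyTorch (the paper's own implementation uses Theano); the patch size, number of kernels, kernel size, hidden width, and number of classes used here are placeholder assumptions rather than the values used in the paper.

# Minimal sketch of a one-convolution-layer patch classifier.
# PyTorch is used for illustration only; all hyperparameter values are assumptions.
import torch
import torch.nn as nn

class OneConvPatchClassifier(nn.Module):
    def __init__(self, patch_size=28, num_kernels=4, kernel_size=5, num_classes=5):
        super().__init__()
        # Single convolution layer: the learned kernels act as the feature detectors.
        self.conv = nn.Conv2d(1, num_kernels, kernel_size)
        self.relu = nn.ReLU()
        # Nonlinear classifier trained jointly on the convolutional features.
        feat_dim = num_kernels * (patch_size - kernel_size + 1) ** 2
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_classes),
        )

    def forward(self, patches):
        # patches: (batch, 1, patch_size, patch_size) grayscale patches, each
        # centered on the image location whose label is being predicted.
        return self.classifier(self.relu(self.conv(patches)))

model = OneConvPatchClassifier()
logits = model(torch.randn(32, 1, 28, 28))   # a batch of 32 synthetic patches
predicted_classes = logits.argmax(dim=1)

Because feature learning and classification live in one network, a single cross-entropy training loop updates the feature detectors and the classifier together, which is the end-to-end property described above.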
Table I: Details of the training, test, and validation sets. TR, TE, and VA denote the training, test, and validation sets respectively.

dataset         image size (pixels)   |TR|   |TE|   |VA|
G. Washington   2200 × 3400            10      5      4
St. Gall        1664 × 2496            20     30     10
Parzival        2000 × 3008            20     13      2
CB55            4872 × 6496            20     10     10
CSG18           3328 × 4992            20     10     10
CSG863          3328 × 4992            20     10     10

B. Metrics

The most commonly used metrics for page segmentation of historical document images are precision, recall, and pixel-level accuracy. Besides these standard metrics, we also adapt the metrics that are well defined and widely used for semantic segmentation and scene parsing evaluations in order to evaluate the different page segmentation methods. These metrics were proposed in [26] and are based on pixel accuracy and region intersection over union (IU). Consequently, the metrics used in the experiments are: pixel accuracy, mean pixel accuracy, mean IU, and frequency weighted IU (f.w. IU).

In order to obtain these metrics, we define the following variables:

• n_c: the number of classes.
• n_ij: the number of pixels of class i predicted to belong to class j. For class i:
  – n_ii: the number of correctly classified pixels (true positives).
  – n_ij (j ≠ i): the number of pixels of class i wrongly assigned to class j (false negatives for class i).
  – n_ji (j ≠ i): the number of pixels of class j wrongly assigned to class i (false positives for class i).
• t_i: the total number of pixels in class i, such that

    t_i = \sum_j n_{ij}.    (4)

With the defined variables, we can compute:

• pixel accuracy:

    acc = \frac{\sum_i n_{ii}}{\sum_i t_i}.    (5)

• mean accuracy:

    acc_{mean} = \frac{1}{n_c} \sum_i \frac{n_{ii}}{t_i}.    (6)

• mean IU:

    iu_{mean} = \frac{1}{n_c} \sum_i \frac{n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}.    (7)

• f.w. IU:

    iu_{weighted} = \frac{1}{\sum_k t_k} \sum_i \frac{t_i \, n_{ii}}{t_i + \sum_j n_{ji} - n_{ii}}.    (8)
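For concreteness, the four metrics can be computed directly from a confusion matrix. The short sketch below is not code from the paper; it simply follows the definitions above, with n[i, j] holding the number of pixels of ground-truth class i predicted as class j.

# Sketch of Eqs. (4)-(8) computed from a confusion matrix (NumPy).
import numpy as np

def segmentation_metrics(n):
    # n[i, j]: number of pixels of ground-truth class i predicted as class j.
    # Assumes every class has at least one ground-truth pixel (no zero division).
    n = np.asarray(n, dtype=np.float64)
    t = n.sum(axis=1)                # Eq. (4): t_i, total pixels of class i
    diag = np.diag(n)                # n_ii, correctly classified pixels
    predicted = n.sum(axis=0)        # sum_j n_ji, pixels predicted as class i
    union = t + predicted - diag     # per-class union used by the IU metrics

    pixel_acc = diag.sum() / t.sum()              # Eq. (5)
    mean_acc = np.mean(diag / t)                  # Eq. (6)
    mean_iu = np.mean(diag / union)               # Eq. (7)
    fw_iu = np.sum(t * diag / union) / t.sum()    # Eq. (8)
    return pixel_acc, mean_acc, mean_iu, fw_iu

# Toy example with three classes.
confusion = np.array([[50, 2, 3],
                      [4, 40, 1],
                      [0, 5, 45]])
print(segmentation_metrics(confusion))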
C. Evaluation

We compare the proposed method to the previous methods [15], [16]. As in the proposed method, superpixels are considered as the basic units of labeling. In [15], the features are learned on randomly selected grayscale image patches with a stacked convolutional autoencoder in an unsupervised manner. The features and the labels of the superpixels are then used to train a classifier, and with the trained classifier, superpixels are classified into the different classes. In [16], a Conditional Random Field (CRF) is applied in order to model the local and contextual information jointly for the superpixel labeling task. The trained classifier of [15] is considered as the local classifier in [16]. The local classifier is then used to train a contextual classifier, which takes the output of the local classifier as input and outputs the scores of the given labels. With the local and contextual classifiers, a CRF is trained to label the superpixels of a given image. In the experiments, we use a multilayer perceptron (MLP) as the local classifier in [15], [16] and an MLP as the contextual classifier in [16]. The Simple Linear Iterative Clustering (SLIC) algorithm [20] is applied to generate the superpixels; the superiority of SLIC over other superpixel algorithms is demonstrated in [15]. In the experiments, 3000 superpixels are generated for each image.

Table II reports the pixel accuracy, mean pixel accuracy, mean IU, and f.w. IU of the three methods. It shows that the proposed CNN outperforms the previous methods. Figure 2 gives the segmentation results of the three methods. We can see that, visually, the CNN achieves more accurate segmentation results than the other methods.
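As a usage note on the superpixel generation step described above, the sketch below produces roughly 3000 SLIC superpixels per image with scikit-image; the paper does not state which SLIC implementation it uses, and the compactness value here is an assumption.

# Sketch of superpixel generation with SLIC [20] via scikit-image (assumed implementation).
import numpy as np
from skimage.segmentation import slic

def generate_superpixels(image, n_segments=3000):
    # image: H x W x 3 RGB array; returns an H x W integer label map in which
    # each labeled region is one superpixel, i.e., one unit of labeling.
    return slic(image, n_segments=n_segments, compactness=10, start_label=0)

labels = generate_superpixels(np.random.rand(480, 640, 3))
print(labels.max() + 1, "superpixels generated")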
D. Max Pooling

Pooling is a widely used technique in CNNs. Max pooling is the most common type of pooling; it is applied in order to reduce the spatial size of the representation and thereby the number of parameters of the network. In order to show the impact of max pooling on the segmentation task, we add a max pooling layer after the convolution layer. The pooling size is 2 × 2 pixels. Table II reports the performance of the CNN with a max pooling layer. We can see that only on the CB55 dataset are the mean pixel accuracy and mean IU slightly improved by max pooling. In general, adding a max pooling layer does not improve the performance on the segmentation task. Figure 3 reports the f.w. IU of the CNN with different max pooling sizes. We define the max pooling size as m × m, such that m ∈ {2n | n ∈ N, 0 ≤ n ≤ 13}. We can see that increasing the pooling size decreases the performance. The reason is that for some computer vision problems, e.g., object recognition and text extraction in natural images, the exact location of a feature is less important than its rough location relative to other features. However, for a given document image, to label the pixel in the center of a patch it is not sufficient to know whether there is text somewhere in that patch; the location of the text is needed. Therefore, the exact location of a feature is helpful for the page segmentation task.
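Under the same illustrative assumptions as the patch-classifier sketch in Section II, the max-pooling variant examined here amounts to a single extra 2 × 2 pooling layer after the convolution; a minimal sketch of that change:

# Max-pooling variant of the earlier illustrative patch classifier
# (PyTorch for illustration; patch and kernel sizes remain assumptions).
import torch
import torch.nn as nn

conv_then_pool = nn.Sequential(
    nn.Conv2d(1, 4, 5),   # one convolution layer, as in the base sketch
    nn.ReLU(),
    nn.MaxPool2d(2),      # 2 x 2 max pooling halves each spatial dimension
    nn.Flatten(),
)
# A 28 x 28 patch now yields 4 x 12 x 12 = 576 features instead of 4 x 24 x 24.
features = conv_then_pool(torch.randn(32, 1, 28, 28))
print(features.shape)  # torch.Size([32, 576])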
Table II: Performance (in percentage) of superpixel labeling with only local MLP, CRF, and the proposed CNN.
Figure 2: Segmentation results on the Parzival, CB55, and CSG863 datasets from top to bottom respectively. The colors: black, white,
blue, red, and pink are used to represent: periphery, page, text, decoration, and comment respectively. The columns from left to right are:
input, ground truth, and segmentation results of the local MLP, CRF, and CNN respectively.
Figure 4 reports the f.w. IU of the CNN on different numbers of kernels. We can see that, except on the CSG18 dataset, the performance is not improved when K ≥ 4.
Figure 3: f.w. IU of the CNN on different max pooling sizes.

Figure 4: f.w. IU of the CNN on different numbers of kernels.

Figure 5: f.w. IU of the CNN on different numbers of conv layers.

Figure 5 reports the f.w. IU of the CNN on different numbers of convolution layers. It shows that the number of layers does not affect the performance on the segmentation task. However, on the G. Washington dataset, the performance degrades slightly with more layers. The reason is that, compared to the other datasets, the G. Washington dataset has fewer training images. Furthermore, the layouts of the pages in the G. Washington dataset are more varied.

G. Number of Training Images

Figure 6: f.w. IU of the CNN on different numbers of training images.

dataset the pages are more varied and the ground truth is less consistent.

H. Run Time

The proposed CNN is implemented with the Python library Theano [27]. The experiments are performed on a PC with an Intel Core i7-3770 3.4 GHz processor and 16 GB of RAM. On average, the CNN takes about 1 second of processing time per image. The superpixel labeling method [15] and the CRF model [16] take about 2 and 5 seconds per image respectively.

IV. CONCLUSION

In this paper, we have proposed a convolutional neural network (CNN) for page segmentation of handwritten historical document images. In contrast to traditional page segmentation methods, which rely on off-the-shelf classifiers trained with hand-crafted features, the proposed method learns features directly from image patches. Furthermore, feature learning and classifier training are combined into one step. Experiments on public datasets show the superiority of the proposed method over previous methods. While many researchers focus on applying very deep CNN architectures to different tasks, we show that a simple CNN with one convolution layer achieves performance comparable to other network architectures.

ACKNOWLEDGMENT

This work is supported by the Swiss National Science Foundation project HisDoc 2.0 (grant number 205120 150173) and by the National Natural Science Foundation of China (grant numbers 61202257 and 61650110512).
[4] K. Chen, H. Wei, J. Hennebert, R. Ingold, and M. Liwicki, “Page segmentation for historical handwritten document images using color and texture features,” in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 488–493.

[5] M. Bulacu, R. van Koert, L. Schomaker, and T. van der Zant, “Layout analysis of handwritten historical documents for searching the archive of the cabinet of the dutch queen,” in Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), vol. 1. IEEE, 2007, pp. 357–361.

[6] C. Panichkriangkrai, L. Li, and K. Hachimura, “Character segmentation and retrieval for learning support system of japanese historical books,” in Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing. ACM, 2013, pp. 118–122.

[7] B. Gatos, G. Louloudis, and N. Stamatopoulos, “Segmentation of historical handwritten documents into text zones and text lines,” in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 464–469.

[8] R. Cohen, A. Asi, K. Kedem, J. El-Sana, and I. Dinstein, “Robust text and drawing segmentation algorithm for historical documents,” in Proceedings of the 2nd International Workshop on Historical Document Imaging and Processing. ACM, 2013, pp. 110–117.

[9] A. Asi, R. Cohen, K. Kedem, J. El-Sana, and I. Dinstein, “A coarse-to-fine approach for layout analysis of ancient manuscripts,” in Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on. IEEE, 2014, pp. 140–145.

[10] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.

[11] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural computation, vol. 1, no. 4, pp. 541–551, 1989.

[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in neural information processing systems, 2012, pp. 1097–1105.

[13] T. Wang, D. J. Wu, A. Coates, and A. Y. Ng, “End-to-end text recognition with convolutional neural networks,” in Pattern Recognition (ICPR), 2012 21st International Conference on. IEEE, 2012, pp. 3304–3308.

[14] K. Chen, M. Seuret, M. Liwicki, J. Hennebert, and R. Ingold, “Page segmentation of historical document images with convolutional autoencoders,” in Document Analysis and Recognition (ICDAR), 2015 13th International Conference on. IEEE, 2015, pp. 1011–1015.

[15] K. Chen, C.-L. Liu, M. Seuret, M. Liwicki, J. Hennebert, and R. Ingold, “Page segmentation for historical document images based on superpixel classification with unsupervised feature learning,” in Document Analysis Systems (DAS), 2016 12th IAPR International Workshop on. IEEE, 2016, pp. 299–304.

[16] K. Chen, M. Seuret, M. Liwicki, J. Hennebert, C.-L. Liu, and R. Ingold, “Page segmentation for historical handwritten document images using conditional random fields,” in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on. IEEE, 2016, pp. 90–95.

[17] J. Lafferty, A. McCallum, and F. Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” in Proceedings of the eighteenth international conference on machine learning, ICML, vol. 1, 2001, pp. 282–289.

[18] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.

[19] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.

[20] R. Achanta, A. Shaji, K. Smith, A. Lucchi, P. Fua, and S. Süsstrunk, “Slic superpixels compared to state-of-the-art superpixel methods,” IEEE transactions on pattern analysis and machine intelligence, vol. 34, no. 11, pp. 2274–2282, 2012.

[21] V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th international conference on machine learning (ICML-10), 2010, pp. 807–814.

[22] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010, pp. 249–256.

[23] N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.

[24] K. Chen, M. Seuret, H. Wei, M. Liwicki, J. Hennebert, and R. Ingold, “Ground truth model, tool, and dataset for layout analysis of historical documents,” in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2015, pp. 940204–940204.

[25] F. Simistira, M. Seuret, N. Eichenberger, A. Garz, M. Liwicki, and R. Ingold, “Diva-hisdb: A precisely annotated large dataset of challenging medieval manuscripts,” in Frontiers in Handwriting Recognition (ICFHR), 2016 15th International Conference on. IEEE, 2016, pp. 471–476.

[26] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.

[27] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: a cpu and gpu math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), 2010.