Libcrn, An Open-Source Document Image Processing Library
Yann LEYDIER, Jean DUONG
CoReNum
F-69006, France
[email protected], [email protected]

Stéphane BRÈS, Véronique ÉGLIN, Frank LeBOURGEOIS, Martial TOLA
Université de Lyon, CNRS
INSA-Lyon, LIRIS, UMR5205,
F-69621, France
fi[email protected]
Abstract: In this paper we introduce libcrn, a multiplatform open-source document image
processing library aimed at researchers and companies. It is written in C++11 and has a
non-contaminating license that makes it available for use in any project without legal
constraints.
The features include low-level image processing (color format conversion, binarization,
convolution, PDE...), document image specific tools (connected components extraction,
recursive block description, PDF export...), maths (matrix arithmetic, linear algebra,
GMMs, equation solvers...), classification and clustering (kNN, k-means, HMMs...).
The API is comprehensively documented and libcrn's architecture follows modern C++
guidelines to facilitate the handling of the library and enforce its safe usage.
A sample OCR, which is only 30 lines long, is described to illustrate libcrn's scope of
possibilities.

Keywords: document image processing; open source; library; toolbox

I. INTRODUCTION

Implementing the most basic document image processing algorithm may be a good exercise
for students. However, when focusing on high-level processing chains or complex methods,
researchers as well as manufacturers rely on toolboxes and software libraries for
low-level tasks.

Commercial software libraries are available (Intel IPP, Lead Tools...) and used in
industry. They generally offer the basic tools needed to perform simple image manipulation
and are well suited for non-specialist engineers. Specialists often prefer Open Source
libraries as it is possible to check the details of the algorithms and to modify and
fine-tune them when needed.

The most widely used image processing library is OpenCV¹. Although it contains a great
number of algorithms, it was not originally meant for document images and lacks features
and services, which makes it inconvenient for this purpose.

Qgar² [1] is an Open Source document image processing library created in the early 2000s.
It features the most elementary tools to create document analysis software but also lacks
some crucial features such as RGB images. Although Qgar can be easily extended, its
development has been stalled since 2008.

We introduce libcrn³, licensed under the LGPL (a non-contaminating Open Source license).
Its aim is to allow both researchers and engineers to implement document image processing
chains and algorithms. libcrn is available for Windows (Visual C++ 2015), Linux, MacOS and
Android. It is written in C++11 using the "modern" guidelines issued by the C++ committee
so that users can easily and safely use it (e.g.: no memory management is required from
the users and no leak can happen). We implemented many image processing algorithms but
also the mathematical tools needed to process the data that can be extracted from images.

¹ http://opencv.org/
² http://www.qgar.org/
³ https://github.com/Liris-Pleiad/libcrn

II. FEATURES

A. General points

In order to facilitate the storage of data, most of the objects in libcrn can be
serialized in XML files. Multi-platform utilities are packaged so that no extra work
weighs on the user to make applications run on any OS (e.g.: automatic file path format
conversion, file manipulation, character set conversion, dynamically loaded modules...).

B. Image

1) Formats: We provide built-in support for numerous pixel types (see tab. I). Any other
type of pixel format (such as matrices!) is supported as long as it implements the basic
arithmetic operators.

Category        Subcategory        Types
Color           RGB-based          RGB, HSV, YUV (television)
                Perception-based   XYZ, L*a*b*, L*u*v*
Monochromatic   Grayscale          double, int, byte
                Binary             bool
Other           2D vectors         Cartesian and polar coords
                Angles             radian, degree and byte
                Custom             any type with arithmetic ops

Table I. Pixel types supported by libcrn.

All the color types are trivially convertible and multiple binarization methods are
offered: Niblack, Sauvola [2], local min or max, Fisher, entropy and Otsu [3].
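As an illustration of one of the binarization methods listed above, the following is a minimal sketch of Otsu's global thresholding [3] on an 8-bit grayscale buffer. It relies only on the C++ standard library. The names otsuThreshold and binarize are hypothetical and introduced for this example; they are not libcrn's API, and libcrn's own implementation may differ.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not libcrn's API): returns the gray level t that
// maximizes the between-class variance, where pixels <= t form the dark class.
int otsuThreshold(const std::vector<std::uint8_t> &pixels)
{
    // Gray-level histogram.
    std::array<std::size_t, 256> hist{};
    for (auto p : pixels)
        ++hist[p];

    const double total = static_cast<double>(pixels.size());
    double sumAll = 0.0;
    for (int g = 0; g < 256; ++g)
        sumAll += g * static_cast<double>(hist[g]);

    double sumDark = 0.0, weightDark = 0.0, bestVar = -1.0;
    int best = 0;
    for (int t = 0; t < 256; ++t)
    {
        weightDark += static_cast<double>(hist[t]);
        if (weightDark == 0.0)
            continue; // no pixel in the dark class yet
        const double weightLight = total - weightDark;
        if (weightLight == 0.0)
            break; // every pixel is already in the dark class
        sumDark += t * static_cast<double>(hist[t]);
        const double meanDark = sumDark / weightDark;
        const double meanLight = (sumAll - sumDark) / weightLight;
        const double diff = meanDark - meanLight;
        // Between-class variance, up to a constant factor 1 / total^2.
        const double var = weightDark * weightLight * diff * diff;
        if (var > bestVar)
        {
            bestVar = var;
            best = t;
        }
    }
    return best;
}

// Hypothetical helper: true marks a dark (ink) pixel, i.e. a value <= threshold.
std::vector<bool> binarize(const std::vector<std::uint8_t> &pixels, int threshold)
{
    std::vector<bool> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i)
        out[i] = pixels[i] <= threshold;
    return out;
}
```

A caller would typically compute the threshold once per image and then call binarize(pixels, otsuThreshold(pixels)). Local methods such as Niblack or Sauvola instead compute one threshold per pixel from a sliding window, which this sketch does not cover.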
4) Geometry: libcrn provides angle arithmetic and tools such as circular mean and
variance, as well as circular histogram utilities: circular earth mover's distance [11],
kurtosis, trigonometric moments...

5) Signal processing: FFT can be applied to complex matrices and vectors.

E. Pattern analysis

1) Clustering: libcrn provides clustering algorithms for both vectorial and metric data:
• k-means
• k-medoids (PAM and fast [12])
• Outliers: LOF and LoOP [13]
• Spectral Clustering (all formulas from [14]–[17])
• Affinity Propagation [18]

2) Classification: Many data classification problems can be addressed, using a wide
variety of tools from a highly generic kNN implementation to discrete or Gaussian
semi-continuous HMMs.

3) Other: Other "combinatorics oriented" algorithms are also available in libcrn, such as
Kuhn and Munkres' "Hungarian" bipartite matching algorithm, the A* path finder or the
disjoint set forest distribution.

F. GUI

Custom widgets are provided to create applications with Gtkmm (Gtk's C++ wrapper),
versions 2 and 3. They cover displaying and browsing the block structure of an image,
image overlays and the automated generation of configuration panels. A Qt widget library
will be available soon.

A demonstration application is included in libcrn. It makes it possible to quickly test
some features of the library on any image (see fig. 4). This tool is very handy for
running simple algorithms over a given image without writing a new program.

Figure 4. Titus, libcrn's quick testing tool.

III. 30 LINES FOR AN OCR

To illustrate libcrn's ease of use, we present a very simple OCR engine that is only 30
lines long. It works on a medieval manuscript excerpt written in capital letters with no
spacing between words (see fig. 5). An occurrence of each letter in the alphabet was
manually extracted and put in a folder named data, where each file name corresponds to the
image's label. The source code is displayed in fig. 6.

Figure 5. Sample medieval manuscript.

The first step (fig. 6.1) is to create a feature extractor that will be used to compare
each character to the database. To do that, we use a FeatureSet, which can contain
multiple elementary feature extractors. The FeatureSet will concatenate the feature
vectors extracted by each FeatureExtractor. For simplicity, we use the four profile
projections and the horizontal and vertical black pixel projections.

In step 2, we open each pre-labelled character image. As it is often impossible to know
whether an image file contains RGB, grayscale or binary pixels, we directly store the
image in a Block object. A Block automatically converts the input image to the format
desired by the user (the default grayscaling and binarization methods can be changed
programmatically at any time). It also contains named lists of sub-blocks that will be
used later. The extracted features are stored in a list of shared pointers⁵ to the base
class Object.

⁵ Shared pointers are memory management objects that delete the object they point to
when it is no longer referenced.

The line segmentation is performed in step 3 (fig. 6.3). The document image is opened in
the same way as the database images. We create a temporary BlockTreeExtractor that will
extract the text lines and store them in a sub-block list named Lines.

Just before we actually extract the characters (fig. 6.4), we get an estimation of the
mean stroke width. This will help us filter the noise. Connected components are extracted
within each line. After that, each Block in the Lines sub-block list will contain a
sub-block list named Characters. The recursive sub-block lists can be used for more
complex purposes and can even match the structure of an ALTO XML file. Finally, connected
components smaller than the mean stroke width are removed and the remaining ones are
sorted from left to right.

The 5th and last step is the actual recognition. Each Block remaining in the Characters
sub-sub-list represents a letter in the text. Its feature vector is extracted using the
same FeatureSet as the database. We search for its nearest neighbor in the database and
retrieve a class number that can be used to compute the character's transcription.

Now that our homemade OCR is fully described, we shall not discuss its performance:
profile projections are not known to be the best features for medieval manuscript
recognition! The purpose here was to illustrate the ease of designing applications with
libcrn. A variety of document analysis problems can be addressed, not restricted to
ancient scripts: printed pages or manuscripts may be considered, and historic or business
documents may be processed. Even more borderline tasks are feasible (e.g.: text extraction
from scenes, plate recognition, mobile applications, etc.).

IV. CONCLUSION

In this paper we introduced libcrn, a multiplatform open-source document image processing
library with a non-contaminating license, written in C++11 and aimed at researchers and
companies. It is available for Windows (Visual C++ 2015), Linux, MacOS and Android. Its
API is comprehensively documented and libcrn's architecture follows modern C++ guidelines
to facilitate the handling of the library and enforce its safe usage (e.g.: no memory
management is required from the users and no leak can happen).

libcrn includes low-level image processing (expandable pixel formats, binarization,
PDE...), document image specific tools (connected components extraction, recursive block
description...), maths helpers (algebra, data analysis, geometry, signal processing...)
and pattern analysis algorithms (classification, clustering...).

We described a short code example that provides OCR capabilities in only 30 lines. This
illustrated the use of feature extractors, segmentation providers and classification
tools. Many other applications may be designed for research or industrial needs. libcrn
has been useful in several projects and is constantly being improved.

REFERENCES

[2] J. Sauvola, T. Seppänen, S. Haapakoski, and M. Pietikäinen, "Adaptive document
binarization," in International Conference on Document Analysis and Recognition (ICDAR),
vol. 1, Ulm, Germany, 1997, pp. 147–152.

[3] N. Otsu, "A threshold selection method from gray-level histograms," Automatica,
vol. 11, no. 285-296, pp. 23–27, 1975.

[4] F. LeBourgeois, "Content based image retrieval using gradient color fields," in
International Conference on Pattern Recognition (ICPR), Barcelona, Spain, 2000,
pp. 1027–1030.

[5] L. He, Y. Chao, K. Suzuki, and K. Wu, "Fast connected-component labeling," Pattern
Recognition, vol. 42, no. 9, pp. 1977–1987, 2009. [Online]. Available:
http://www.sciencedirect.com/science/article/pii/S0031320308004573

[6] E. Nelson, Tensor analysis. Princeton University Press, 1967.

[7] J. G. Simmonds, A brief on tensor analysis. Springer-Verlag, 1994.

[8] T. Schultz, J. Weickert, and H.-P. Seidel, "A higher-order structure tensor,"
July 2007.

[9] S. D. Zenzo, "A note on the gradient of a multi-image," Computer Vision, Graphics,
and Image Processing, vol. 33, pp. 116–125, 1986.

[10] T. Schultz and G. Kindlmann, "A maximum enhancing higher-order tensor glyph," in
Eurographics/IEEE-VGTC Symposium on Visualization, vol. 29, no. 3, 2010.

[11] J. Rabin, J. Delon, and Y. Gou, "Circular earth mover's distance for the comparison
of local features," in Pattern Recognition, 2008. ICPR 2008. 19th International
Conference on. IEEE, 2008, pp. 1–4.

[12] H.-S. Park and C.-H. Jun, "A simple and fast algorithm for k-medoids clustering,"
Expert Systems with Applications, vol. 36, no. 2, Part 2, pp. 3336–3341, 2009. [Online].
Available: http://www.sciencedirect.com/science/article/pii/S095741740800081X

[13] H.-P. Kriegel, P. Kröger, E. Schubert, and A. Zimek, "LoOP: Local outlier
probabilities," in Proceedings of the 18th ACM Conference on Information and Knowledge
Management, ser. CIKM '09. New York, NY, USA: ACM, 2009, pp. 1649–1652. [Online].
Available: http://doi.acm.org/10.1145/1645953.1646195

[14] A. Y. Ng, M. I. Jordan, and Y. Weiss, "On spectral clustering: Analysis and an
algorithm," in Advances in Neural Information Processing Systems. MIT Press, 2001,
pp. 849–856.
// 1. Feature extractor
auto featureExtractor = crn::FeatureSet{};
featureExtractor.PushBack(std::make_shared<crn::FeatureExtractorProfile>
  (crn::Direction::LEFT | crn::Direction::RIGHT | crn::Direction::TOP |
   crn::Direction::BOTTOM, 10, 100));
featureExtractor.PushBack(std::make_shared<crn::FeatureExtractorProjection>
  (crn::Orientation::HORIZONTAL | crn::Orientation::VERTICAL, 10, 100));
// 2. Database creation
auto database = std::vector<crn::SObject>{};
for (auto c = 'A'; c <= 'Z'; ++c)
{
  const auto charFileName = "data"_p / c + ".png"_p;
  auto charblock = crn::Block::New(crn::NewImageFromFile(charFileName));
  database.push_back(featureExtractor.Extract(*charblock));
}
// 3. Line segmentation
auto pageblock = crn::Block::New(crn::NewImageFromFile(imageFileName));
crn::BlockTreeExtractorTextLinesFromProjection{U"Lines"}.Extract(*pageblock);
// 4. Character segmentation
const auto sw = crn::StrokesWidth(*pageblock->GetGray());
auto s = crn::String{};
for (auto nline = size_t{0}; nline < pageblock->GetNbChildren(U"Lines"); ++nline)
{
  auto line = pageblock->GetChild(U"Lines", nline);
  line->ExtractCC(U"Characters");
  line->FilterMinOr(U"Characters", sw, sw);
  line->SortTree(U"Characters", crn::Direction::LEFT);
  for (auto nchar = size_t{0}; nchar < line->GetNbChildren(U"Characters");
       ++nchar)
  {
    auto character = line->GetChild(U"Characters", nchar);
    // 5. Recognition
    auto features = featureExtractor.Extract(*character);
    auto res = crn::BasicClassify::NearestNeighbor(features,
      database.begin(), database.end());
    s += char32_t(U'A' + res.class_id);
  }
  s += U'\n';
}
CRNVerbose(s); // display the result
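To make steps 1 and 5 of the listing more concrete without relying on libcrn, here is a minimal sketch in plain C++ of a projection-based feature vector followed by a brute-force nearest-neighbour search. The Bitmap type and the functions projectionFeatures and nearestNeighbor are hypothetical names introduced for illustration and are not libcrn classes. The sketch also assumes that all images are rectangular and share the same size, since no resampling of the projections is performed.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// Hypothetical binary image: rows of pixels, true = black (ink).
// All rows are assumed to have the same length.
using Bitmap = std::vector<std::vector<bool>>;

// Concatenates the horizontal projection (black pixels per row)
// and the vertical projection (black pixels per column).
std::vector<double> projectionFeatures(const Bitmap &img)
{
    const std::size_t h = img.size();
    const std::size_t w = h ? img[0].size() : 0;
    std::vector<double> feat(h + w, 0.0);
    for (std::size_t y = 0; y < h; ++y)
        for (std::size_t x = 0; x < w; ++x)
            if (img[y][x])
            {
                feat[y] += 1.0;     // row count
                feat[h + x] += 1.0; // column count
            }
    return feat;
}

// Returns the index of the database entry closest to query, using the
// squared Euclidean distance. The database must be non-empty and all
// vectors must have the same length as query.
std::size_t nearestNeighbor(const std::vector<double> &query,
                            const std::vector<std::vector<double>> &database)
{
    std::size_t best = 0;
    double bestDist = std::numeric_limits<double>::max();
    for (std::size_t i = 0; i < database.size(); ++i)
    {
        double d = 0.0;
        for (std::size_t k = 0; k < query.size(); ++k)
        {
            const double diff = query[k] - database[i][k];
            d += diff * diff;
        }
        if (d < bestDist)
        {
            bestDist = d;
            best = i;
        }
    }
    return best;
}
```

If the database entries are stored in alphabetical order, the returned index plays the role of the class number in the listing: adding it to 'A' yields the recognized letter.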
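The noise filtering and ordering of step 4 can also be sketched independently of libcrn. The example below works on plain bounding boxes: it discards components that are smaller than a minimum size and sorts the remaining ones from left to right. The Box structure and the filterAndSort function are hypothetical. Rejecting a box when its width or its height is below the limit is an assumption based on the name FilterMinOr, not a documented behaviour.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical bounding box of a connected component, in pixels.
struct Box
{
    int left, top, width, height;
};

// Drops boxes whose width or height is below minSize (assumed criterion),
// then orders the survivors by their left edge, i.e. in reading order
// for a single left-to-right text line.
std::vector<Box> filterAndSort(std::vector<Box> boxes, int minSize)
{
    boxes.erase(std::remove_if(boxes.begin(), boxes.end(),
                               [minSize](const Box &b)
                               { return b.width < minSize || b.height < minSize; }),
                boxes.end());
    std::sort(boxes.begin(), boxes.end(),
              [](const Box &a, const Box &b) { return a.left < b.left; });
    return boxes;
}
```

In the OCR example the limit would be the estimated mean stroke width, so that specks thinner than a pen stroke are not mistaken for characters.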