JBIG2 Compression of Monochrome Images With OCR
Masaryk University
Faculty of Informatics
JBIG2 Compression of Monochrome Images with OCR
Diploma Thesis
Brno, 2012
Declaration
Hereby I declare that this thesis is my original work, which I have
produced on my own. All sources, references, and literature used or
excerpted during its preparation are properly cited and listed with
complete reference to their source.
Acknowledgement
I would like to thank my supervisor, doc. RNDr. Petr Sojka, Ph.D., for
his guidance and for providing references to publications relevant to
the topics discussed in this thesis. I would also like to thank my
friends Ján Vorčák and Lukáš Lacko for reading the thesis and sharing
their opinions. Many thanks belong to Tomáš Márton for his quick
guidance on parallelization in C++.
Abstract
The aim of this diploma thesis is to design and implement a solution
that improves the compression ratio of the open-source jbig2enc
encoder. The improvement is achieved by adding support for an OCR
engine. To put the solution in context, relevant tools working with
the JBIG2 standard and OCR tools are introduced. An enhanced jbig2enc
encoder that uses the Tesseract OCR engine to obtain text recognition
results is then presented. The new version of jbig2enc is evaluated on
data from digital libraries, and its integration into two such
libraries, DML-CZ and EuDML, is described.
Keywords
OCR, image preprocessing, JBIG2, compression, compression ratio,
scanned image, DML, speed improvement, bitonal images, Tesseract,
Leptonica, jbig2enc, DML-CZ, EuDML.
Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
2 JBIG2 and Known Tools . . . . . . . . . . . . . . . . . . . . . 5
2.1 JBIG2 Principles . . . . . . . . . . . . . . . . . . . . . . . 5
2.2 Jbig2dec . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.3 Jbig2enc . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Jbig2enc Improvement . . . . . . . . . . . . . . . 7
2.4 PdfJbIm . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.5 JPedal JBIG2 Image Decoder . . . . . . . . . . . . . . . . 9
2.6 Jbig2-imageio . . . . . . . . . . . . . . . . . . . . . . . . 10
2.7 PdfCompressor . . . . . . . . . . . . . . . . . . . . . . . 11
2.8 Summary of Tools . . . . . . . . . . . . . . . . . . . . . . 11
3 OCR Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1 ABBYY FineReader . . . . . . . . . . . . . . . . . . . . . 14
3.2 InftyReader . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.3 PrimeOCR . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Tesseract OCR . . . . . . . . . . . . . . . . . . . . . . . . 16
3.5 GOCR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.6 OCRopus . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.7 Summary of OCR Tools . . . . . . . . . . . . . . . . . . 18
4 Enhancement of Jbig2enc Using an OCR . . . . . . . . . . . 20
4.1 Performance Issues and Their Solutions . . . . . . 20
4.1.1 Hash Functions . . . . . . . . . . . . . . . . . . . 21
4.1.2 OCR Recognition Run in Parallel and Its Limitations . . . . . 23
4.1.3 Summary of Achieved Speed Improvement . . . 24
4.2 Interface for Calling OCR . . . . . . . . . . . . . . . . . 26
4.3 Similarity Function . . . . . . . . . . . . . . . . . . . . . 28
4.4 Using Tesseract as OCR Engine . . . . . . . . . . . . . . 30
4.5 Jbig2enc Workflow Using Created OCR API . . . . . . . 31
5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
5.1 Comparison with Previous Version of Improved Jbig2enc 33
5.2 Evaluation on Data from EuDML . . . . . . . . . . . . . 36
6 Usage in DML . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.1 DML-CZ . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
6.2 EuDML . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
A CD Content . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
B Manual for Running Jbig2enc Improvement . . . . . . . . . 52
B.1 Jbig2enc Command-line Arguments Enabling Created Improvements . . . 52
1 Introduction
More and more information is available in electronic form, and it all
needs to be stored. There are two ways to obtain enough storage space
for the data: acquiring additional storage, or compressing the data.

Compression is not useful only for storage. It can also decrease the
bandwidth needed to transmit data, or the time needed to access and
load data from disk into memory. Processor operations take
nanoseconds, whereas accessing data on disk takes milliseconds.
Therefore, if decompression takes less time than the difference
between the access times of the compressed and the original document,
the total time to access and load the document is reduced.
Digital libraries are a good example of collections providing large
volumes of data. A well-designed digital library (DL) needs to tackle
scalability, persistence, transfer size and speed, linking of related
data, format migration, and more. Transfer size and speed can be
greatly improved by good compression mechanisms.
A vast number of compression methods exists, each with its advantages
and disadvantages. Most are better suited to a specific type of data
than others; different compression methods are used for images, video,
text, and so on.
JBIG2 [1, 2] is a standard for compressing bitonal images, i.e. images
consisting of only two colors. Most scanned documents are composed of
such images. The JBIG2 compression method achieves excellent results
in both lossless and lossy modes. We work with a special kind of lossy
compression called perceptually lossless compression: a lossy method
whose output shows no visible loss.

JBIG2 uses several basic principles to improve the compression ratio.
Specialized compression methods are used for different types of
regions; a different method is used for images than for text. By
recognizing the concrete type of data and using a compression
mechanism specific to it, a greater compression ratio is achieved.
We focus on the method for compressing text regions.
2 JBIG2 and Known Tools
JBIG2 is a standard for compressing bitonal images, developed by the
Joint Bi-level Image Experts Group. Bitonal images consist of only two
colors (usually black and white); the typical place where such images
occur is scanned text. JBIG2 was published in 2000 as the
international standard ITU T.88 [2] and one year later as ISO/IEC
14492 [1].

Section 2.1 introduces the JBIG2 standard and its basic principles.
The following sections describe various tools, both open-source and
commercial, that work with the JBIG2 standard.
all the data. Each occurrence of a symbol points to its
representative, together with information recording its position in
the document.

JBIG2 uses modified versions of adaptive arithmetic and Huffman
coding. Huffman coding is used mostly by fax machines because of its
lower computational demands, but arithmetic coding gives slightly
better results.

JBIG2 supports multi-page compression for symbol coding (the coding of
text regions). Any symbol frequently used on more than one page is
stored in a global dictionary. Such symbols are stored only once per
several pages, reducing the space needed to store the document even
further. For more information, see the thesis JBIG2 Compression [4].
The JBIG2Decode filter has been part of PDF since version 1.4 (2001,
Acrobat 5; see the 3rd edition of the PDF Reference [5, pages 80–84]).
It allows images compressed according to the JBIG2 standard to be
stored inside PDF documents, which spreads the JBIG2 standard far and
wide without placing any burden on end users: they are not forced to
install any specific decoder to read PDFs containing JBIG2-encoded
images. In the worst case, a user would merely need to upgrade their
PDF reader to a version fully supporting PDF 1.4 or newer.

When JBIG2 images are stored in a PDF, headers and some other data are
discarded. The discarded information is instead stored in a PDF
dictionary associated with the image object stream. A PDF dictionary
is a specific PDF object for holding metadata; it has the form of an
associative table containing pairs of objects (key and value). For
more information see [5].
2.2 Jbig2dec
Jbig2dec [6] is an open-source decoder for the JBIG2 image format,
developed by the Ghostscript developers. It can be redistributed or
modified under the terms of the GNU General Public License, version 2
or newer2. Despite not being a complete implementation, it is
maintained to work with the available encoders and is thus able to
decode most of the documents in circulation.
2. http://www.gnu.org/copyleft/gpl.html
2.3 Jbig2enc
Jbig2enc [7, 8] is an open-source encoder written in C/C++ by Adam
Langley with the support of Google. It is developed under the Apache
License, Version 2.03 .
The jbig2enc encoder uses the open-source Leptonica library [9],
developed by Dan Bloomberg and published under a Creative Commons
Attribution 3.0 United States License4 . Leptonica is used for
manipulating images: for example, it handles page segmentation into
regions containing text, images, and other data; segmentation into
separate symbols (connected components); logical operations at the
binary level; and skewing or rotating an image. The library is also
used by other programs, such as the Tesseract OCR engine (for more
information see Section 3.4).
Halftone coding is not supported by the jbig2enc encoder; instead,
generic coding is used for halftone images. The encoder supports
creating output in a format suitable for embedding in a PDF document,
which is very useful for tools that optimize PDF documents using the
JBIG2 standard.

According to the JBIG2 standard, either Huffman or arithmetic coding
can be used for symbol coding, but the jbig2enc encoder supports only
arithmetic coding.
The jbig2enc encoder can create output that is easy to put into a PDF
document: one file per image, plus one file for the global symbol
dictionary. These correspond exactly to the PDF image objects and the
global dictionary object that are placed directly into the PDF. Thus,
when embedding an image encoded with jbig2enc into a PDF, it is only
necessary to fill a PDF dictionary with metadata about the image,
mainly its dimensions.
3. http://www.apache.org/licenses/LICENSE-2.0
4. http://creativecommons.org/licenses/by/3.0/us/legalcode
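The PDF dictionary mentioned above might look like the following fragment. The object numbers, dimensions, and stream length are hypothetical; the `/Filter /JBIG2Decode` entry and the `/JBIG2Globals` reference to the shared dictionary stream follow the PDF Reference [5]:

```
10 0 obj
<< /Type /XObject
   /Subtype /Image
   /Width 2550 /Height 3300
   /ColorSpace /DeviceGray
   /BitsPerComponent 1
   /Filter /JBIG2Decode
   /DecodeParms << /JBIG2Globals 11 0 R >>
   /Length 4096 >>
stream
... per-page output of jbig2enc ...
endstream
endobj
```

Object 11 would hold the global symbol dictionary produced once for the whole document.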
symbols (which were found equivalent). It has more or less the same
effect as choosing a random one: the image quality remains mostly the
same.
2.4 PdfJbIm
PdfJbIm [10] is a PDF enhancer that optimizes the size of bitonal
images inside PDF documents. To do so, it exploits the JBIG2 standard
and the open-source jbig2enc encoder (see Section 2.3).

PdfJbIm expects as input a PDF document containing images of scanned
text. These images are rendered from the PDF document and encoded
using the jbig2enc encoder with symbol coding enabled for the text
areas. It uses the perceptually lossless (visually lossless)
compression of the jbig2enc encoder, which is the most suitable coding
for this kind of data. Flyspecks appear in all scanned texts, so two
visually equivalent symbols are rarely identical pixel for pixel. With
perceptually lossless compression, a great improvement in both the
quality of the compressed image and the compression ratio can be
achieved.
Figure 2.2 shows the basic steps pdfJbIm performs when optimizing PDF
documents with the jbig2enc encoder.
In the bachelor thesis [3], the tool pdfJbIm is called a PDF
re-compressor. Since then, several improvements and bug fixes have
been added. The main improvement is support for multi-layer PDF
documents. Options making the workflow more customizable have also
been added, along with support for running the new version of the
jbig2enc encoder. That new version is developed as part of this thesis
and adds support for using OCR in the image compression process (see
Section 4).
[Figure 2.2: PdfJbIm workflow]
2.6 Jbig2-imageio
Jbig2-imageio [12] is a plugin providing access to images encoded
according to the JBIG2 standard. It is a pure Java implementation that
does not require JNI (the Java Native Interface) and is developed
under the GNU GPLv3 with the support of Levigo.

The plugin is used through the Java ImageIO API, the standard Java API
for manipulating images. Since it plugs into ImageIO, users do not
need to change their code; they only add the plugin as a dependency.
When a JBIG2 image is decoded, its type is recognized and the decoder
is selected automatically.
2.7 PdfCompressor
PdfCompressor [13] is a commercial tool developed by CVISION
Technologies, Inc. It makes PDF documents fully searchable using fast
and accurate OCR, and highly optimized using modern compression
methods. It lets the user choose between accuracy and speed; higher
accuracy is achieved mostly by running additional methods that further
recognize problematic parts of the document.

For compressing images in a PDF document, it exploits the JBIG2 and
JPEG2000 image compression formats: JBIG2 for bitonal images and
JPEG2000 for colored ones. Black-and-white scans are usually
compressed by a factor of 5–10× compared to TIFF G4, and colored scans
by a factor of 10–100× compared to JPEG [13].

High compression and OCR speed is achieved through heavy
multi-threading optimization, so the tool can process several
documents in a matter of seconds.

PdfCompressor supports not only compression and OCR but also PDF
encryption, web optimization, Bates stamping, and other features.

A demo version of the tool can be downloaded from the project's home
page, where an online version of PdfCompressor is also available to
try out.
3 OCR Tools
As already described, JBIG2 compression of a text region is based on
segmenting the image into tokens (symbols), and for each distinct
token a representative is chosen. The main problem is that scanned
text contains noise and flyspecks, which make visually equivalent
symbols appear different to a computer even though they look the same
to the human eye. The problem is similar to OCR (optical character
recognition), where each symbol must be recognized so that it becomes
readable by a computer.

The process of recognizing symbols with an OCR engine is very similar
to identifying representative symbols occurring repeatedly in the
image. An OCR engine has one disadvantage compared to a JBIG2 encoder:
for each symbol in the image it must decide on a computer
representation, even when it is uncertain. A JBIG2 encoder, in
contrast, can choose whether a symbol should point to an existing
representative or whether a new representative should be created.
OCR (optical character recognition) is a technology that enables a
machine to translate images into a text format that is easily
searchable and editable [14].
There is an enormous number of OCR engines and OCR software packages;
in this chapter we introduce only a few of them, focusing mainly on
OCR engines that are widely used or interesting in some way. There are
also many so-called converters, which contain an OCR engine as one of
their components; PdfCompressor, introduced in Section 2.7, is one
example: it simply converts a PDF document into a searchable one.
Describing such converters is out of the scope of this thesis, so they
are not covered further.
3.2 InftyReader
InftyReader [19] is commercial OCR software for recognizing scientific
documents, including mathematical formulae. It creates output in
various file formats, including LATEX, MathML, XHTML, IML, and
HRTEX2 . It is developed in the laboratory of M. Suzuki, Faculty of
Mathematics,
3.3 PrimeOCR
PrimeOCR [21] is a commercial Windows OCR engine that reduces OCR
error rates by implementing “voting” OCR technology; the vendor claims
it reduces the errors made by a standard OCR engine by up to 65–80%.
PrimeOCR can use multiple OCR engines: if low image quality is
detected, more than one engine is used to improve the OCR results.

It is not necessary to identify the language of the document in
advance, as PrimeOCR can identify it automatically, although
processing is significantly faster if the language is specified.
PrimeOCR recognizes one dominant language per page but also allows a
secondary language, for which English is usually used.

PrimeOCR is intended mainly for production systems rather than
individual users, and its licensing prices correspond to that. For
developers, an SDK is offered with an API accessible via a DLL
library. A detailed usage manual [22] is available on the product
page, together with several usage examples.
1985 and 1995. In the following decade, most of its development
stopped. In 2005 it was released as open source by Hewlett-Packard and
UNLV3 . Since 2006, Tesseract has been sponsored by Google and is
released under the Apache License, Version 2.0. In 2005 Tesseract was
evaluated by UNLV as one of the top three OCR engines in terms of
character accuracy [23].

It uses the open-source Leptonica library for manipulating images;
images are internally stored using Leptonica structures.

Tesseract itself does not come with any graphical interface and is run
from the command line, but several projects provide a GUI for it. A
few of them:

• FreeOCR – a Windows Tesseract GUI.
• GImageReader – a GTK GUI front-end for Tesseract that supports
selecting columns and parts of a document. It can process PDF
documents and even spell-check the output.
• OcrGUI – a Linux GUI written in C using GLib and GTK+ that supports
Tesseract and GOCR. It includes spell checking using the open-source
spell checker Hunspell [24, 25].
3.5 GOCR
GOCR [26] is an OCR program developed under the GNU Public License. It
is also known under the name JOCR, which was created when GOCR was
registered at SourceForge: the name gocr was already taken, so a new
name was needed.

GOCR can be used with different front-ends, which makes it very easy
to port to different operating systems and architectures. It can open
many image formats, and its quality is being improved on a daily basis
[26].
3.6 OCRopus
OCRopus [27] is a free document analysis and optical character recog-
nition system being developed in Python and released under Apache
4 Enhancement of Jbig2enc Using an OCR
The main idea behind JBIG2's processing of text regions is the
segmentation of text regions into components, where each component
mostly corresponds to exactly one symbol. This is very similar to OCR
processing, which segments an image into words and symbols and then
recognizes them.
Our goal is to improve the jbig2enc encoder using a good OCR engine.
The improvement should not put any constraints on users and developers
using the current version of the jbig2enc encoder. Therefore, as the
default OCR engine we require an open-source product licensed under a
license compatible with the jbig2enc encoder and its OCR API, and
implementing the designed API for it should be straightforward.

Tesseract is developed under the Apache License 2.0, which is
compatible for the purposes of enhancing the jbig2enc encoder. It
supports a wide range of languages and, as a bonus, uses the same
library as jbig2enc for holding image data structures. This makes it
an ideal candidate for integration with the encoder.
We create an API for using an OCR engine that is as independent of the
concrete OCR engine as possible. The engine is relatively easy to
replace, although small modifications of the code are still required.

Before describing the created API, we address the performance issues
and their solutions.
There are significant differences between using and not using OCR. We
have therefore designed and implemented two different hash functions,
where the hash function using OCR is essentially an extension of the
standard hash function. To be precise, the OCR-aware hash function is
not only a hash function: it also stores the results retrieved from
the OCR engine's symbol recognition, in order to prevent multiple runs
of OCR recognition for one symbol (representative).
The hash is computed from the dimensions of the representative
(symbol) and the number of holes found in the symbol. For example, the
symbol 'B' has three holes (two inside the symbol and one represented
by the outer
[Figure 4.2: Comparison of compression times for different versions of improved jbig2enc and different numbers of images processed at once; x-axis: number of images, y-axis: compression time in seconds]
3. For further information about Tesseract OCR engine see Section 3.4
5 Evaluation
[Figure 5.1: Number of different (representative) symbols; x-axis: number of pages]
2. Version without the improvement made in [3] and without the improvement
using OCR.
jbig2enc encoder using Tesseract as the OCR engine. The size of the
second image is one third of the first, and there are no visible
differences.

If we look more closely, however, the images are not identical in
every pixel. The difference between the two images is shown in
Figure 5.4. We can see that the improved version of jbig2enc greatly
reduces flyspecks and dirt while the quality of the output image is
not decreased.
[Figure 5.2: Compression results of jbig2enc and its improved versions; x-axis: number of pages]
3. The PDF documents are not provided on the attached CD because the
data are provided for EuDML-internal use only.
6 Usage in DML
Digital mathematical libraries (DMLs) contain a large volume of
documents with scanned text (more than 80% of EuDML is scanned),
mostly created by scanning older papers that were written and
published earlier and whose digital versions are already lost.
Documents created this way are referred to as retro-born digital
documents.

Research in mathematics is greatly influenced by older articles and
papers: new discoveries often build on older ones. To make research
more comfortable, users require easy access to these kinds of
documents, so DMLs need to provide documents that are easy both to
find and to access.
6.1 DML-CZ
The project DML-CZ (Czech Digital Mathematical Library), funded from
2005 to 2009, was developed to digitize mathematical texts published
in the Czech Republic. It comprises periodicals, selected monographs,
and conference proceedings from the nineteenth century up to the most
recently produced mathematical publications. It is available at
dml.cz, serving almost 30,000 articles on 300,000 pages to the public.
page of the PDF document. This way a basic PDF document is created.
The PDF document is then optimized, and as a final step it is
digitally signed.

The Perl script invokes the tools directly from the command line, so
the integration is simple. PdfJbIm, together with the improved
jbig2enc that uses an OCR engine to improve the compression ratio, is
integrated as one of the tools used for optimizing the PDF.
6.2 EuDML
The project European Digital Mathematical Library (EuDML) [32] creates
an infrastructure system that integrates the mathematical content
available online throughout Europe, allowing both extensive and
specialized mathematical discovery. It is a three-year project funded
by the European Commission, started in February 2010.

The primary aim of the EuDML project is to create an infrastructure
and a uniform interface for providing the mathematical content
available throughout Europe. Documents are gathered from content
providers together with their metadata, which need fulfill only
minimal criteria. Documents and metadata are then processed internally
to provide enhanced information for browsing and searching.

The EuDML system maps the metadata provided by content providers to
its internal structure. OAI-PMH1 is the primary means of content
harvesting, but other methods will be considered throughout the
system's lifetime.

The primary goal of the EuDML system is not to provide innovative
tools, but rather to integrate existing tools and maximize the
accessibility of mathematical content for users.
easily change existing services or add new ones. Many tools in the
EuDML system are so-called processing nodes, which can be chained into
so-called processes. The initial node in a process typically generates
or otherwise obtains chunks of data, which are then consecutively
processed by the following nodes. A processing node basically enhances
the chunks of data it receives on its input and sends them further,
with possible side effects such as indexing the contents of a chunk.
The final node in a process typically stores the enhanced chunks of
data in a storage or discards them.

A framework written in Java orchestrates the flow between processing
nodes. The author of an individual tool therefore just needs to
implement a processing node with well-defined inputs and outputs; the
desired flow between processing nodes is defined in an XML
configuration file.

Several tools used in the EuDML system are not written in Java, and
reimplementing them would be too expensive. They are therefore either
provided with a remote interface, with communication handled using
REST or SOAP, or provided as binaries that are executed through the
Java runtime environment, which allows code outside the JVM (Java
Virtual Machine) to be run directly from Java code. This brings
additional computation requirements, but compared to opening and
processing a PDF document they are negligible.
To integrate PdfJbIm2 , and therefore the improved jbig2enc, into the
EuDML system, it is necessary to create a processing node and to
configure a process workflow that receives a PDF document at the start
and returns its compressed version at the end. PdfJbIm is written in
Java, so its integration is relatively straightforward and mostly
corresponds to the main method used when PdfJbIm is run from the
command line.

The processing node which handles the integration of PdfJbIm is
represented by the class PdfCompressorProcessingNode. At the start, it
takes an EnhancerProcessMessage, which provides information about the
stored PDF document. It retrieves the PDF document as a byte array
using the provided class EnhancementUtils. The ImageExtractor class of
pdfJbIm
7 Conclusion
The main goal of this thesis was to create a uniform interface for
using an OCR engine to improve the compression ratio of the jbig2enc
encoder. The API was successfully created, and a solution using
Tesseract as the OCR engine was implemented. We have shown the results
achieved on data from digital mathematical libraries.
As part of the jbig2enc encoder improvement, solutions decreasing the
computation time were created. A hash function and methods benefiting
from the new approach to holding data were presented, along with the
achieved speed improvement, the computational requirements of OCR, and
the influence of running OCR in parallel.
By implementing support for OCR, better possibilities for choosing a
more adequate representative symbol were created. This improves not
only the compression ratio but also the quality of the output image.
The created similarity function requires further testing before the
tool is put into a real working environment. When using OCR, certain
errors are acceptable, but the compression must be resistant to them.
This requires more complex testing in order to achieve both the
maximum compression ratio and a zero error rate.

To further improve the compression ratio of the Tesseract module, a
new global dictionary containing all commonly used symbols should be
created.
For a further improvement of computational performance, other parts of
the jbig2enc encoder could be parallelized.

The achieved results imply that we have come closer to the ideal
compression ratio, but there is still a long way to go. The way can be
shortened by implementing a module for an OCR engine with greater
accuracy, or by training the OCR engine on the data later processed by
the encoder.
We have shown how to integrate the new version of the jbig2enc encoder
into two digital mathematical libraries. The Czech Digital
Mathematical Library has already been using an older version of the
jbig2enc encoder (developed as part of the bachelor thesis [3]) for
more than a year, which shows that such a tool is useful and
beneficial in a real environment.
Bibliography
[1] JBIG Committee. 14492 FCD. ISO/IEC JTC 1/SC 29/WG 1, 1999.
http://www.jpeg.org/public/fcd14492.pdf.
[30] Barbara Chapman, Gabriele Jost, and Ruud van der Pas. Using
OpenMP: Portable Shared Memory Parallel Programming. Massachusetts
Institute of Technology, 2007.
[33] Petr Sojka and Radim Hatlapatka. PDF Enhancements Tools for
a Digital Library: pdfJbIm and pdfsign. pages 45–55, Brno, Czech
Republic, 2010. Masaryk University.
List of Figures
2.1 Example of two originally different symbols recognized
as equivalent 8
2.2 PdfJbIm workflow 10
4.1 Formula for counting hash without OCR results 22
4.2 Comparison of compression times for different versions
of improved jbig2enc and different amount of images
processed at once 26
4.3 Class diagram of jbig2enc API for using OCR 28
4.4 Problematic versions of the same letter ’e’ 29
4.5 Class diagram of Tesseract module implementing
jbig2enc API for using OCR 30
5.1 Number of different symbols (representative
symbols) 35
5.2 Compression results of jbig2enc and its improved
versions 37
5.3 Image before and after compression according to JBIG2
standard 38
5.4 Difference between original image and image compressed
with a jbig2enc encoder using OCR 38
List of Tables
2.1 Summary of tools working with standard JBIG2 12
3.1 Summary of OCR tools 19
4.1 Comparison of computational times based on speed
improvements of improved jbig2enc (computation time is
in seconds) 25
5.1 Number of different symbols (representative
symbols) 34
5.2 Results of an enhanced jbig2enc encoder 36
A CD Content
• Jbig2enc and its improved version
• Test data
B Manual for Running Jbig2enc Improvement