Wang OoRexx and PDFBox
Vienna
An Introduction to Apache PDFBox Library: Nutshell Examples
Table of contents
Abstract
Table of figures
1 Introduction
2 Background
2.1 Portable Document Format
2.2 Apache PDFBox
2.3 Java
2.4 Open Object Rexx
2.5 Bean Scripting Framework for ooRexx
3 Installation
3.1 Java
3.2 ooRexx
3.3 BSF4ooRexx
3.4 Apache PDFBox
4 Nutshell Examples
4.1 Creating a PDF Document
4.2 Adding Text to an Existing PDF Document
4.3 Use Different Fonts and Colors
4.4 Extracting Text from an Existing PDF Document
4.5 Drawing some Shapes
4.6 Creating a Table with Content
4.7 Converting a PDF Document To Image
4.8 Inserting Image to a PDF Document
4.9 Adding Multiple Pages to a PDF Document
4.10 Splitting a PDF Document with Multiple Pages
4.11 Merging Multiple PDF Documents
4.12 Setting the Document Metadata
4.13 Adding Watermark to a Document
4.14 Encrypting a PDF Document
4.15 Creating a PDF-A Document
4.16 Validating a PDF-A Document
4.17 Creating a Digital Signature
4.18 Verifying a Digital Signature
5 Conclusion
6 References
Appendix
Abstract
This thesis presents a collection of 18 nutshell examples demonstrating the use of
Apache PDFBox, a Java library for creating and manipulating PDF documents. The
examples are implemented using ooRexx, a high-level object-oriented scripting
language, and BSF4ooRexx, a bridge between ooRexx and Java. The thesis provides
an overview of the PDF format and the features of PDFBox, followed by the
implementation of the examples, which cover a range of functionality such as
creating a new PDF document, adding text, images, and annotations, manipulating
existing PDF documents, and extracting data from them. The examples are designed
to be concise and easy to follow, allowing users to quickly understand how to use
PDFBox to accomplish various tasks related to PDF document creation and
manipulation.
Table of figures
Figure 1: “01. Creating a PDF Document.rex”
Figure 2: Output of “01. Creating a PDF Document.rex”
Figure 3: “02. Adding Text to an Existing PDF Document.rex”
Figure 4: Warning for the overwrite mode
Figure 5: Output of “02. Adding Text to an Existing PDF Document.rex”
Figure 6: “03. Use Different Fonts and Colors.rex”
Figure 7: Output of “03. Use Different Fonts and Colors.rex”
Figure 8: “04. Extracting Text from an Existing PDF Document.rex”
Figure 9: Output of “04. Extracting Text from an Existing PDF Document.rex”
Figure 10: “05. Drawing some Shapes.rex”
Figure 11: Output of “05. Drawing some Shapes.rex”
Figure 12: “06. Creating a Table with Content.rex”
Figure 13: Output of “06. Creating a Table with Content.rex”
Figure 14: “07. Converting a PDF Document To Image.rex”
Figure 15: Output of “07. Converting a PDF Document To Image.rex”
Figure 16: “08. Inserting Image to a PDF Document.rex”
Figure 17: Output of “08. Inserting Image to a PDF Document.rex”
Figure 18: “09. Adding Multiple Pages to a PDF Document.rex”
Figure 19: Output of “09. Adding Multiple Pages to a PDF Document.rex”
Figure 20: “10. Splitting a PDF Document with Multiple Pages.rex”
Figure 21: Output of “10. Splitting a PDF Document with Multiple Pages.rex”
Figure 22: “11. Merging Multiple PDF Documents.rex”
Figure 23: Output of “11. Merging Multiple PDF Documents.rex”
Figure 24: “12. Setting the Document Metadata.rex”
Figure 25: Output of “12. Setting the Document Metadata.rex”
Figure 26: “13. Adding Watermark to a Document.rex”
Figure 27: Output of “13. Adding Watermark to a Document.rex”
Figure 28: “14. Encrypting a PDF Document.rex”
Figure 29: Output 1 of “14. Encrypting a PDF Document.rex”
Figure 30: Output 2 of “14. Encrypting a PDF Document.rex”
Figure 31: Output 3 of “14. Encrypting a PDF Document.rex”
Figure 32: “15. Creating a PDF-A Document.rex”
Figure 33: Output of “15. Creating a PDF-A Document.rex”
Figure 34: “16. Validating a PDF-A Document.rex”
Figure 35: Output of “16. Validating a PDF-A Document.rex”
Figure 36: Keystore creation
1 Introduction
In recent years, the use of PDF files has become ubiquitous in a wide range of
industries and applications. The Portable Document Format (PDF) provides a versatile
and reliable means of sharing and distributing documents across different devices and
platforms. However, working with PDF files can often be challenging, especially when
it comes to manipulating and customizing them to specific needs.
This bachelor's thesis explores the use of PDFBox, an open-source Java library, to
create and manipulate PDF documents. Specifically, we investigate how the
combination of PDFBox with the programming languages Java and ooRexx, along
with the BSF4ooRexx bridge, can be used to create powerful, flexible, and efficient
PDF manipulation scripts.
The first section introduces the topic, explaining the importance of PDF files and the
challenges associated with working with them. This section also outlines the goals and
objectives of the thesis.
The second section offers background information on PDFBox, Java, and ooRexx,
including their features and capabilities. This section also explains how these
components can be used together to create PDF manipulation scripts.
The third section provides a step-by-step installation guide for the necessary
components, including ooRexx, Java, and BSF4ooRexx. It aims to ensure
that readers can install the components and start working with PDFBox effectively.
The fourth section comprises 18 Nutshell examples that demonstrate the extensive
capabilities of PDFBox for creating and manipulating PDF documents. These examples
cover a range of practical applications, such as creating PDF documents, extracting
text and images from PDF files, and merging and splitting PDF documents.
Finally, the thesis concludes with a summary of the main findings and contributions
of the study. Overall, this bachelor’s thesis offers valuable insights into the potential
of PDFBox and ooRexx for creating and manipulating PDF documents. The Nutshell
examples provided demonstrate the power and flexibility of this platform and provide
a valuable resource for developers and users alike.
2 Background
Implementing the nutshell examples presented in this bachelor's thesis requires
several technologies and components. This chapter provides information about the
relevant components so that the implementation process is easier to follow.
2.1 Portable Document Format
The beauty of PDF lies in its ability to maintain document formatting and layout,
regardless of the platform or device used to open it. This means that PDFs can be
shared, viewed, and printed with ease, regardless of the operating system, software
program, or device used [2]. In addition, PDFs can be secured with passwords and
digital signatures, ensuring document confidentiality and authenticity [3].
PDF was created by Adobe Systems in the early 1990s, and its development was
driven by the need for a universal file format that could be used across different
computer systems. The idea for PDF came from John Warnock, the co-founder of
Adobe Systems. In 1991, Warnock wrote a paper called "The Camelot Project" that
proposed a new way of working with documents. The paper described a system that
would allow documents to be viewed and printed on any computer, regardless of the
software used to create them. The system would also allow documents to be stored
electronically and distributed easily [4]. Warnock's vision was to create a document
format that would be as reliable and consistent as paper, but more versatile and
portable. He believed that by creating a universal format for documents, it would be
possible to improve the way people worked with information, making it easier to
share and collaborate on documents.
The first version of PDF was released in 1993, and it quickly gained popularity among
businesses and organizations that needed a reliable way to share and distribute
documents. The format became especially popular in industries such as publishing,
where it was used to create electronic versions of books and other printed materials.
In 2008, Adobe Systems released the PDF 1.7 specification as an ISO standard,
making PDF an open standard that could be used by anyone [5]. This move helped
to further establish PDF as a universal document format, and it paved the way for
the development of a wide range of PDF-related software tools and applications.
Today, PDF is one of the most widely used document formats in the world, with
millions of PDF documents created and shared every day. It has become an essential
tool for businesses, governments, and individuals who need a reliable way to share
and exchange information. And with the rise of mobile devices, PDF has become
even more important, as it allows users to access and view documents on a wide
range of devices, including smartphones and tablets.
PDF has gone through several versions since its inception. Each version has
introduced new features and improvements that have made PDF an increasingly
powerful and versatile format for creating, sharing, and exchanging documents.
Below, we will explore the version history of PDF [6].
• PDF 1.0 - The first version of PDF was released by Adobe Systems in 1993. It
was based on PostScript, a page description language developed by Adobe.
PDF 1.0 introduced basic features such as the ability to embed fonts, images,
and other media into documents. It also allowed for documents to be viewed
and printed on any computer, regardless of the software used to create them.
• PDF 1.1 - This version, released in 1994, introduced password protection and
encryption, making PDF documents more secure. It also added support for
device-independent color.
• PDF 1.2 - Released in 1996, PDF 1.2 introduced support for interactive form
elements, such as text boxes and radio buttons. It also included support for
multimedia elements, such as audio and video.
• PDF 1.3 - This version, released in 2000, introduced support for digital
signatures. It also extended the support for color management, making it
easier to create accurate color representations in PDF documents, and added
further annotation types for comments and notes.
• PDF 1.4 - Released in 2001, PDF 1.4 introduced support for transparency,
allowing for more sophisticated graphics and designs. It also included support
for tagged PDF, which made PDF documents more accessible for users with
disabilities.
• PDF 1.5 - This version, released in 2003, introduced support for JPEG2000
compression, which made it possible to create smaller PDF files without
sacrificing image quality. It also introduced layers (optional content), which
allow users to control the visibility of different elements in a document.
• PDF 1.6 - Released in 2004, PDF 1.6 introduced support for embedded 3D
graphics. It also added AES encryption and allowed OpenType fonts to be
embedded directly.
• PDF 1.7 - The final version of PDF to be released by Adobe Systems, PDF 1.7
was introduced in 2006. It included a range of new features, such as support
for the Adobe XML architecture, enhanced support for digital signatures, and
improved handling of complex graphics and fonts.
• PDF 2.0 - In 2017, PDF 2.0 was released by the International Organization
for Standardization (ISO). This version introduced a range of new features,
including support for hybrid PDFs that can include both PDF and HTML
content, improved support for 3D graphics and annotations, and enhanced
security features.
PDF has gained popularity for various reasons. It offers a multitude of benefits over
other file formats, making it an ideal choice for sharing and exchanging documents.
One of the most significant advantages of PDF is its ability to preserve formatting.
Unlike other file formats, PDF documents retain their formatting and layout
regardless of the software or device used to view them. This means that the original
document's fonts, colors, images, and graphics are preserved, ensuring that the
document looks the same on any device. This is particularly important for documents
like contracts, where formatting is crucial [7]. PDF also offers robust security
features, such as password protection and encryption, that help safeguard sensitive
documents from unauthorized access or modification. These features ensure that
only authorized users can access or edit the document, making it an ideal choice for
confidential documents such as financial reports, legal contracts, or medical records
[8]. Another benefit of PDF is its smaller file size. PDF files can be compressed to
reduce their size without compromising the document's quality. This means that PDF
documents take up less storage space, making them easier to share, store, and
transfer over the internet. This is particularly important for large documents that
would otherwise be difficult to email or upload to a website [9]. PDF documents are
also searchable, making it easy to find specific information within a document
quickly. This feature is especially useful for large documents like textbooks or
research papers. The search function allows users to find information quickly without
having to read through the entire document [10]. PDF is also cross-platform
compatible, meaning that it can be viewed and printed on any device or operating
system. This makes it easy to share documents with others, regardless of their
device or software. PDF files are also easy to create, edit, and share using various
software applications, such as Adobe Acrobat, Microsoft Word, or Google Docs.
2.2 Apache PDFBox
The Apache PDFBox library is made up of four main components: PDFBox, FontBox,
XMPBox, and Preflight [12].
• PDFBox is the core component of the library and provides the main
functionality for working with PDF files, including parsing, creation, and
manipulation of PDF documents. It allows developers to extract text, images,
and other content from PDF files, as well as add, remove, or modify elements
within a PDF document.
• FontBox is a component of PDFBox that provides support for font
manipulation. It allows developers to extract information about the fonts used
in a PDF document, as well as embed or subset fonts to reduce file size and
ensure that the document is displayed correctly.
• XMPBox is a component of PDFBox that provides support for Extensible
Metadata Platform (XMP) metadata. It allows developers to extract, modify,
and add XMP metadata to PDF documents, which can be used to provide
additional information about the document, such as author, date, and
copyright information.
• Preflight is a component of PDFBox that provides preflighting functionality. It
allows developers to check whether a PDF document conforms to certain
standards, such as the PDF/A or PDF/X standards and provides detailed
reports of any issues that are found.
One of the most significant advantages of PDFBox is that it is free to use: it is
open-source software published under the Apache License 2.0. Additionally, because
PDFBox is written in Java, it runs on multiple platforms, including Windows, macOS,
and Linux. PDFBox's
comprehensive functionality is another significant benefit. It can perform a variety
of tasks, including creating, modifying, and extracting content from PDF files.
Additionally, it supports advanced features such as encryption and digital signatures,
which makes it an ideal tool for businesses that deal with sensitive information.
Another important benefit of PDFBox is its high performance. It can handle large
files and perform complex operations quickly, saving users valuable time and
resources. Finally, PDFBox benefits from an active community of developers and
users, who continuously work to improve and update the tool, making it an excellent
choice for anyone looking for a reliable and versatile PDF solution [12].
2.3 Java
Java is a high-level, class-based and object-oriented programming language that
has become a popular choice for software development since it was first released in
1995. Developed by Sun Microsystems (now owned by Oracle Corporation), Java
was designed to be platform-independent, meaning that it can run on multiple
operating systems without requiring recompilation [16]. This makes Java a highly
versatile language, widely used for developing everything from desktop applications
to mobile apps, web applications, and enterprise software. Java's popularity can be
attributed to its simplicity, robustness, and security features [17]. It has a vast
library of built-in classes and functions, making it easy to write complex programs
quickly. Additionally, Java has a vast community of developers, who contribute to
open-source projects, share code snippets, and provide support to new
programmers. Java is an object-oriented language, which means that it is based on
the concept of objects. Objects are instances of classes, which are templates that
define the properties and methods of an object. Java's object-oriented programming
model enables developers to write modular, reusable code that is easy to maintain
and update [18].
For Java applications to run on a computer or device, they require the Java Runtime
Environment (JRE), which is a software package that includes the Java Virtual
Machine (JVM) and other necessary components. The JVM is a key component of the
JRE, and it is responsible for executing Java bytecode [19]. Bytecode is a low-level,
machine-independent code that is generated by the Java compiler when a program
is compiled [20]. The JVM interprets this bytecode and executes it on the underlying
system, providing a platform-independent runtime environment for Java
applications. One of the key benefits of the JVM is that it provides a layer of
abstraction between the Java code and the underlying operating system. This means
that developers can write code that runs on any system that has a compatible JVM
installed, without needing to worry about the specific details of the operating system
or hardware [21]. This makes it much easier to develop cross-platform applications
that can run on a variety of devices and operating systems. The JRE also includes a
number of other components, such as class libraries, that are necessary for running
Java applications. Class libraries are collections of pre-built classes and functions
that developers can use to speed up development and reduce the amount of code
they need to write. These class libraries cover a wide range of functionality, from
basic data types and control structures to more advanced features such as network
programming and graphical user interfaces [22]. While the JRE is an essential
component of the Java ecosystem, it is important to note that it is separate from the
Java Development Kit (JDK), which includes additional tools such as the Java
compiler and debugger. The JRE is primarily used for running Java applications, while
the JDK is used for developing and compiling Java code [23].
2.4 Open Object Rexx
ooRexx offers several benefits that make it an attractive option for programmers.
Here are some of the key advantages of ooRexx [27]:
• Fewer rules: ooRexx has relatively few rules about format, allowing
programmers to write code in their preferred style. For example, an instruction
can span multiple lines, keywords can be typed in uppercase or lowercase, and
multiple instructions can be placed on a single line.
• Built-in functions and methods: ooRexx comes with a rich set of built-in
functions and methods that perform various processing, searching, and
comparison operations for text and numbers. This saves programmers time
and effort, as they don't have to write these functions from scratch.
• Clear error messages and powerful debugging: ooRexx provides clear error
messages and a powerful debugging tool (TRACE instruction) to help
programmers quickly identify and fix errors in their code. This saves time and
effort during the development and testing phases.
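The three points above can be seen in a few lines of plain ooRexx. The following is a minimal sketch and not one of the thesis examples; the variable names and sample values are chosen only for illustration.

```rexx
/* free format: keywords are case-insensitive, ";" separates clauses on one line */
SAY "upper-case keyword"; say "lower-case keyword"

/* a trailing comma continues a clause on the next line */
total = 1 + 2 + ,
        3 + 4
say total                        -- 1 + 2 + 3 + 4

/* built-in functions for text and numbers */
say reverse("Rexx")              -- reversed string
say words("a short sentence")    -- number of blank-delimited words
say max(3, 17, 5)                -- largest of the arguments

/* TRACE R shows each clause and its result while it is executed */
trace r
x = length("PDFBox")
trace off
```

Running the sketch with tracing switched on shows how each traced clause is echoed together with its result, which is usually enough to locate a faulty expression.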
2.5 Bean Scripting Framework for ooRexx
The key benefit of BSF4ooRexx is the ability to leverage existing Java libraries and
components from within ooRexx scripts. This enables developers to take advantage
of the vast number of Java libraries available and provides an easy way to add
scripting support to Java applications.
3 Installation
This chapter presents a comprehensive installation guide for various programming
languages and components. This guide is designed to provide a step-by-step process
for installing the software and tools necessary to replicate the nutshell examples
presented in this thesis. It is important to follow each step carefully to ensure that the
testing system is properly configured and able to run the necessary software.
3.1 Java
First, a Java Runtime Environment (JRE) must be installed on your computer.
3.2 ooRexx
The next thing to do is to install the ooRexx environment. To do that, follow these
three steps:
• To verify the installation, open a command prompt or terminal and enter the
following command: rexx -v. If ooRexx has been installed correctly, the output
should display the version of ooRexx that you have installed.
3.3 BSF4ooRexx
Before proceeding with the installation, make sure to uninstall any previous versions
of BSF4ooRexx on your system. To do this, use the menu "BSF4ooRexx -> Installation
-> Uninstall BSF4ooRexx".
3.4 Apache PDFBox
2. Unzip the downloaded archive: Once the download is complete, locate the
downloaded archive and extract its contents to a directory on your computer.
3. Copy the PDFBox jar files: Navigate to the directory where you extracted the
contents of the archive and copy the .jar files to the "BSF4ooRexx/lib"
directory.
4 Nutshell Examples
In this chapter, 18 nutshell examples are presented. The objective of these examples
is to give an insight into the functionality and working concept of the Java library
PDFBox. Each example starts with the code, which is then explained. At the end, a
screenshot of the result is shown. It is highly recommended to run the examples
in the given order, because many examples build on the results of previous
examples. It is also important to note that a "resources" folder exists in the root
directory, and certain examples require files from this folder to run successfully. Please
ensure that these files are present in the folder before running the code. To gain a
better understanding of the code presented in the nutshell examples, it is highly
recommended to go through the slides of Business Programming 1 & 2 by Professor
Rony Flatscher [29] [30].
4.1 Creating a PDF Document
fontclass = "org.apache.pdfbox.pdmodel.font.Standard14Fonts"
fname = BSF.loadClass(fontclass)~FontName~HELVETICA_BOLD
font = .bsf~new("org.apache.pdfbox.pdmodel.font.PDType1Font",fname)

cont~newLineAtOffset(100, 700)
cont~showText("Hello World")
cont~endText
cont~close
As one can see, the code is clearly structured and easy to understand. The very
first thing to do is to get Java support within the ooRexx environment. For that we
use the directive “::requires”, which is processed before any non-directive
statement is executed. This directive, placed on the last line of the code, loads the
ooRexx module "BSF.CLS", which camouflages Java as ooRexx, so that the
functionality of Java becomes available to the ooRexx program.
The next thing we need to do is to change the directory to the location where the
program is saved. The first line retrieves the fully qualified path of the program and
stores it in the variable “pgm”. The dots in the instruction are placeholders that discard
additional information not needed in this example. The following line uses the
retrieved path to make the program's location the current directory; the parameter
“L” stands for location. As a result, the program can be moved to any new location
without rewriting the code.
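The directory handling described above can be tried on its own with plain ooRexx, without any Java support. This is a minimal sketch; it assumes an ooRexx version whose FILESPEC function supports the location option, as the thesis code does, and the two SAY instructions are added only to show the effect.

```rexx
parse source . . pgm                 -- the dots discard two tokens; the third is the program path
call directory filespec("L", pgm)    -- "L" (location): drive and directory part of that path
say "Program:          " pgm
say "Current directory:" directory()
```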
Now we can focus on the actual tasks of this example. First, a PDF document needs
to be created. For that we use “.bsf~new” to load the Java class
“org.apache.pdfbox.pdmodel.PDDocument” and create a new instance of it, which is
assigned to the variable “doc”. It is important to use the fully qualified name of the
Java class. From then on, the Java object can be treated as if it were an ooRexx
object. For a PDF document to be valid it must contain at least one page. Therefore,
an empty page needs to be created and added to the document. For that, a new
instance of the Java class “PDPage” named “page” is created in the same way as
“doc”. The Java method “addPage” of the document is then used to add the empty
page to the document.
The PDFBox library uses the class “PDPageContentStream” for adding content to the
document. Once again, an instance of this class is created. Creating the content
stream requires two arguments: the document and the desired page. The next step
is to set up the font of the text we want to add. PDFBox comes with several built-in
fonts; one of them is selected and assigned to the variable “font”.
For adding the text content to the document, the following methods are used:
The final step is to save and close the PDF document with all the contents added. The
methods “save” and “close” are used.
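Put together, the steps of this example can be sketched as follows. This is a reconstruction under stated assumptions, not the original listing: it assumes PDFBox 3.x and BSF4ooRexx are installed, and the font size (20) and the output file name "hello.pdf" are hypothetical values chosen for illustration.

```rexx
doc  = .bsf~new("org.apache.pdfbox.pdmodel.PDDocument")   -- new, empty document
page = .bsf~new("org.apache.pdfbox.pdmodel.PDPage")       -- new, empty page
doc~addPage(page)                                         -- a valid PDF needs at least one page

-- content stream for writing onto the page
cont = .bsf~new("org.apache.pdfbox.pdmodel.PDPageContentStream", doc, page)

-- select one of the built-in fonts
fontclass = "org.apache.pdfbox.pdmodel.font.Standard14Fonts"
fname = bsf.loadClass(fontclass)~FontName~HELVETICA_BOLD
font  = .bsf~new("org.apache.pdfbox.pdmodel.font.PDType1Font", fname)

cont~beginText                     -- start a text block
cont~setFont(font, 20)             -- font size 20 is an assumption
cont~newLineAtOffset(100, 700)     -- position measured from the lower left corner
cont~showText("Hello World")
cont~endText                       -- end the text block
cont~close                         -- the stream must be closed before saving

doc~save("hello.pdf")              -- hypothetical output file name
doc~close

::requires "BSF.CLS"               -- get Java support
```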
The following figure shows a screenshot of the resulting PDF document opened with
Adobe Acrobat Pro.
4.2 Adding Text to an Existing PDF Document
The code of the second example is similar to the code of the first example. First, we
get Java support and define the directory of the program. The next step is to read the
existing PDF document created in the previous example. For that, the Java class
“java.io.File” provides access to the source file, which is then loaded with the
method “loadPDF” of the Java class "org.apache.pdfbox.Loader".
After the document is loaded, the first page is selected for editing with the method
“getPage”.
In this example, overwriting the existing contents is not desired. For that, the
“AppendMode” needs to be set to “APPEND”. This allows us to create a content stream
that works in the append mode.
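A content stream in append mode could be created roughly as shown below. This is a minimal sketch, assuming PDFBox 3.x; the input and output file names, the text, and its position are hypothetical. The nested enum is addressed here by its binary class name with "$", and the fourth argument (compress) is passed as .true on the assumption that BSF4ooRexx converts it to a Java boolean.

```rexx
file = .bsf~new("java.io.File", "hello.pdf")                 -- hypothetical input file
doc  = bsf.loadClass("org.apache.pdfbox.Loader")~loadPDF(file)
page = doc~getPage(0)                                        -- first page (index starts at 0)

csClass = "org.apache.pdfbox.pdmodel.PDPageContentStream"
append  = bsf.loadClass(csClass"$AppendMode")~APPEND         -- enum constant APPEND
cont    = .bsf~new(csClass, doc, page, append, .true)        -- keep the existing content

fontclass = "org.apache.pdfbox.pdmodel.font.Standard14Fonts"
fname = bsf.loadClass(fontclass)~FontName~HELVETICA
font  = .bsf~new("org.apache.pdfbox.pdmodel.font.PDType1Font", fname)

cont~beginText
cont~setFont(font, 15)
cont~newLineAtOffset(100, 650)     -- lower position, so that existing text is not overlapped
cont~showText("Appended text")
cont~endText
cont~close

doc~save("hello-appended.pdf")     -- hypothetical output file name
doc~close

::requires "BSF.CLS"
```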
For the new content the same methods are used. To prevent overlapping with the
existing content the starting position of the new content has been changed. In the last
step the document can be saved and closed. The following figure is a screenshot
showing the resulting PDF document.
4.3 Use Different Fonts and Colors
fname = BSF.loadClass(fontclass)~FontName~COURIER
font=.bsf~new("org.apache.pdfbox.pdmodel.font.PDType1Font",fname)
cont~setFont(font, 15)
cont~showText("This is a line in COURIER")
cont~newLine
First, a new PDF document with a blank page and the content stream are created.
Second, the font “HELVETICA” and font size 15 are set as default. Then the text
insertion is started. This process is repeated several times, each time with a
different font. For each font, the desired constant has to be fetched again from the
Java class "org.apache.pdfbox.pdmodel.font.Standard14Fonts" and set as the current
font. To change only the font size, it is sufficient to call the method “setFont” with
the desired size. The next step is to change the font color. For that, the Java class
“java.awt.Color” is loaded to provide color support. The method
“setNonStrokingColor” is used to set the text color.
The last step is to save and close the document.
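The size and color changes described above might look like this in isolation. This is a minimal sketch, assuming PDFBox 3.x; the leading value, the texts, and the output file name "fonts.pdf" are hypothetical, and the color is taken from the static field BLUE of “java.awt.Color”.

```rexx
doc  = .bsf~new("org.apache.pdfbox.pdmodel.PDDocument")
page = .bsf~new("org.apache.pdfbox.pdmodel.PDPage")
doc~addPage(page)
cont = .bsf~new("org.apache.pdfbox.pdmodel.PDPageContentStream", doc, page)

fontclass = "org.apache.pdfbox.pdmodel.font.Standard14Fonts"
fname = bsf.loadClass(fontclass)~FontName~HELVETICA
font  = .bsf~new("org.apache.pdfbox.pdmodel.font.PDType1Font", fname)

cont~beginText
cont~setLeading(30)                -- line spacing used by newLine (assumed value)
cont~newLineAtOffset(100, 700)

cont~setFont(font, 15)
cont~showText("This is a line in HELVETICA, size 15")
cont~newLine

cont~setFont(font, 25)             -- same font, only the size changes
cont~showText("This is a line in size 25")
cont~newLine

blue = bsf.loadClass("java.awt.Color")~BLUE
cont~setNonStrokingColor(blue)     -- the non-stroking color is used for filling text
cont~showText("This is a blue line")
cont~endText
cont~close

doc~save("fonts.pdf")              -- hypothetical output file name
doc~close

::requires "BSF.CLS"
```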
4.4 Extracting Text from an Existing PDF Document
After getting Java support and setting up the directory location, the source file is
loaded. To read the content, a new instance of the Java class
"org.apache.pdfbox.text.PDFTextStripper" is created. The text content is then
retrieved with the method “getText”.
The next step is to create an empty text file and extract the content to it. Hereby the
Java class “java.io.FileWriter" is needed. The following methods are used in the extract
process:
After the extract process the text content is saved without any formatting to a text file
as the following figure shows.
-- draw a line
cont~moveTo(100,700)
cont~lineTo(250,700)

-- draw a curve
cont~moveTo(350,700)
cont~curveTo(380,720,420,750,500,700)

-- draw a rectangle
cont~addRect(120,600,100,-100)

-- draw a triangle
cont~moveTo(350,500)
cont~lineTo(450,500)
cont~lineTo(400,600)
cont~lineTo(350,500)

-- [...]
cont~lineTo(200,300)
cont~lineTo(120,360)

-- draw a circle
cx=400
cy=350
r=50
k=0.552284749831
cont~moveTo(cx-r,cy)
cont~curveTo(cx-r,cy+k*r,cx-k*r,cy+r,cx,cy+r)
cont~curveTo(cx+k*r,cy+r,cx+r,cy+k*r,cx+r,cy)
cont~curveTo(cx+r,cy-k*r,cx+k*r,cy-r,cx,cy-r)
cont~curveTo(cx-k*r,cy-r,cx-r,cy-k*r,cx-r,cy)
First, a new document with a blank page and the associated content stream are
created. Then the method “~setLineWidth” sets the line thickness of the paths to be
drawn.
The first shape is a simple straight line from one point to another. The method
“~moveTo” defines the coordinates of the starting point and the method “~lineTo”
describes the path to the end point.
Drawing a curve is slightly more complex. In addition to the starting point and the end
point, two further control points are needed to describe the path of the curve. That is
why the method “~curveTo” has six parameters: the X/Y coordinates of the two control
points and of the end point.
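To see how the two control points shape the curve, the cubic Bézier formula behind “~curveTo” can be evaluated by hand. The following Python sketch is only an illustration of that arithmetic: the function name bezier_point is an assumption and not part of PDFBox or of the nutshell example. It uses the coordinates from the listing and computes the point halfway along the curve.

```python
def bezier_point(p0, p1, p2, p3, t):
    """Evaluate a cubic Bezier curve at parameter t in [0, 1]."""
    u = 1.0 - t
    # Bernstein weights of the start point, the two control points and the end point
    w0, w1, w2, w3 = u**3, 3 * u * u * t, 3 * u * t * t, t**3
    x = w0 * p0[0] + w1 * p1[0] + w2 * p2[0] + w3 * p3[0]
    y = w0 * p0[1] + w1 * p1[1] + w2 * p2[1] + w3 * p3[1]
    return (x, y)

start = (350, 700)   # cont~moveTo(350,700)
ctrl1 = (380, 720)   # first control point of cont~curveTo(...)
ctrl2 = (420, 750)   # second control point
end = (500, 700)     # end point

midpoint = bezier_point(start, ctrl1, ctrl2, end, 0.5)
print(midpoint)  # (406.25, 726.25)
```

The curve starts and ends exactly at the given points, but it only bends towards the control points without passing through them, which is why the midpoint lies below both control points.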
For drawing a rectangle, the method “~addRect” simplifies the coding. Given a starting
corner, a width and a height, this method adds all four sides of the rectangle to the
path. In the listing, the height is negative (-100), so the rectangle extends downward
from the starting corner.
The next shape is a triangle, which is formed by combining three lines: three points
are chosen and connected with each other. In the same way, a star shape can be drawn
by connecting five points with each other.
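Choosing the five points of a star by hand is error-prone, so it can help to compute them. The Python sketch below is a hypothetical helper, not taken from the nutshell example: the function name star_points as well as the center and radius are assumptions. It places five points evenly on a circle and orders them so that connecting consecutive points, and finally the last with the first, yields a star.

```python
import math

def star_points(cx, cy, r, n=5):
    """Return n points on a circle around (cx, cy), ordered so that
    connecting consecutive points draws a star (every second vertex)."""
    pts = []
    for i in range(n):
        # start at the top (90 degrees) and skip one vertex per step
        angle = math.pi / 2 + i * 2 * (2 * math.pi / n)
        pts.append((cx + r * math.cos(angle), cy + r * math.sin(angle)))
    return pts

points = star_points(150, 350, 60)   # assumed center and radius
for x, y in points:
    print(round(x, 1), round(y, 1))
```

In the ooRexx script, the first point would go to “~moveTo” and the remaining ones to “~lineTo”, followed by a final “~lineTo” back to the first point.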
The last and trickiest shape is a circle. The curves available in a PDF content stream
cannot describe a perfect circle, but combining several of them gives a very good
approximation. In this example, the circle is divided into four quarter arcs. First, the
circle center and the radius are defined. The constant “k” then determines the
coordinates of the control points: each control point lies at a distance of k times the
radius from the nearest arc end point. With these values, the four curves are drawn to
form the circle.
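The value of “k” used in the listing agrees with the common choice 4/3 · (√2 − 1) for approximating a quarter circle with a single cubic Bézier curve. The following Python sketch merely checks this arithmetic; the helper name quarter_arc_point and the sampling in 100 steps are assumptions. It evaluates the first quarter arc of the listing, from the leftmost point of the circle to its top, and measures how far the curve deviates from the true radius.

```python
import math

K = 4.0 / 3.0 * (math.sqrt(2.0) - 1.0)   # approximately 0.5522847498

def quarter_arc_point(cx, cy, r, t):
    """Point at parameter t on the Bezier curve from (cx-r, cy) to (cx, cy+r)."""
    p0 = (cx - r, cy)
    p1 = (cx - r, cy + K * r)    # control point above the start point
    p2 = (cx - K * r, cy + r)    # control point left of the end point
    p3 = (cx, cy + r)
    u = 1.0 - t
    x = u**3 * p0[0] + 3 * u * u * t * p1[0] + 3 * u * t * t * p2[0] + t**3 * p3[0]
    y = u**3 * p0[1] + 3 * u * u * t * p1[1] + 3 * u * t * t * p2[1] + t**3 * p3[1]
    return x, y

cx, cy, r = 400, 350, 50     # values from the listing
max_error = 0.0
for i in range(101):
    x, y = quarter_arc_point(cx, cy, r, i / 100)
    max_error = max(max_error, abs(math.hypot(x - cx, y - cy) - r))

print(round(K, 12))                 # 0.552284749831
print(max_error / r < 0.0003)       # deviation stays below 0.03 % of the radius
```

With a radius of 50 points, the deviation is far below what is visible on screen or in print, which is why four curves are sufficient.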
The final step is to use the method “~stroke” to make all defined paths visible. Then
the document can be saved and closed.
cont~stroke
cont~close
After the document and the content stream are created, the coordinates of the initial
point and the size of the cells are defined. Then, the numbers of columns and rows
are defined. After this preparation, the table can be drawn manually. For this, two
do loops are created, one for the rows and one for the columns. First, the method
“~addRect” is used to draw the table cell by cell. Then, a text is inserted into each of
the cells. After the loops are done, the whole table is made visible with the method
“~stroke”.
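The coordinate arithmetic inside the two loops can be sketched as follows. This Python example only illustrates the calculation; the function name cell_origins and all numeric values are assumptions, and the nutshell example may use different ones. For every cell it computes the lower-left corner that would be passed to “~addRect”.

```python
def cell_origins(x0, y0, cell_w, cell_h, rows, cols):
    """Yield (row, col, x, y) with the lower-left corner of every cell.
    Row 0 is the top row; PDF coordinates grow upward, so each
    further row lies cell_h lower on the page."""
    for row in range(rows):
        for col in range(cols):
            x = x0 + col * cell_w
            y = y0 - (row + 1) * cell_h
            yield row, col, x, y

# Assumed example values: top-left corner of the table at (50, 700)
cells = list(cell_origins(x0=50, y0=700, cell_w=100, cell_h=30, rows=3, cols=4))
for row, col, x, y in cells:
    # in the ooRexx script, cont~addRect(x, y, cell_w, cell_h) would be
    # called here, followed by the text of the cell
    print(f"cell[{row}][{col}] at ({x}, {y})")
```

Because the cells share their borders, neighbouring rectangles draw the same line twice, which is harmless for a simple table.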
As one can see, the code for this example is very simple. The Java class “Loader” is
used to load the existing PDF document. Then, the Java class “PDFRenderer” renders
the whole page into a buffered image. Saving the image file takes a few steps. First,
the Java class “ImageIO” has to be imported. Then, the Java class “java.io.File” is used
to create an empty image file. The last step is to use the method “~write” of the class
“ImageIO” to write the buffered image to the created image file.
The figure below shows a screenshot of the resulting image file opened with Microsoft
Paint.
After creating the new document and a new blank page, the Java class
“PDImageXObject” is imported to handle the image file to be inserted. The method
“~createFromFile” creates an image object that can be used for further operations. It
requires the path of the desired image file and the document object created before.
After the content stream is prepared, the method “~drawImage” inserts the image
object created from the original image file. With that done, the document can be
saved and closed.
The following figure shows the resulting document with the image inserted.
cont~newLine
cont~newLine
cont~showText("This is the fourth line")
cont~endText
cont~close
end
After the document is created, several pages are added, as the goal of this example
requires. A do loop is used to simplify the code. For demonstration, the number of
iterations is set to 3. It is of course possible to use another value or to add the pages
one by one without a loop.
In each iteration, a page with its own index is created and added to the document. A
separate content stream has to be created for each page. After the desired content
has been inserted, the content stream is closed. After all iterations are done, the
document can be saved and closed.
The first thing to do is to load a document with several pages. The resulting
document of the last example is used for that. For splitting the document, a new
instance of the Java class “org.apache.pdfbox.multipdf.Splitter” is created. First, the
document is split into a list of separate documents with the method “~split”. Next,
the method “~listIterator” creates an iterator that steps through the documents in
the list. A loop is used to save those documents. The first step is to check the number
of documents created, which defines the number of iterations. Within the loop, the
method “~next” selects a single document, which is then saved. After the loop is done,
every page of the original document is saved as a new separate document.
Figure 21: Output of “10. Splitting a PDF Document with Multiple Pages.rex”
The first thing to do is to use the Java class “java.io.File” to provide access to the
source files. Next, a new instance of the Java class “PDFMergerUtility” is created. The
file name of the new document is set with the method “~setDestinationFileName”.
Then, the method “~addSource” adds a source file to be merged. This needs to be
done for every file. Next, the memory usage setting needs to be defined. For this, the
Java class “MemoryUsageSetting” is imported and its method “~setupMainMemoryOnly”
is called. The last step is to use the method “~mergeDocuments” with the defined
memory usage setting to merge all the added documents.
The following figure shows a screenshot of the resulting document opened with Adobe
Acrobat Pro.
With PDFBox it is very easy to edit the metadata of a document. First, a document is
loaded using the Java class “org.apache.pdfbox.Loader”. Then, the method
“~getDocumentInformation” is used to load the metadata of the document, which can
now be modified. In this example, the following methods are used:
After setting the metadata, the document is saved and closed. The following figure
shows a screenshot of the modified metadata.
fontclass = "org.apache.pdfbox.pdmodel.font.Standard14Fonts"
fname = BSF.loadClass(fontclass)~FontName~HELVETICA_BOLD
font=.bsf~new("org.apache.pdfbox.pdmodel.font.PDType1Font",fname)

-- [...]
cont~setNonStrokingColor(col)
The first thing to do is to load a document with several pages. The resulting document
of the last example can be used. Next, the number of pages is counted to determine
how many iterations are needed to add the watermark. In this example, a short text is
used as watermark.
Within the loop, a content stream is created for the current page. To avoid overwriting
existing content, the append mode must be activated. After setting the font, the text
block is started with the method “~beginText”. The next thing to do is to set the
transparency of the watermark. For this, the Java class “PDExtendedGraphicsState”
needs to be imported. The alpha constant, which defines the transparency, is set with
the method “~setNonStrokingAlphaConstant”. This setting is applied to the content
stream with the method “~setGraphicsStateParameters”. Then, the font color of the
text is set to red and the font size to 70. The Java class “org.apache.pdfbox.util.Matrix”
is imported to rotate the watermark. Its method “~getRotateInstance” defines the
rotation parameters: the first parameter is the rotation angle in radians, and the next
two parameters are the coordinates of the point around which the text is rotated and
at which it starts. This rotation setting needs to be applied to the content stream as
well. The last step of the loop is to insert the text as watermark and close the content
stream.
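The effect of the rotation parameters can be checked with a few lines of arithmetic. The Python sketch below assumes the usual PDF matrix layout (a, b, c, d, e, f) with a rotation followed by a translation; the function names and the example values (45 degrees, position 100/200) are assumptions and not taken from the nutshell example. It shows where the start of the text and a point 100 units further along the baseline end up on the page.

```python
import math

def rotate_instance(theta, tx, ty):
    """Matrix (a, b, c, d, e, f) for a rotation by theta radians
    combined with a translation to (tx, ty)."""
    c, s = math.cos(theta), math.sin(theta)
    return (c, s, -s, c, tx, ty)

def transform(m, x, y):
    """Map a point from text space to page space."""
    a, b, c, d, e, f = m
    return (a * x + c * y + e, b * x + d * y + f)

m = rotate_instance(math.radians(45), 100, 200)   # assumed example values
origin = transform(m, 0, 0)             # where the text starts
along_baseline = transform(m, 100, 0)   # 100 units further along the text

print(origin)
print(tuple(round(v, 2) for v in along_baseline))
```

The start of the text stays at the given coordinates, while every following character is placed along a baseline that rises at the chosen angle. Since the angle is expected in radians, a value given in degrees has to be converted first.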
After all loops are done, the document can be saved and closed. The following figure
shows the resulting document.
The first thing to do is to create or load the PDF document that is to be secured.
In this example, the resulting document of the last example is used to demonstrate
how the securing process works.
After the document is loaded, the first part of the securing process can be started. A
standard PDF document can be opened, copied, printed or modified by anyone. To
prevent that, the Java class “org.apache.pdfbox.pdmodel.encryption.AccessPermission”
is imported to define access permissions for the user. The following methods are
used to define whether the user has a certain permission for the document.
In this example, these permissions are set to false, which means that the user is not
allowed to perform these operations.
The next part of the securing process is to encrypt the document and apply the
access permission settings defined in the first part of the process. For this, the
Java class “org.apache.pdfbox.pdmodel.encryption.StandardProtectionPolicy” needs
to be imported. Its constructor requires three parameters: the first two are the owner
password and the user password, the third is the access permission setting. The owner
password and the user password are strings that can be chosen freely. With the user
password, the document can be opened and all operations that are not forbidden by
the defined access permissions can be performed. The owner password gives full
access to the document and is not subject to any restriction.
The next step is to set the length of the secret key used to encrypt the document with
the method “~setEncryptionKeyLength”. The access permission setting is activated
with the method “~setPermissions”. Finally, the document is encrypted with the
method “~protect”.
After the securing process, the document can be saved and closed.
The following figures show the effects of the securing process.
The first thing to do is to create a new PDF document and add a blank page to it. The
PDF/A format has some requirements that need to be fulfilled. The first requirement
concerns the fonts. All fonts used in the document must be embedded in the file, so
the built-in standard fonts are not suitable. For this example, the font file “arial.ttf”
has been downloaded and put into the folder “resources”. The Java class “java.io.File”
provides access to the file and the class “org.apache.pdfbox.pdmodel.font.PDType0Font”
is imported to load the font file.
Another requirement of the PDF/A format is that metadata is defined in the document.
The tricky part is that the ISO-standardized metadata format Extensible Metadata
Platform (XMP) must be used. The PDFBox library does not support this format by
default, so the sub-library xmpbox needs to be added to handle XMP metadata. After
this is done, the Java class “org.apache.xmpbox.XMPMetadata” can be imported to
provide XMP metadata support. First, the method “~createXMPMetadata” is used to
create a new XMP metadata object. A PDF/A document requires at least two entries in
the metadata: the title and the PDF/A version of the document. These two entries are
defined with different standardized schemas. The title requires the Dublin Core
Metadata Element Set, which is used through the methods
“~createAndAddDublinCoreSchema” and “~setTitle”. The PDF/A Identification Schema
defines the entries which indicate that the file is a PDF/A
The last requirement of the PDF/A standard is to include the color profile used in
the document. For this example, the color profile “sRGB.icc” is used and saved to the
resources folder. First, the color profile is opened as an input stream with the Java
class “java.io.FileInputStream”. Next, the Java class
“org.apache.pdfbox.pdmodel.graphics.color.PDOutputIntent” is imported to create an
output intent from this input stream. The output intent needs to be described in the
document with the following methods:
• ~setInfo
• ~setOutputCondition
• ~setOutputConditionIdentifier
• ~setRegistryName
After these entries are set, the output intent is added to the document with the
method “~addOutputIntent”.
The next step is to create a content stream and insert some text into the document.
Then, the document can be saved and closed.
-- use the parser to validate the document and save the result
pclass = "org.apache.pdfbox.preflight.parser.PreflightParser"
parser=.bsf~new(pclass,file)
doc=parser~parse
if doc~validate~isValid=1
then content=file " is a valid PDF/A-1b document"
else content=file " is not a valid PDF/A-1b document"
The PDFBox library does not support PDF/A validation by default; the sub-library
preflight needs to be added.
The next step is to save the validation result to a new document. For this, a new
PDF document is created and a blank page is added. Then, a content stream is
created to insert the validation result. In the end, the document is saved and closed.
sig~setSignDate(cal)
doc~addSignature(sig)
The first step is to load an existing document that needs to be signed. In this
example, the resulting document of the last example is used. The next step is to
create and configure the signature. For this, a new instance of the Java class
“org.apache.pdfbox.pdmodel.interactive.digitalsignature.PDSignature” is created. For
configuring the signature, the following methods are used:
• ~setFilter
• ~setSubFilter
• ~setName
• ~setLocation
• ~setReason
• ~setSignDate
The next thing to do is to create the signature generator by creating a new instance
of the Java class “org.bouncycastle.cms.CMSSignedDataGenerator”. Then, an info
generator is needed. Before the info generator can be created, the Java classes
“org.bouncycastle.operator.jcajce.JcaContentSignerBuilder” and
“org.bouncycastle.operator.jcajce.JcaDigestCalculatorProviderBuilder” need to be
imported and set up. The info generator is created with the Java class
“org.bouncycastle.cms.jcajce.JcaSignerInfoGeneratorBuilder” and added to the
signature generator with the method “~addSignerInfoGenerator”. The certificate store
needs to be added too.
After the signature generator is created and configured, the signing process can be
started. First, the destination file for the signed document needs to be defined. Then,
an output stream is created for the destination file. The output stream is used by the
method “~saveIncrementalForExternalSigning”, which writes the document and
provides the data to be signed. The next step is to make the data to be signed suitable
for the Cryptographic Message Syntax (CMS) standard, which is used for digital
signatures and encryption. The content is read with the method “~getContent” and
converted to a byte array with the method “~toByteArray” of the Java class
“org.apache.commons.io.IOUtils”. The Java class
“org.bouncycastle.cms.CMSProcessableByteArray” is imported to provide CMS support.
The method “~generate” generates the signature, which can be retrieved with the
method “~getEncoded”. The last step is to use the method “~setSignature” to save
the signature to the document.
The following figure shows the resulting document with the digital signature.
contclass = "org.apache.pdfbox.pdmodel.PDPageContentStream"
cont=.bsf~new(contclass,doc,page)
The basic idea of the verifying process is to check whether the signature matches the
signed content. So, the first step is to extract both from the signed document. After
the document has been loaded, the method “~getSignatureDictionaries” is used to get
all signature dictionaries. Then, the method “~get(0)” is used to get the first signature
dictionary. The signature is extracted from the signature dictionary with the method
“~getContents”. The next step is to extract the signed content of the document. For
this, an input stream of the source document file is needed. The method
“~getSignedContent” extracts the signed content from the input stream.
The next step is to convert the extracted signed content to a new byte array that is
suitable for the Cryptographic Message Syntax (CMS) standard. The Java class
“org.bouncycastle.cms.CMSProcessableByteArray” is imported for this task. Next, the
converted signed content and the signature are used to create a new object of the
Java class “org.bouncycastle.cms.CMSSignedData”, which can be used to check whether
the signature matches the signed content. For the verifying process, a verifier needs
to be created. The first step is to get the signer info with the method “~getSignerInfos”
and the certificate that matches the signer info. Next, a verifier builder is created by the
The last thing to do is to create a new document and insert the verifying result. After
that, the document can be saved and closed.
5 Conclusion
In conclusion, this thesis has explored the use of Apache PDFBox in combination with
ooRexx to create and manipulate PDF documents. Through the 18 Nutshell examples
provided, we have demonstrated the extensive capabilities of this platform,
highlighting the power and flexibility of PDFBox for a range of practical applications.
The use of BSF4ooRexx as a bridge between Java and ooRexx has provided a valuable
toolkit for working with Java libraries such as Apache PDFBox, which has allowed
developers and users to create custom PDF manipulation solutions tailored to their
specific needs. The platform has opened new possibilities for PDF document creation
and manipulation, increasing efficiency and productivity.
Through the Nutshell examples provided, we have demonstrated the ability of PDFBox
to create and manipulate PDF documents for a range of practical applications, including
creating PDF documents, generating contents, and extracting information from PDF
files. The examples have highlighted the ease of use and flexibility of PDFBox, making
it an attractive platform for developers and users alike.
Looking towards future research, there is potential to explore the use of more complex
programs in conjunction with PDFBox, Java, and ooRexx to handle even more complex
document structures. Additionally, deeper exploration of the PDFBox library could help
identify more advanced features and capabilities, providing even greater opportunities
for developers and users.
Overall, this thesis has shown the potential of PDFBox and ooRexx for creating and
manipulating PDF documents. The Nutshell examples provided in this thesis serve as
a foundation for further research and development in this area, highlighting the
importance of exploring the potential of PDFBox to advance the capabilities of PDF
document manipulation. The combination of PDFBox and ooRexx represents a valuable
resource for anyone working with PDF documents, and the potential for further
development in this area is significant.
Appendix
Prerequisites to execute the nutshell examples:
Software:
Installation Guide:
How to use: