Design RST
.. _design:
.. _design.intro:
Introduction
------------
.. _design-goals:
Design Goals
------------
The qpdf library includes support for reading and rewriting PDF files.
It aims to hide from the user details involving object locations,
modified (appended) PDF files, use of object streams, and stream
filters including encryption. It does not aim to hide knowledge of the
object hierarchy or content stream contents. Put another way, a user
of the qpdf library is expected to have knowledge about how PDF files
work, but is not expected to have to keep track of bookkeeping details
such as file positions.
There are some convenience routines for very common operations such as
walking the page tree and returning a vector of all page objects. For
full details, please see the header files
:file:`QPDF.hh` and
:file:`QPDFObjectHandle.hh`. There are also some
additional helper classes that provide higher level API functions for
certain document constructions. These are discussed in :ref:`helper-classes`.
.. _helper-classes:
Helper Classes
--------------
There are two kinds of helper classes: *document* helpers and *object*
helpers. Document helpers are constructed with a reference to a ``QPDF``
object and provide methods for working with structures that are at the
document level. Object helpers are constructed with an instance of a
``QPDFObjectHandle`` and provide methods for working with specific types
of objects.
- Core class interfaces do not know about helper classes. For example,
  no methods of ``QPDF`` or ``QPDFObjectHandle`` will include helper
  classes in their interfaces.

- Most of the time, object helpers don't know about other object
  helpers. However, in some cases, one type of object may be a
  container for another type of object, in which case it may make sense
  for the outer object to know about the inner object. For example,
  there are methods in the ``QPDFPageObjectHelper`` that know about
  ``QPDFAnnotationObjectHelper`` because references to annotations are
  contained in page dictionaries.
.. _implementation-notes:
Implementation Notes
--------------------
When reading an object from the input source, if the requested object
is inside of an object stream, the object stream itself is first read
into memory. Then the tokenizer reads objects from the memory stream
based on the offset information stored in the stream. Those individual
objects are cached, after which the temporary buffer holding the
object stream contents is discarded. In this way, the first time an
object in an object stream is requested, all objects in the stream are
cached.
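
The caching behavior described above can be illustrated with a small,
self-contained model. This is only a sketch and not qpdf's
implementation: ``ToyObjectCache``, ``ObjectStream``, and the
``streams_read`` counter are hypothetical names introduced for this
example, and the temporary buffer that the real code discards is not
modeled.

.. code-block:: c++

   #include <map>
   #include <string>
   #include <utility>
   #include <vector>

   // Hypothetical model: an object stream is a list of
   // (object id, serialized object) pairs.
   using ObjectStream = std::vector<std::pair<int, std::string>>;

   class ToyObjectCache
   {
     public:
       // The map key is the id of the object stream itself.
       explicit ToyObjectCache(std::map<int, ObjectStream> input) :
           streams(std::move(input))
       {
           // Record which stream contains each object.
           for (auto const& stream : streams) {
               for (auto const& entry : stream.second) {
                   stream_of[entry.first] = stream.first;
               }
           }
       }

       // Throws std::out_of_range for an object id in no stream.
       std::string const& get(int obj_id)
       {
           auto cached = cache.find(obj_id);
           if (cached == cache.end()) {
               // First request: read the whole containing stream and
               // cache every object in it, not only the requested one.
               ++streams_read;
               for (auto const& entry : streams.at(stream_of.at(obj_id))) {
                   cache[entry.first] = entry.second;
               }
               cached = cache.find(obj_id);
           }
           return cached->second;
       }

       int streams_read = 0; // how many times a stream had to be read

     private:
       std::map<int, ObjectStream> streams;
       std::map<int, int> stream_of;
       std::map<int, std::string> cache;
   };

In this model, requesting a second object from the same stream is served
from the cache and does not increment ``streams_read``.
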
- The client requests the root object by getting the value of the
  ``/Root`` key from the trailer dictionary. The value returned is an
  unresolved indirect ``QPDFObjectHandle``.
.. _object_internals:
Object Internals
~~~~~~~~~~~~~~~~
.. _casting:
Casting Policy
--------------
The C++ code in qpdf is free of old-style casts except where unavoidable
(e.g. where the old-style cast is in a macro provided by a third-party
header file). When there is a need for a cast, it is handled, in order
of preference, by rewriting the code to avoid the need for a cast,
calling ``const_cast``, calling ``static_cast``, calling
``reinterpret_cast``, or calling some combination of the above. As a
last resort, a compiler-specific ``#pragma`` may be used to suppress a
warning that we don't want to fix. Examples may include suppressing
warnings about the use of old-style casts in code that is shared between
C and C++ code.
When the intention is just to switch the type because of exchanging data
between incompatible interfaces, use ``QIntC``. This is the usual case.
However, there are some cases in which we are explicitly intending to
use the exact same bit pattern with a different type. This is most
common when switching between signed and unsigned characters. A lot of
qpdf's code uses unsigned characters internally, but ``std::string`` is
built on ``char``, which is signed on most platforms. Using
``QIntC::to_char`` would be wrong for converting from unsigned to signed
characters because a negative ``char`` value and the corresponding
``unsigned char`` value greater than 127 *mean the same thing*, so
``static_cast`` is used to keep the bit pattern. There are also
cases in which we use ``static_cast`` when working with bit fields where
we are not representing a numerical value but rather a bunch of bits
packed together in some integer type. Also note that ``size_t`` and
``long`` both typically differ between 32-bit and 64-bit environments,
so sometimes an explicit cast may not be needed to avoid warnings on one
platform but may be needed on another. A conversion with ``QIntC``
should always be used when the types are different even if the
underlying size is the same. QPDF's automatic build builds on 32-bit
and 64-bit platforms, and the test suite is very thorough, so it is
hard to make any of the potential errors here without being caught in
build or test.
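
The distinction between a range-checked conversion and a deliberate
reinterpretation of a bit pattern can be sketched as follows. This is an
illustration only: ``checked_convert`` and ``as_signed_byte`` are
hypothetical helpers written for this example. They are not part of
``QIntC``, whose actual interface is declared in qpdf's headers.

.. code-block:: c++

   #include <stdexcept>

   // Hypothetical stand-in for a range-checked integer conversion.
   // It throws if the value cannot be represented in the target type.
   template <typename To, typename From>
   To checked_convert(From value)
   {
       To result = static_cast<To>(value);
       // The value must survive a round trip and keep its sign.
       bool round_trips = (static_cast<From>(result) == value);
       bool same_sign =
           ((result < static_cast<To>(0)) == (value < static_cast<From>(0)));
       if (!(round_trips && same_sign)) {
           throw std::range_error("value does not fit in destination type");
       }
       return result;
   }

   // Deliberate reinterpretation of a byte: the bit pattern is the data,
   // so a checked conversion would reject perfectly valid input.
   inline signed char as_signed_byte(unsigned char c)
   {
       return static_cast<signed char>(c);
   }

With these definitions, converting the byte value 200 to ``signed char``
through ``checked_convert`` throws, while ``as_signed_byte`` keeps the
same eight bits, which is the behavior wanted for character data.
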
.. _encryption:
Encryption
----------
Starting with version 4.0.0, qpdf can read files that are not encrypted
but that contain encrypted attachments, but it cannot write such files.
qpdf also requires the password to be specified in order to open the
file, not just to extract attachments, since once the file is open, all
decryption is handled transparently. When copying files like this while
preserving encryption, qpdf will apply the file's encryption to
everything in the file, not just to the attachments. When decrypting the
file, qpdf will decrypt the attachments. In general, when copying PDF
files with multiple encryption formats, qpdf will choose the newest
format. The only exception to this is that clear-text metadata will be
preserved as clear-text if it is that way in the original file.
One point of confusion some people have about encrypted PDF files is
that encryption is not the same as password protection.
Password-protected files are always encrypted, but it is also possible
to create encrypted files that do not have passwords. Internally, such
files use the empty string as a password, and most readers try the
empty string first to see if it works and prompt for a password only
if the empty string doesn't work. Normally such files have an empty
user password and a non-empty owner password. In that way, if the file
is opened by an ordinary reader without specification of password, the
restrictions specified in the encryption dictionary can be enforced.
Most users wouldn't even realize such a file was encrypted. Since qpdf
always ignores the restrictions (except for the purpose of reporting
what they are), qpdf doesn't care which password you use. QPDF will
allow you to create PDF files with non-empty user passwords and empty
owner passwords. Some readers will require a password when you open
these files, and others will open the files without a password and not
enforce restrictions. Having a non-empty user password and an empty
owner password doesn't really make sense because it would mean that
opening the file with the user password would be more restrictive than
not supplying a password at all. QPDF also allows you to create PDF
files with the same password as both the user and owner password. Some
readers will not ever allow such files to be accessed without
restrictions because they never try the password as the owner password
if it works as the user password. Nonetheless, one of the powerful
aspects of qpdf is that it allows you to finely specify the way
encrypted files are created, even if the results are not useful to
some readers. One use case for this would be for testing a PDF reader
to ensure that it handles odd configurations of input files. If you
attempt to create an encrypted file that is not secure, qpdf will warn
you and require you to explicitly state your intention to create an
insecure file. So while qpdf can create insecure files, it won't let
you do it by mistake.
.. _random-numbers:
.. _adding-and-remove-pages:
While qpdf's API has supported adding and modifying objects for some
time, version 3.0 introduces specific methods for adding and removing
pages. These are largely convenience routines that handle two tricky
issues: pushing inheritable resources from the ``/Pages`` tree down to
individual pages and manipulation of the ``/Pages`` tree itself. For
details, see ``addPage`` and surrounding methods in
:file:`QPDF.hh`.
.. _reserved-objects:
.. _foreign-objects:
The other way to copy foreign objects is by passing a page from one
``QPDF`` to another by calling ``QPDF::addPage``. In contrast to
``QPDF::makeIndirectObject``, this method automatically distinguishes
between indirect objects in the current file, foreign objects, and
direct objects.
When you copy objects from one ``QPDF`` to another, the input source
of the original file must remain valid until you have finished with the
destination object. This is because the input source is still used
to retrieve any referenced stream data from the copied object. If
needed, there are methods to force the data to be copied. See comments
near the declaration of ``copyForeignObject`` in
:file:`include/qpdf/QPDF.hh` for details.
.. _rewriting:
- Initialize state:
- For each value that is an indirect object, grab the next object
  number (via an operation that returns and increments the number). Map
  object to new number in renumber table. Push object onto queue.
- Pop queue.
Once we have finished the queue, all referenced objects will have been
written out and all deleted objects or unreferenced objects will have
been skipped. The new cross-reference table will contain an offset for
every new object number from 1 up to the number of objects written. This
can be used to write out a new xref table. Finally we can write out the
trailer dictionary with an appropriately computed ``/ID`` (see spec, 8.3,
File Identifiers), the cross-reference table offset, and ``%%EOF``.
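
The queue-driven renumbering described above can be sketched with a toy
object graph. This is a simplified model under stated assumptions, not
the code in ``QPDFWriter``: ``ObjectGraph``, ``RenumberResult``, and
``renumber_reachable`` are hypothetical names, and objects are reduced
to the list of object numbers they reference.

.. code-block:: c++

   #include <deque>
   #include <map>
   #include <vector>

   // Hypothetical model: each old object number maps to the old object
   // numbers that the object references indirectly.
   using ObjectGraph = std::map<int, std::vector<int>>;

   struct RenumberResult
   {
       std::map<int, int> renumber;  // old object number -> new number
       std::vector<int> write_order; // old numbers in the order written
   };

   RenumberResult
   renumber_reachable(ObjectGraph const& graph, std::vector<int> const& trailer_refs)
   {
       RenumberResult result;
       std::deque<int> queue;
       int next_objid = 1;

       // Assign the next number the first time an object is seen and
       // push it onto the queue; later references reuse the mapping.
       auto enqueue = [&](int old_id) {
           if (result.renumber.count(old_id) == 0) {
               result.renumber[old_id] = next_objid++;
               queue.push_back(old_id);
           }
       };

       for (int old_id : trailer_refs) {
           enqueue(old_id);
       }
       while (!queue.empty()) {
           int old_id = queue.front();
           queue.pop_front();
           result.write_order.push_back(old_id); // stands in for writing
           auto it = graph.find(old_id);
           if (it != graph.end()) {
               for (int child : it->second) {
                   enqueue(child);
               }
           }
       }
       return result;
   }

Objects that are never reached from the trailer never receive a new
number, which is how unreferenced objects end up being skipped, and the
new numbers run from 1 up to the number of objects written.
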
.. _filtered-streams:
Filtered Streams
----------------
.. _object-accessors:
Object Accessor Methods
-----------------------
..
This section is referenced in QPDFObjectHandle.hh
*Why were type errors made into warnings?* When type checks were
introduced into qpdf in the early days, it was expected that type errors
would only occur as a result of programmer error. However, in practice,
type errors would occur with malformed PDF files because of assumptions
made in code, including code within the qpdf library and code written by
library users. The most common case would be chaining calls to
``getKey()`` to access keys deep within a dictionary. In many cases,
qpdf would be able to recover from these situations, but the old
behavior often resulted in crashes rather than graceful recovery. For
this reason, the errors were changed to warnings.
*Why even warn about type errors when the user can't usually do anything
about them?* Type warnings are extremely valuable during development.
Since it's impossible to catch at compile time things like typos in
dictionary key names or logic errors around what the structure of a PDF
file might be, the presence of type warnings can save lots of developer
time. They have also proven useful in exposing issues in qpdf itself
that would have otherwise gone undetected.
*Why does the behavior of a type exception differ between the C and C++
API?* There is no way to throw and catch exceptions in C short of
something like ``setjmp`` and ``longjmp``, and that approach is not
portable across language barriers. Since the C API is often used from
other languages, it's important to keep things as simple as possible.
Starting in qpdf 10.5, exceptions that used to crash code using the C
API will be written to stderr by default, and it is possible to register
an error handler. There's no reason that the error handler can't
simulate exception handling in some way, such as by using ``setjmp`` and
``longjmp`` or by setting some variable that can be checked after
library calls are made. In retrospect, it might have been better if the
C API object handle methods returned error codes like the other methods
and set return values in passed-in pointers, but this would complicate
both the implementation and the use of the library for a case that is
actually quite rare and largely avoidable.
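
The idea of an error handler that sets a variable to be checked after
library calls can be sketched as follows. Everything here is
hypothetical: ``ToyLibrary``, ``toy_get_int``, ``ErrorState``, and
``record_error`` are illustrative names and are not part of the qpdf C
API. See :file:`qpdf-c.h` for the real error-handling functions.

.. code-block:: c++

   #include <string>

   // Hypothetical C-style callback type; not the qpdf C API.
   using toy_error_handler = void (*)(char const* message, void* user_data);

   struct ToyLibrary
   {
       toy_error_handler handler = nullptr;
       void* user_data = nullptr;
   };

   // A call that must not throw across the C boundary: on a type error
   // it reports through the handler and returns a fallback value.
   inline int toy_get_int(ToyLibrary& lib, bool is_integer, int value)
   {
       if (!is_integer) {
           if (lib.handler != nullptr) {
               lib.handler("integer operation on non-integer object", lib.user_data);
           }
           return 0;
       }
       return value;
   }

   // Caller-side state that the handler updates.
   struct ErrorState
   {
       bool failed = false;
       std::string last_message;
   };

   inline void record_error(char const* message, void* user_data)
   {
       auto* state = static_cast<ErrorState*>(user_data);
       state->failed = true;
       state->last_message = message;
   }

A caller would register ``record_error`` together with a pointer to an
``ErrorState``, make a sequence of calls, and then inspect ``failed``
once instead of checking a return code after every call.
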
.. _smart-pointers:
Smart Pointers
--------------
This section describes changes to the use of smart pointers that were
made in qpdf 10.6.0 and 11.0.0.
Here is a list of things you need to think about when migrating from
``PointerHolder`` to ``std::shared_ptr``. After the list, we will
discuss how to address each one using the ``POINTERHOLDER_TRANSITION``
preprocessor symbol or other C++ coding techniques.
Old code:

.. code-block:: c++

   PointerHolder<X> x_p;
   X* x = new X();
   x_p = x;

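
For comparison, here is a minimal sketch of the same pattern written
with ``std::shared_ptr``. The type ``X`` and the function ``make_x`` are
placeholders for this example. Unlike ``PointerHolder``,
``std::shared_ptr`` cannot be assigned from a raw pointer implicitly
because its constructor from ``T*`` is explicit, so the object is
created with ``std::make_shared`` and the raw pointer is obtained with
``get()``.

.. code-block:: c++

   #include <memory>

   // Placeholder type standing in for the X of the surrounding examples.
   struct X
   {
       int value = 0;
   };

   inline std::shared_ptr<X> make_x()
   {
       // Create the object and its control block together.
       auto x_p = std::make_shared<X>();
       // get() takes the place of PointerHolder's getPointer().
       X* x = x_p.get();
       x->value = 1;
       return x_p;
   }
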
Old code:

.. code-block:: c++

   PointerHolder<Base> base_p;
   Derived* derived = new Derived();
   base_p = derived;

New code:

.. code-block:: c++

   std::shared_ptr<Base> base_p;
   Derived* derived = new Derived();
   base_p = std::shared_ptr<Base>(derived);

If you are not ready to take action yet, you can ``#define
POINTERHOLDER_TRANSITION 0`` before including any qpdf header file or
add the definition of that symbol to your build. This will provide the
backward-compatible ``PointerHolder`` API without any deprecation
warnings. This should be a temporary measure as ``PointerHolder`` may
disappear in the future. If you need to be able to support newer and
older versions of qpdf, there are other options, explained below.
Note that, even with ``0``, you should rebuild and test your code.
There may be compiler errors if you have containers of
``PointerHolder``, but most code should compile without any changes.
There are no uses of containers of ``PointerHolder`` in qpdf's API.
There are two significant things you can do to minimize the impact of
switching from ``PointerHolder`` to ``std::shared_ptr``:

.. list-table:: ``POINTERHOLDER_TRANSITION`` values
   :header-rows: 1

   - - value
     - meaning

   - - undefined
     - Same as ``0`` but issues a warning

   - - ``0``
     - Provide a backward compatible ``PointerHolder`` and suppress
       all deprecation warnings; supports all prior qpdf versions

   - - ``1``
     - Make the ``PointerHolder<T>(T*)`` constructor explicit;
       resulting code supports all prior qpdf versions

   - - ``2``
     - Deprecate ``getPointer()`` and ``getRefcount()``; requires
       qpdf 10.6.0 or later.

   - - ``3``
     - Deprecate all uses of ``PointerHolder``; requires qpdf 11.0.0
       or later

   - - ``4``
     - Disable all functionality from ``qpdf/PointerHolder.hh`` so
       that ``#include``-ing it has no effect other than defining
       ``POINTERHOLDER_IS_SHARED_POINTER``; requires qpdf 11.0.0 or
       later.

Based on the above, here is a procedure for preparing your code. This
is the procedure that was used for the qpdf code itself.
You can do these steps without breaking support for qpdf versions
before 10.6.0:
For example:
.. code-block:: c++
.. code-block:: c++
.. code-block:: c++

   auto p = std::make_unique<X[]>(n);
   // or, if X has a private constructor:
   auto p = std::unique_ptr<X[]>(new X[n]);

- Old code:
.. code-block:: c++
- New code:
.. code-block:: c++
.. code-block:: c++

   #include <qpdf/PointerHolder.hh>
   #ifdef POINTERHOLDER_IS_SHARED_POINTER
   std::shared_ptr<X> x;
   #else
   PointerHolder<X> x;
   #endif // POINTERHOLDER_IS_SHARED_POINTER
   x = decltype(x)(new X());

or
.. code-block:: c++

   #include <qpdf/PointerHolder.hh>
   #ifdef POINTERHOLDER_IS_SHARED_POINTER
   auto x_p = std::make_shared<X>();
   X* x = x_p.get();
   #else
   auto x_p = PointerHolder<X>(new X());
   X* x = x_p.getPointer();
   #endif // POINTERHOLDER_IS_SHARED_POINTER
   x_p->doSomething();
   x->doSomethingElse();

If you don't need to support older versions of qpdf, you can proceed
with these steps without protecting changes with the preprocessor
symbol. Here are the remaining changes.
Historical Background
~~~~~~~~~~~~~~~~~~~~~
Since its inception, the qpdf library used its own smart pointer
class, ``PointerHolder``. The ``PointerHolder`` class was originally
created long before ``std::shared_ptr`` existed, and qpdf itself
didn't start requiring a C++11 compiler until version 9.1.0, released in
late 2019. With current C++ versions, it is no longer desirable for qpdf
to have its own smart pointer class.