PDF Explained
In this chapter, we leave behind the bits and bytes of the PDF file, and consider the logical
structure. We consider the trailer dictionary, document catalog, and page tree. We enumerate
the required entries in each object. We then look at two common structures in PDF files: text
strings and dates.
Figure 4-1. Typical document structure for a two page PDF document
Trailer Dictionary
This dictionary, residing in the file’s trailer rather than the main body of the file, is one of the
first things to be processed when a program wants to read a PDF document. It contains entries
allowing the cross-reference table—and thus the file’s objects—to be read. Its important entries
are summarized in Table 4-1.
Table 4-1. Entries in a trailer dictionary (*denotes required entry)
/ID    array of two strings    Uniquely identifies the file within a workflow. The first string is decided when the file is first created, the second modified by workflow systems when they modify the file.
<<
/Size 421
/Root 377 0 R
/Info 375 0 R
/ID [<75ff22189ceac848dfa2afec93deee03> <057928614d9711db835e000d937095a2>]
>>
Once the trailer dictionary has been processed, we can go on to read the document information
dictionary and the document catalog.
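As a rough illustration of this first step, the sketch below pulls /Size, /Root and /Info out of the raw bytes of a file with regular expressions. The function name read_trailer_entries is hypothetical, and the approach is an assumption made for brevity: a real reader would parse the dictionary properly, and files that store their cross-reference data in a stream have no trailer keyword to find.

```python
import re


def read_trailer_entries(data: bytes) -> dict:
    """Return /Size, /Root and /Info from the last trailer dictionary in data.

    Hypothetical helper for illustration only; it pattern-matches the text
    instead of parsing PDF objects.
    """
    start = data.rfind(b"trailer")
    if start == -1:
        raise ValueError("no trailer keyword found")
    trailer = data[start:]
    entries = {}
    size = re.search(rb"/Size\s+(\d+)", trailer)
    if size:
        entries["Size"] = int(size.group(1))
    # An indirect reference is "object-number generation-number R".
    for key in (b"Root", b"Info"):
        ref = re.search(rb"/" + key + rb"\s+(\d+)\s+(\d+)\s+R", trailer)
        if ref:
            entries[key.decode("ascii")] = (int(ref.group(1)), int(ref.group(2)))
    return entries


sample = b"""trailer
<<
/Size 421
/Root 377 0 R
/Info 375 0 R
>>
startxref
0
%%EOF
"""
print(read_trailer_entries(sample))
```

The /Root pair identifies the document catalog and the /Info pair the document information dictionary, which are the two objects read next.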
Table 4-2. Entries in a document information dictionary. The types “text string” and “date string” are explained later in this chapter.
/Title       text string    The document’s title. Note that this has nothing to do with any title displayed on the first page.
/Subject     text string    The subject of the document. Again, this is just metadata with no particular rules about content.
/Creator     text string    The name of the program which originally created this document, if it started as another format (for example, “Microsoft Word”).
/Producer    text string    The name of the program which converted this file to PDF, if it started as another format (for example, the format of a word processor).
<<
/ModDate (D:20060926213913+02'00')
/CreationDate (D:20060926213913+02'00')
/Title (catalogueproduit-UK.qxd)
/Creator (QuarkXPress: pictwpstops filter 1.0)
/Producer (Acrobat Distiller 6.0 for Macintosh)
/Author (James Smith)
>>
The date string format (for /CreationDate and /ModDate) is discussed in the section Dates.
The text string format (which describes how different encodings can be used within the string
type) is described in Text Strings.
Document Catalog
The document catalog is the root object of the main object graph, from which all other objects
may be reached through indirect references. In Table 4-3, we list the document catalog
dictionary entries which are required, and some of the many optional ones, so as to briefly
introduce PDF topics we don’t cover elsewhere in these pages.
Table 4-3. The document catalog (*denotes required entry)
/Pages*        indirect reference to dictionary    The root node of the page tree. Page trees are discussed in Pages and Page Trees.
/PageLabels    number tree                         A number tree giving the page labels for this document. This mechanism allows for pages in a document to have more complicated numbering than just 1, 2, 3 and so on. For example, the preface of a book may be numbered i, ii, iii, whilst the main content starts again at 1, 2, 3. These page labels are displayed in PDF viewers; they have nothing to do with printed output.
/Names         dictionary                          The name dictionary. This contains various name trees, which map names to entities, to avoid having to use object numbers to reference them directly.
/Outlines      indirect reference to dictionary    The outline dictionary is the root of the document outline, commonly known as the bookmarks.
/Parent*      indirect reference to dictionary                            The parent node of this node in the page tree.
/Resources    dictionary                                                  The page’s resources (fonts, images, and so on). If this entry is omitted entirely, the resources are inherited from the parent node in the page tree. If there are really no resources, include this entry but use an empty dictionary.
/Contents     indirect reference to stream or array of such references    The graphical content of the page in one or more sections. If this entry is missing, the page is empty.
/Rotate       integer                                                     The viewing rotation of the page in degrees, clockwise from north. Value must be a multiple of 90. Default value: 0. This applies to both viewing and printing. If this entry is missing, its value is inherited from its parent node in the page tree.
/MediaBox*    rectangle                                                   The page’s media box (the size of its media, i.e., paper). For most purposes, the page size. If this entry is missing, it is inherited from its parent node in the page tree.
/CropBox      rectangle                                                   The page’s crop box. This defines the region of the page visible by default when a page is displayed or printed. If absent, its value is defined to be the same as the media box.
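The inheritance rules in this table can be sketched as a walk up the /Parent links. The model below is a simplification assumed for illustration: each node is a plain Python dict, "Parent" holds the parent dict directly instead of an indirect reference, and the function names inherited and crop_box are hypothetical.

```python
def inherited(node, key, default=None):
    """Look up key on a page, walking up the Parent links until it is found."""
    while node is not None:
        if key in node:
            return node[key]
        node = node.get("Parent")
    return default


def crop_box(page):
    """Return the page's crop box, falling back to the media box when absent."""
    if "CropBox" in page:
        return page["CropBox"]
    return inherited(page, "MediaBox")


# A page tree node carrying values that its child page does not set itself.
root = {"Type": "Pages", "MediaBox": [0, 0, 595, 842], "Rotate": 90}
page = {"Type": "Page", "Parent": root, "Resources": {}}

print(inherited(page, "MediaBox"))   # taken from the parent node
print(inherited(page, "Rotate", 0))  # taken from the parent node
print(inherited(page, "Resources"))  # the page's own empty dictionary
print(crop_box(page))                # same as the media box
```

Note that the empty /Resources dictionary on the page stops the search, which is why the table asks for an empty dictionary when a page really has no resources.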
The rectangle data structure for the media box and the other boxes is an array of four numbers.
These define the diagonally opposite corners of the rectangle—the first two elements of the
array being the x and y coordinates of one corner, the latter two elements being those of the
other. Normally, the lower-left and upper-right corners are given. So, for example:
/MediaBox [0 0 500 800]
/CropBox [100 100 400 700]
defines a 500 by 800 point page with a crop box removing 100 points on each side of the page.
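Because either pair of diagonally opposite corners may be given, a program reading a rectangle usually normalizes it first. The following is a minimal sketch of that step; the function names are hypothetical, and the sample values are chosen to match the description above (a 500 by 800 point page with 100 points cropped from each side).

```python
def normalize_rect(rect):
    """Return (llx, lly, urx, ury) whichever two opposite corners were given."""
    x1, y1, x2, y2 = rect
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))


def rect_size(rect):
    """Return (width, height) of a rectangle in points."""
    llx, lly, urx, ury = normalize_rect(rect)
    return (urx - llx, ury - lly)


media_box = [0, 0, 500, 800]
crop_box = [100, 100, 400, 700]

print(rect_size(media_box))
print(rect_size(crop_box))
# The same crop box written with the corners in the opposite order.
print(normalize_rect([400, 700, 100, 100]))
```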
The pages are linked together using a page tree, rather than a simple array. This tree structure
makes it faster to find a given page in a document with hundreds or thousands of pages. Good
PDF applications build a balanced tree (one with the minimum height for the number of
nodes). This ensures that a particular page can be located quickly. The nodes with no children
are the pages themselves. An example page tree structure for seven pages is shown in Figure 4-2.
This would be written in PDF objects as shown in Example 4-2. The entries in an intermediate
or root page tree node (i.e., not a page itself) are summarized in Table 4-5.
Figure 4-2. A page tree for seven pages. The exact shape of the tree is left to the individual PDF
application. The PDF code for this tree is shown in Example 4-2.
Example 4-2. PDF objects used to build the page tree illustrated in Figure 4-2
/Kids*     array of indirect references            The immediate child page-tree nodes of this node.
/Count*    integer                                 The number of page nodes (not other page tree nodes) which are eventual children of this node.
/Parent    indirect reference to page tree node    Reference to the parent of this node (the node of which this node is a child). Must be present if not the root node of the page tree.
In this tree, any page can be found at most two indirect references away from the root node.
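To show why /Count makes page lookup fast, here is a small sketch that finds a page by index without visiting every leaf. It assumes a simplified in-memory model in which nodes are Python dicts and "Kids" holds the child dicts directly; the "Label" key and the function names are hypothetical additions for the demonstration.

```python
def page_count(node):
    """Number of pages under a node: 1 for a page, Count for a tree node."""
    return 1 if node["Type"] == "Page" else node["Count"]


def find_page(node, index):
    """Return the page at zero-based index, skipping whole subtrees via Count."""
    if index < 0:
        raise IndexError("page index out of range")
    if node["Type"] == "Page":
        if index != 0:
            raise IndexError("page index out of range")
        return node
    for kid in node["Kids"]:
        n = page_count(kid)
        if index < n:
            return find_page(kid, index)
        index -= n  # the wanted page is not under this kid; skip it
    raise IndexError("page index out of range")


def iter_pages(node):
    """Yield every page in document order (depth-first over Kids)."""
    if node["Type"] == "Page":
        yield node
    else:
        for kid in node["Kids"]:
            yield from iter_pages(kid)


# A three-page tree: the root has one page and one intermediate node.
page1 = {"Type": "Page", "Label": "one"}
page2 = {"Type": "Page", "Label": "two"}
page3 = {"Type": "Page", "Label": "three"}
intermediate = {"Type": "Pages", "Kids": [page2, page3], "Count": 2}
root = {"Type": "Pages", "Kids": [page1, intermediate], "Count": 3}

print([p["Label"] for p in iter_pages(root)])
print(find_page(root, 2)["Label"])
```

In a balanced tree, find_page descends one level per step, so the work grows with the height of the tree instead of the number of pages.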
Text Strings
Strings outside of the actual textual content of a page (e.g., bookmark names, document
information, etc.) are known as text strings. They are encoded using either PDFDocEncoding or
(in more recent documents) Unicode. PDFDocEncoding is based on the ISO Latin-1
encoding. It is documented fully in Annex D of ISO Standard 32000-1:2008.
Text strings which are encoded as Unicode are distinguished by looking at the first two bytes:
these will be 254 followed by 255. This is the Unicode byte order mark U+FEFF, which
indicates the UTF-16BE encoding. This means a PDFDocEncoding string can’t begin with þ
(254) followed by ÿ (255), but this is unlikely to occur in any reasonable circumstance.
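The two-byte check described above can be sketched in a few lines. The function name decode_text_string is hypothetical, and the fallback is an approximation: Python has no built-in PDFDocEncoding codec, so Latin-1 stands in for it here, which differs for some byte values. A real implementation would use the full table from Annex D.

```python
def decode_text_string(raw: bytes) -> str:
    """Decode the bytes of a PDF text string into a Python str."""
    # Bytes 254, 255 are the byte order mark that signals UTF-16BE.
    if raw[:2] == b"\xfe\xff":
        return raw[2:].decode("utf-16-be")
    # Assumption: Latin-1 used as a stand-in for PDFDocEncoding.
    return raw.decode("latin-1")


print(decode_text_string(b"PDF Explained Example"))
print(decode_text_string(b"\xfe\xff\x00H\x00i"))
```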
Dates
The creation and modification dates /CreationDate and /ModDate in the document
information dictionary are examples of the PDF date format, which encodes a date in a string,
including information about the time zone. A date string has the form:
(D:YYYYMMDDHHmmSSOHH'mm')
where the parentheses indicate a string as usual. The other parts of the date are summarized in
Table 4-6.
Portion    Meaning
O          The relationship of local time to Universal Time, either +, - or Z. + signifies local time is later than UT, - earlier, and Z equal to Universal Time.
HH'        The absolute value of the offset from Universal Time in hours, in two digits from 00 to 23.
mm'        The absolute value of the offset from Universal Time in minutes, in two digits from 00 to 59.
All parts of the date after the year are optional. For example, (D:1999) is perfectly valid.
Plainly, though, if you omit one part, you must omit everything which follows, otherwise the
result would be ambiguous. The default value for DD and MM is 01; for all other parts, the
default is zero.
For example:
(D:20060926213913+02'00')
represents September 26th 2006 at 9:39:13 p.m., in a time zone two hours ahead of Universal
Time.
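The optional parts and their defaults can be sketched with a regular expression and Python's datetime module. The function name parse_pdf_date is hypothetical, and it takes the contents of the string without the surrounding parentheses. One choice here is an assumption, not something the format dictates: when no time zone part is present, the sketch returns a datetime with no time zone attached.

```python
import re
from datetime import datetime, timedelta, timezone

# Every part after the year is optional, but only trailing parts may be omitted.
_DATE_RE = re.compile(
    r"^D:(?P<year>\d{4})"
    r"(?P<month>\d{2})?(?P<day>\d{2})?"
    r"(?P<hour>\d{2})?(?P<minute>\d{2})?(?P<second>\d{2})?"
    r"(?:(?P<sign>[+\-Z])(?:(?P<off_h>\d{2})'?(?:(?P<off_m>\d{2})'?)?)?)?$"
)


def parse_pdf_date(text: str) -> datetime:
    """Parse the contents of a PDF date string such as D:20060926213913+02'00'."""
    m = _DATE_RE.match(text)
    if m is None:
        raise ValueError(f"not a PDF date string: {text!r}")
    g = m.groupdict()
    tz = None  # assumption: no zone information means a naive datetime
    if g["sign"] is not None:
        offset = timedelta(hours=int(g["off_h"] or 0), minutes=int(g["off_m"] or 0))
        tz = timezone(-offset if g["sign"] == "-" else offset)
    return datetime(
        int(g["year"]),
        int(g["month"] or 1),   # MM defaults to 01
        int(g["day"] or 1),     # DD defaults to 01
        int(g["hour"] or 0),    # remaining parts default to zero
        int(g["minute"] or 0),
        int(g["second"] or 0),
        tzinfo=tz,
    )


print(parse_pdf_date("D:20060926213913+02'00'").isoformat())
print(parse_pdf_date("D:1999").isoformat())
```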
Putting it Together
This is a manually-created text, to be processed into a valid PDF file by pdftk using the method
introduced in Chapter 2. It is a three page document, with document information dictionary and
page tree. Figure 4-3 shows this document displayed in Acrobat Reader. Figure 4-4 is the
corresponding object graph.
%PDF-1.1
% The line above is the header
1 0 obj % Top-level of page tree: has two children, page one and an intermediate page tree node
<< /Kids [2 0 R 3 0 R] /Type /Pages /Count 3 >>
endobj
4 0 obj % Content stream for page one
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page One) Tj ET
endstream
endobj
2 0 obj % Page one
<<
/Rotate 0
/Parent 1 0 R
/Resources
<< /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
/MediaBox [0.000000 0.000000 595.275590551 841.88976378]
/Type /Page
/Contents [4 0 R]
>>
endobj
5 0 obj % Document catalog
<< /PageLayout /TwoColumnLeft /Pages 1 0 R /Type /Catalog >>
endobj
6 0 obj % Page three
<<
/Rotate 0
/Parent 3 0 R
/Resources
<< /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
/MediaBox [0.000000 0.000000 595.275590551 841.88976378]
/Type /Page
/Contents [7 0 R]
>>
endobj
3 0 obj % Intermediate page tree node, linking to pages two and three
<< /Parent 1 0 R /Kids [8 0 R 6 0 R] /Count 2 /Type /Pages >>
endobj
8 0 obj % Page two
<<
/Rotate 270
/Parent 3 0 R
/Resources
<< /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
/MediaBox [0.000000 0.000000 595.275590551 841.88976378]
/Type /Page
/Contents [9 0 R]
>>
endobj
9 0 obj % Content stream for page two
<< >>
stream
q 1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Two) Tj ET Q
1. 0.000000 0.000000 1. 50. 750 cm BT /F0 16 Tf ((Rotated by 270 degrees)) Tj ET
endstream
endobj
7 0 obj % Content stream for page three
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Three) Tj ET
endstream
endobj
10 0 obj % Document information dictionary
<<
/Title (PDF Explained Example)
/Author (John Whitington)
/Producer (Manually Created)
/ModDate (D:20110313002346Z)
/CreationDate (D:2011)
>>
endobj
xref
0 11
trailer % Trailer dictionary
<<
/Info 10 0 R
/Root 5 0 R
/Size 11
/ID [<75ff22189ceac848dfa2afec93deee03> <75ff22189ceac848dfa2afec93deee03>]
>>
startxref
0
%%EOF
Figure 4-3. Example 4-3 converted to a valid PDF with pdftk and displayed in Acrobat Reader