PDF Explained by John Whitington


Chapter 4. Document Structure

In this chapter, we leave behind the bits and bytes of the PDF file, and consider the logical
structure. We consider the trailer dictionary, document catalog, and page tree. We enumerate
the required entries in each object. We then look at two common structures in PDF files: text
strings and dates.

Figure 4-1 shows the logical structure of a typical document.



Figure 4-1. Typical document structure for a two page PDF document

Trailer Dictionary
This dictionary, residing in the file’s trailer rather than the main body of the file, is one of the
first things to be processed when a program wants to read a PDF document. It contains entries
allowing the cross-reference table—and thus the file’s objects—to be read. Its important entries
are summarized in Table 4-1.

Table 4-1. Entries in a trailer dictionary (* denotes required entry)

Key         Value type            Value

/Size*      Integer               Total number of entries in the file’s cross-reference table
                                  (usually equal to the number of objects in the file plus one).

/Root*      Indirect reference    The document catalog.
            to dictionary

/Info       Indirect reference    The document’s document information dictionary.
            to dictionary

/ID         Array of two          Uniquely identifies the file within a workflow. The first
            strings               string is decided when the file is first created, the second
                                  modified by workflow systems when they modify the file.

Here’s an example trailer dictionary:

<<
/Size 421
/Root 377 0 R
/Info 375 0 R
/ID [<75ff22189ceac848dfa2afec93deee03> <057928614d9711db835e000d937095a2>]
>>

Once the trailer dictionary has been processed, we can go on to read the document information
dictionary and the document catalog.
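
As a rough illustration of how a reader might pull the entries of Table 4-1 out of a trailer dictionary, here is a minimal Python sketch. It assumes the dictionary text has already been located in the file and holds only the simple entries shown in the example; the function name parse_trailer and its regular expressions are inventions for this illustration, not part of any PDF library, and a real reader would need a proper tokenizer.

```python
import re

def parse_trailer(text):
    """Extract /Size, /Root, /Info and /ID from simple trailer dictionary text.

    Hypothetical helper: handles only integers, indirect references and
    an array of two hexadecimal strings.
    """
    entries = {}
    size = re.search(r"/Size\s+(\d+)", text)
    if size:
        entries["Size"] = int(size.group(1))
    # Indirect references look like "377 0 R": object number, generation, R.
    for key in ("Root", "Info"):
        ref = re.search(r"/%s\s+(\d+)\s+(\d+)\s+R" % key, text)
        if ref:
            entries[key] = (int(ref.group(1)), int(ref.group(2)))
    ids = re.search(r"/ID\s*\[\s*<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>\s*\]", text)
    if ids:
        entries["ID"] = (bytes.fromhex(ids.group(1)), bytes.fromhex(ids.group(2)))
    # /Size and /Root are the required entries.
    missing = [k for k in ("Size", "Root") if k not in entries]
    if missing:
        raise ValueError("missing required trailer entries: %s" % ", ".join(missing))
    return entries


trailer = """<<
/Size 421
/Root 377 0 R
/Info 375 0 R
/ID [<75ff22189ceac848dfa2afec93deee03> <057928614d9711db835e000d937095a2>]
>>"""
result = parse_trailer(trailer)
print(result["Size"], result["Root"], result["Info"])
```

The /Root pair returned here is the object number and generation number that a reader would next look up in the cross-reference table.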

Document Information Dictionary


The document information dictionary contains the creation and modification dates of the file,
together with some simple metadata (not to be confused with the more comprehensive XMP
metadata discussed in XML Metadata).
Document information dictionary entries are described in Table 4-2. A typical document
information dictionary is given in Example 4-1.

Table 4-2. Entries in a document information dictionary. The types “text string” and “date
string” are explained later in this chapter.

Key             Value type    Value

/Title          text string   The document’s title. Note that this is nothing to do with any
                              title displayed on the first page.

/Subject        text string   The subject of the document. Again, this is just metadata with
                              no particular rules about content.

/Keywords       text string   Keywords associated with this document. No advice is given as
                              to how to structure these.

/Author         text string   The name of the author of the document.

/CreationDate   date string   The date the document was created.

/ModDate        date string   The date the document was last modified.

/Creator        text string   The name of the program which originally created this document,
                              if it started as another format (for example, “Microsoft Word”).

/Producer       text string   The name of the program which converted this file to PDF, if it
                              started as another format (for example, the format of a word
                              processor).

Example 4-1. Typical document information dictionary

<<
/ModDate (D:20060926213913+02'00')
/CreationDate (D:20060926213913+02'00')
/Title (catalogueproduit-UK.qxd)
/Creator (QuarkXPress: pictwpstops filter 1.0)
/Producer (Acrobat Distiller 6.0 for Macintosh)
/Author (James Smith)
>>

The date string format (for /CreationDate and /ModDate) is discussed in the section Dates.
The text string format (which describes how different encodings can be used within the string
type) is described in Text Strings.

Document Catalog
The document catalog is the root object of the main object graph, from which all other objects
may be reached through indirect references. In Table 4-3, we list the document catalog
dictionary entries which are required, and some of the many optional ones, so as to briefly
introduce PDF topics we don’t cover elsewhere in these pages.

Table 4-3. The document catalog (* denotes required entry)

Key                 Value type            Value

/Type*              name                  Must be /Catalog.

/Pages*             indirect reference    The root node of the page tree. Page trees are discussed in
                    to dictionary         Pages and Page Trees.

/PageLabels         number tree           A number tree giving the page labels for this document.
                                          This mechanism allows for pages in a document to have
                                          more complicated numbering than just 1, 2, 3…. For
                                          example, the preface of a book may be numbered i, ii, iii…,
                                          whilst the main content starts again at 1, 2, 3…. These
                                          page labels are displayed in PDF viewers; they have nothing
                                          to do with printed output.

/Names              dictionary            The name dictionary. This contains various name trees,
                                          which map names to entities, to prevent having to use
                                          object numbers to reference them directly.

/Dests              dictionary            A dictionary mapping names to destinations. A destination
                                          is a description of a place within a PDF document to which
                                          a hyperlink sends the user.

/ViewerPreferences  dictionary            A viewer preferences dictionary, which allows flags to
                                          specify the behavior of a PDF viewer when the document
                                          is viewed on screen, such as the page it is opened on, the
                                          initial viewing scale and so on.

/PageLayout         name                  Specifies the page layout to be used by PDF viewers.
                                          Values are /SinglePage, /OneColumn, /TwoColumnLeft,
                                          /TwoColumnRight, /TwoPageLeft, /TwoPageRight. (Default:
                                          /SinglePage). Details are in Table 28 of ISO 32000-1:2008.

/PageMode           name                  Specifies the page mode to be used by PDF viewers.
                                          Values are /UseNone, /UseOutlines, /UseThumbs,
                                          /FullScreen, /UseOC, /UseAttachments. (Default: /UseNone).
                                          Details are in Table 28 of ISO 32000-1:2008.

/Outlines           indirect reference    The outline dictionary is the root of the document outline,
                    to dictionary         commonly known as the bookmarks.

/Metadata           indirect reference    The document’s XMP metadata; see XML Metadata.
                    to stream


Pages and Page Trees


A page tree, built from page dictionaries, brings together instructions for drawing the graphical
and textual content (which we consider in Chapter 5 and Chapter 6) with the resources (fonts,
images, and other external data) which those instructions make use of. It also includes the page
size, together with a number of other boxes defining cropping and so forth.

The entries in a page dictionary are summarized in Table 4-4.


Table 4-4. Entries in a page dictionary (* denotes required entry)

Key         Value type            Value

/Type*      name                  Must be /Page.

/Parent*    indirect reference    The parent node of this node in the page tree.
            to dictionary

/Resources  dictionary            The page’s resources (fonts, images, and so on). If this
                                  entry is omitted entirely, the resources are inherited from
                                  the parent node in the page tree. If there are really no
                                  resources, include this entry but use an empty dictionary.

/Contents   indirect reference    The graphical content of the page in one or more sections.
            to stream or array    If this entry is missing, the page is empty.
            of such references

/Rotate     integer               The viewing rotation of the page in degrees, clockwise from
                                  north. Value must be a multiple of 90. Default value: 0.
                                  This applies to both viewing and printing. If this entry is
                                  missing, its value is inherited from its parent node in the
                                  page tree.

/MediaBox*  rectangle             The page’s media box (the size of its media, i.e., paper). For
                                  most purposes, the page size. If this entry is missing, it is
                                  inherited from its parent node in the page tree.

/CropBox    rectangle             The page’s crop box. This defines the region of the page
                                  visible by default when a page is displayed or printed. If
                                  absent, its value is defined to be the same as the media box.
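
The inheritance rule described for /Resources, /Rotate and /MediaBox in Table 4-4 amounts to a walk up the chain of /Parent entries. The Python sketch below illustrates the idea only: the page tree is modelled as plain dictionaries keyed by object number, and the names objects and inherited are assumptions made for this example rather than the API of any PDF library.

```python
# Hypothetical in-memory model: object number -> dictionary of entries.
# /Parent values are object numbers standing in for indirect references.
objects = {
    1: {"Type": "Pages", "Kids": [2, 3], "Count": 2,
        "MediaBox": [0, 0, 500, 800], "Rotate": 90},
    2: {"Type": "Page", "Parent": 1},
    3: {"Type": "Page", "Parent": 1, "MediaBox": [0, 0, 300, 300], "Rotate": 0},
}

def inherited(objects, number, key, default=None):
    """Look up key on a page, falling back to its ancestors in the page tree."""
    seen = set()
    while number is not None and number not in seen:
        seen.add(number)  # guard against a malformed, cyclic /Parent chain
        node = objects[number]
        if key in node:
            return node[key]
        number = node.get("Parent")
    return default

print(inherited(objects, 2, "MediaBox"))   # taken from the parent node
print(inherited(objects, 3, "MediaBox"))   # set on the page itself
# A missing crop box falls back to the media box.
print(inherited(objects, 2, "CropBox", inherited(objects, 2, "MediaBox")))
```

Testing with `key in node` rather than truthiness matters here, since a legitimate /Rotate value of 0 on a page must override a non-zero value on its parent.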

The rectangle data structure for the media box and the other boxes is an array of four numbers.
These define the diagonally opposite corners of the rectangle—the first two elements of the
array being the x and y coordinates of one corner, the latter two elements being those of the
other. Normally, the lower-left and upper-right corners are given. So, for example:

/MediaBox [0 0 500 800]
/CropBox [100 100 400 700]

defines a 500 by 800 point page with a crop box removing 100 points on each side of the page.

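
Because either pair of diagonally opposite corners may be given, code reading a rectangle usually normalizes it before use. A minimal Python sketch follows; the function names normalize_rect and rect_size are made up for this illustration.

```python
def normalize_rect(rect):
    """Return (left, bottom, right, top) whichever opposite corners were given."""
    x1, y1, x2, y2 = rect
    return (min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2))

def rect_size(rect):
    """Width and height of a rectangle, in points."""
    left, bottom, right, top = normalize_rect(rect)
    return (right - left, top - bottom)

media_box = [0, 0, 500, 800]
crop_box = [100, 100, 400, 700]

print(rect_size(media_box))   # (500, 800)
print(rect_size(crop_box))    # (300, 600)
# The same media box written from the upper-right corner to the lower-left.
print(normalize_rect([500, 800, 0, 0]))   # (0, 0, 500, 800)
```
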
The pages are linked together using a page tree, rather than a simple array. This tree structure
makes it faster to find a given page in a document with hundreds or thousands of pages. Good
PDF applications build a balanced tree (one with the minimum height for the number of
nodes). This ensures that a particular page can be located quickly. The nodes with no children
are the pages themselves. An example page tree structure for seven pages is shown in Figure 4-2.

This would be written in PDF objects as shown in Example 4-2. The entries in an intermediate
or root page tree node (i.e., not a page itself) are summarized in Table 4-5.

Figure 4-2. A page tree for seven pages. The exact shape of the tree is left to the individual PDF
application. The PDF code for this tree is shown in Example 4-2.

Example 4-2. PDF objects used to build the page tree illustrated in Figure 4-2

1 0 obj % Root node
<< /Type /Pages /Kids [2 0 R 3 0 R 4 0 R] /Count 7 >>
endobj
2 0 obj % Intermediate node
<< /Type /Pages /Kids [5 0 R 6 0 R 7 0 R] /Parent 1 0 R /Count 3 >>
endobj
3 0 obj % Intermediate node
<< /Type /Pages /Kids [8 0 R 9 0 R 10 0 R] /Parent 1 0 R /Count 3 >>
endobj
4 0 obj % Page 7
<< /Type /Page /Parent 1 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
5 0 obj % Page 1
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
6 0 obj % Page 2
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
7 0 obj % Page 3
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
8 0 obj % Page 4
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
9 0 obj % Page 5
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj
10 0 obj % Page 6
<< /Type /Page /Parent 3 0 R /MediaBox [0 0 500 500] /Resources << >> >>
endobj

Table 4-5. Entries in an intermediate or root page tree node (* denotes a required entry)

Key         Value type            Value

/Type*      name                  Must be /Pages.

/Kids*      array of indirect     The immediate child page-tree nodes of this node.
            references

/Count*     integer               The number of page nodes (not other page tree nodes) which
                                  are eventual children of this node.

/Parent     indirect reference    Reference to the parent of this node (the node of which this
            to page tree node     is a child). Must be present if not the root node of the page
                                  tree.

In this tree, any page can be found at most two indirect references away from the root node.
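
To make the role of /Kids and /Count concrete, the following Python sketch finds a page by its position in the tree of Example 4-2, using /Count to skip whole subtrees without visiting them. Objects are modelled as plain dictionaries keyed by object number; that representation and the name find_page are assumptions for illustration only.

```python
# The tree of Example 4-2, with object numbers standing in for indirect references.
objects = {
    1: {"Type": "Pages", "Kids": [2, 3, 4], "Count": 7},
    2: {"Type": "Pages", "Kids": [5, 6, 7], "Parent": 1, "Count": 3},
    3: {"Type": "Pages", "Kids": [8, 9, 10], "Parent": 1, "Count": 3},
    4: {"Type": "Page", "Parent": 1},
}
for number, parent in [(5, 2), (6, 2), (7, 2), (8, 3), (9, 3), (10, 3)]:
    objects[number] = {"Type": "Page", "Parent": parent}

def find_page(objects, root, index):
    """Return the object number of the page at zero-based index under root."""
    node = objects[root]
    if node["Type"] == "Page":
        if index == 0:
            return root
        raise IndexError("page index out of range")
    for kid in node["Kids"]:
        kid_node = objects[kid]
        # A page counts as one; a /Pages node holds /Count pages beneath it.
        size = 1 if kid_node["Type"] == "Page" else kid_node["Count"]
        if index < size:
            return find_page(objects, kid, index)
        index -= size  # skip this whole subtree
    raise IndexError("page index out of range")

print([find_page(objects, 1, i) for i in range(7)])
# [5, 6, 7, 8, 9, 10, 4]
```

Only the nodes on the path from the root to the wanted page are examined, which is why a balanced tree keeps lookups fast in long documents.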

Text Strings
Strings outside of the actual textual content of a page (e.g., bookmark names, document
information, etc.) are known as text strings. They are encoded using either PDFDocEncoding or
(in more recent documents) Unicode. PDFDocEncoding is based on the ISO Latin-1
encoding. It is documented fully in Annex D of ISO 32000-1:2008.

Text strings which are encoded as Unicode are distinguished by looking at the first two bytes:
these will be 254 followed by 255. This is the Unicode byte-order marker U+FEFF, which
indicates the UTF-16BE encoding. This means a PDFDocEncoding string can’t begin with þ
(254) followed by ÿ (255), but this is unlikely to occur in any reasonable circumstance.
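
The byte-order-marker test can be sketched in a few lines of Python. The function name decode_text_string is hypothetical, and decoding the non-Unicode case as Latin-1 is a simplifying assumption: PDFDocEncoding is based on Latin-1 but is not identical to it, so a faithful decoder would use the table in Annex D instead.

```python
def decode_text_string(raw):
    """Decode the bytes of a PDF text string to a Python str.

    Sketch only: strings without the byte-order marker are decoded as
    Latin-1, an approximation of PDFDocEncoding (the two differ for
    some byte values).
    """
    if raw[:2] == b"\xfe\xff":          # bytes 254, 255: the marker U+FEFF
        return raw[2:].decode("utf-16-be")
    return raw.decode("latin-1")

print(decode_text_string(b"PDF Explained"))
print(decode_text_string(b"\xfe\xff\x00P\x00D\x00F"))
print(decode_text_string(b"\xfe\xff" + "Résumé".encode("utf-16-be")))
```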

Dates
The creation and modification dates /CreationDate and /ModDate in the document
information dictionary are examples of the PDF date format, which encodes a date in a string,
including information about the time zone.

A date string has the format:


(D:YYYYMMDDHHmmSSOHH'mm')

where the parentheses indicate a string as usual. The other parts of the date are summarized in
Table 4-6.

Table 4-6. PDF date format constituents

Portion   Meaning

YYYY      The year, in four digits, e.g., 2008.

MM        The month, in two digits from 01 to 12.

DD        The day, in two digits from 01 to 31.

HH        The hour, in two digits from 00 to 23.

mm        The minute, in two digits from 00 to 59.

SS        The second, in two digits from 00 to 59.

O         The relationship of local time to Universal Time, either +, - or Z. + signifies local
          time is later than UT, - earlier, and Z equal to Universal Time.

HH'       The absolute value of the offset from Universal Time in hours, in two digits from 00
          to 23.

mm'       The absolute value of the offset from Universal Time in minutes, in two digits from
          00 to 59.

All parts of the date after the year are optional. For example, (D:1999) is perfectly valid.
Plainly, though, if you omit one part, you must omit everything which follows, otherwise the
result would be ambiguous. The default value for DD and MM is 01; for all other parts, the
default is zero.

For example:
(D:20060926213913+02'00')

represents September 26th 2006 at 9:39:13 p.m., in a time zone two hours ahead of Universal
Time.
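
One way to read this format in Python is with a regular expression whose fields after the year are all optional. The sketch below is illustrative only: the names DATE_RE and parse_pdf_date are inventions for this example, it takes the string contents without the enclosing parentheses, and it treats a missing time zone as Universal Time, following the "default is zero" rule above.

```python
import re
from datetime import datetime, timedelta, timezone

DATE_RE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:([+\-Z])(?:(\d{2})'(?:(\d{2})'?)?)?)?$"
)

def parse_pdf_date(text):
    """Parse the contents of a PDF date string (without the parentheses).

    Returns a timezone-aware datetime. Omitted fields take the defaults:
    01 for month and day, zero for everything else (assumption: a missing
    time zone is read as Universal Time).
    """
    match = DATE_RE.match(text)
    if match is None:
        raise ValueError("not a PDF date string: %r" % text)
    year, month, day, hour, minute, second, sign, off_h, off_m = match.groups()
    offset = timedelta(hours=int(off_h or 0), minutes=int(off_m or 0))
    if sign == "-":
        offset = -offset
    return datetime(
        int(year), int(month or 1), int(day or 1),
        int(hour or 0), int(minute or 0), int(second or 0),
        tzinfo=timezone(offset),
    )

print(parse_pdf_date("D:20060926213913+02'00'").isoformat())
# 2006-09-26T21:39:13+02:00
print(parse_pdf_date("D:1999").isoformat())
# 1999-01-01T00:00:00+00:00
print(parse_pdf_date("D:20110313002346Z").isoformat())
# 2011-03-13T00:23:46+00:00
```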

Putting it Together
Example 4-3 is a manually created text file, to be processed into a valid PDF file by pdftk
using the method introduced in Chapter 2. It is a three-page document, with a document
information dictionary and a page tree. Figure 4-3 shows this document displayed in Acrobat
Reader. Figure 4-4 is the corresponding object graph.

Example 4-3. A three page document with document information dictionary

%PDF-1.1
% The line above is the header
1 0 obj % Top-level of page tree: has two children, page one and an intermediate page tree node
<< /Kids [2 0 R 3 0 R] /Type /Pages /Count 3 >>
endobj
4 0 obj % Contents stream for page one
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page One) Tj ET
endstream
endobj
2 0 obj % Page one
<<
/Rotate 0
/Parent 1 0 R
/Resources
<< /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
/MediaBox [0.000000 0.000000 595.275590551 841.88976378]
/Type /Page
/Contents [4 0 R]
>>
endobj
5 0 obj % Document catalog
<< /PageLayout /TwoColumnLeft /Pages 1 0 R /Type /Catalog >>
endobj
6 0 obj % Page three
<<
/Rotate 0
/Parent 3 0 R
/Resources
<< /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
/MediaBox [0.000000 0.000000 595.275590551 841.88976378]
/Type /Page
/Contents [7 0 R]
>>
endobj
3 0 obj % Intermediate page tree node, linking to pages two and three
<< /Parent 1 0 R /Kids [8 0 R 6 0 R] /Count 2 /Type /Pages >>
endobj
8 0 obj % Page two
<<
/Rotate 270
/Parent 3 0 R
/Resources
<< /Font << /F0 << /BaseFont /Times-Italic /Subtype /Type1 /Type /Font >> >> >>
/MediaBox [0.000000 0.000000 595.275590551 841.88976378]
/Type /Page
/Contents [9 0 R]
>>
endobj
9 0 obj % Content stream for page two
<< >>
stream
q 1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Two) Tj ET Q
1. 0.000000 0.000000 1. 50. 750 cm BT /F0 16 Tf ((Rotated by 270 degrees)) Tj ET
endstream
endobj
7 0 obj % Content stream for page three
<< >>
stream
1. 0.000000 0.000000 1. 50. 770. cm BT /F0 36. Tf (Page Three) Tj ET
endstream
endobj
10 0 obj % Document information dictionary
<<
/Title (PDF Explained Example)
/Author (John Whitington)
/Producer (Manually Created)
/ModDate (D:20110313002346Z)
/CreationDate (D:2011)
>>
endobj
xref
0 11
trailer % Trailer dictionary
<<
/Info 10 0 R
/Root 5 0 R
/Size 11
/ID [<75ff22189ceac848dfa2afec93deee03> <75ff22189ceac848dfa2afec93deee03>]
>>
startxref
0
%%EOF

Figure 4-3. Example 4-3 converted to a valid PDF with pdftk and displayed in Acrobat Reader

Figure 4-4. Object graph for Example 4-3
