PDF Reader From Scratch
Subsequently, ISO published ISO 32000-2:2017, defining PDF 2.0, which introduced
new features and provided critical clarifications and corrections to ambiguities in the
PDF 1.7 specification.12 A dated revision, ISO 32000-2:2020, incorporates further
improvements identified during early adoption.12 Access to PDF 2.0 specifications was
initially restricted but is now available at no cost via the PDF Association, thanks to
industry sponsorship.12 For developers building a reader, particularly one aiming for
broad compatibility and robustness, consulting both ISO 32000-1 (for PDF 1.x files)
and ISO 32000-2 (for PDF 2.0 and clarifications applicable to earlier versions) is
essential.12 These documents provide the authoritative definition of PDF syntax,
objects, operators, and rendering rules.2
PDF files are designed to be read starting from the end. The reader locates the
startxref value, jumps to the specified XRef table, reads the object locations, and uses
the /Root reference in the trailer to find the Document Catalog, which serves as the
entry point to the document's page structure and resources.16
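As a concrete illustration, locating the startxref value can be sketched as follows; find_startxref is a hypothetical helper operating on the raw file bytes, not part of any standard API:

```python
import re

def find_startxref(data: bytes) -> int:
    """Locate the startxref value near the end of a PDF file.

    Scans the last 1 KB of the raw bytes for the startxref keyword
    followed by the byte offset and the %%EOF marker. A sketch: a
    robust reader would tolerate trailing junk and broken EOF lines.
    """
    tail = data[-1024:]
    m = re.search(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if m is None:
        raise ValueError("startxref not found; XRef may need rebuilding")
    return int(m.group(1))
```

The returned offset is then the seek target for reading the XRef table or stream.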
Indirect objects are fundamental to PDF's structure. They are defined using an object
number, a generation number (usually 0 for new objects), and the obj and endobj
keywords (e.g., 4 0 obj... endobj).16 They are referenced using the object number,
generation number, and the R keyword (e.g., 4 0 R).17 This mechanism allows objects,
such as fonts or images used on multiple pages, to be defined once and reused,
reducing file size.19 It also facilitates incremental updates, where modified objects can
be appended to the file along with a new XRef section without rewriting the entire
document.10
Once at the specified offset, the parser reads the XRef information. A traditional XRef
section starts with the keyword xref, followed by one or more subsection headers.
Each header consists of two numbers: the starting object number for that subsection
and the count of entries in it.15 Following the header are the entries themselves, each
exactly 20 bytes long (including the line terminator).15 Each entry corresponds to one
object number and contains:
1. A 10-digit byte offset from the start of the file to the beginning of the object
definition.
2. A space.
3. A 5-digit generation number.
4. A space.
5. A keyword: n if the object is in use, or f if the object is free (deleted).15
The first object (number 0) is always marked free (f) and has a generation number of
65535; it serves as the head of the linked list of free objects.15
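The fixed-format table described above can be parsed with a short routine. The following sketch assumes the section has already been split into decoded text lines; a production parser would work on raw bytes and enforce the exact 20-byte entry length:

```python
def parse_xref_table(lines):
    """Parse a traditional cross-reference section.

    `lines` are the text lines starting at the `xref` keyword.
    Returns {object_number: (offset, generation, in_use)}.
    """
    assert lines[0].strip() == "xref"
    entries, i = {}, 1
    # Each subsection header is "start count"; stop at `trailer`.
    while i < len(lines) and lines[i].strip() and lines[i].split()[0].isdigit():
        start, count = (int(t) for t in lines[i].split())
        i += 1
        for obj_num in range(start, start + count):
            offset, gen, kind = lines[i].split()
            entries[obj_num] = (int(offset), int(gen), kind == "n")
            i += 1
    return entries
```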
Starting with PDF 1.5, an alternative, more compact format called an XRef stream was
introduced. This encodes the cross-reference information within a stream object
itself, offering compression and greater flexibility than the fixed-format table. Parsing
an XRef stream involves decoding the stream content (often using /FlateDecode) and
interpreting its structured data according to rules defined in the specification (ISO
32000-1, Section 7.5.8). A reader must be prepared to handle both traditional XRef
tables and XRef streams.
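Interpreting the decoded entry data of an XRef stream depends on the /W (field widths) and /Index (subsection ranges) entries of its stream dictionary. A minimal sketch, assuming the stream content has already been decompressed:

```python
def parse_xref_stream(data: bytes, w, index_pairs):
    """Interpret decoded XRef-stream data per ISO 32000-1, 7.5.8.

    `w` is the /W array of field widths; `index_pairs` the /Index
    array of [first count ...] pairs (defaults to [0, Size]).
    A width of 0 means the field takes its default value; for the
    first field (the entry type) that default is 1, "in use".
    Returns {obj_num: (type, field2, field3)}.
    """
    out, pos = {}, 0
    for first, count in zip(index_pairs[::2], index_pairs[1::2]):
        for obj in range(first, first + count):
            fields = []
            for width in w:
                fields.append(int.from_bytes(data[pos:pos + width], "big"))
                pos += width
            if w[0] == 0:
                fields[0] = 1  # default entry type
            out[obj] = tuple(fields)
    return out
```

For type-1 entries, field 2 is the byte offset and field 3 the generation number; type-2 entries instead reference an object stream and an index within it.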
The parser then performs a file seek operation to position the file pointer at that byte
offset.15 It reads the object definition, starting with the object number, generation
number, and obj keyword, and ending with the endobj keyword.16 The content between
obj and endobj is then parsed according to the basic PDF object syntax rules
(detecting Booleans, Numbers, Strings, Names, Arrays, Dictionaries, or Streams).14
To optimize performance, especially for objects referenced multiple times (like shared
resource dictionaries or fonts), it is common practice to cache parsed indirect objects
in memory. Subsequent references to the same object can then retrieve the cached
version, avoiding redundant file I/O and parsing.
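Such a cache might look like the following sketch, where parse_object_at is a hypothetical callback that parses the object found at a given byte offset:

```python
class ObjectStore:
    """Cache for parsed indirect objects, keyed by (number, generation).

    `xref` maps (num, gen) -> byte offset; `parse_object_at` is a
    hypothetical parser invoked at most once per object.
    """
    def __init__(self, xref, parse_object_at):
        self.xref = xref
        self.parse_object_at = parse_object_at
        self.cache = {}

    def resolve(self, num, gen=0):
        """Resolve a reference like `4 0 R`, parsing on first use only."""
        key = (num, gen)
        if key not in self.cache:
            self.cache[key] = self.parse_object_at(self.xref[key])
        return self.cache[key]
```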
Crucially, the stream dictionary contains a /Length entry specifying the number of
bytes in the (potentially encoded) stream data.15 It may also contain a /Filter entry,
whose value is either a single Name object or an Array of Name objects identifying the
filter(s) applied.15 If multiple filters are specified, they must be applied in the order
listed to decode the data.15 An optional /DecodeParms entry (a dictionary or an array
of dictionaries) provides parameters specific to each filter, if needed.15
The parser reads the number of raw bytes specified by /Length and then applies the
corresponding decoding algorithm(s) indicated by /Filter. Common filters include:
● /FlateDecode: Decompresses data using the zlib/deflate algorithm (very common
for text and image data).15
● /LZWDecode: Uses Lempel-Ziv-Welch compression.15
● /ASCIIHexDecode: Decodes data represented as ASCII hexadecimal characters.15
● /ASCII85Decode: Decodes ASCII base-85 encoded data, more compact than
hex.15
● /CCITTFaxDecode: Decompresses image data using CCITT Group 3 or Group 4
fax standards.15
● /DCTDecode: Decompresses JPEG baseline image data.15
● /RunLengthDecode: Simple run-length encoding scheme.15
● /JBIG2Decode: For bi-level (black and white) image data, often used in scanned
documents.15
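Chained filters are applied left to right as listed in /Filter. The sketch below implements only two of the filters above using Python's standard library; the others require dedicated codecs:

```python
import binascii
import zlib

def decode_stream(raw: bytes, filters):
    """Apply a /Filter chain in listed order.

    Only /FlateDecode and /ASCIIHexDecode are sketched here; a real
    reader needs the full filter set (LZW, CCITT, DCT, JBIG2, ...).
    """
    decoders = {
        "/FlateDecode": zlib.decompress,
        "/ASCIIHexDecode": lambda b: binascii.unhexlify(
            b.replace(b" ", b"").replace(b"\n", b"").rstrip(b">")),
    }
    for name in filters:  # first-listed filter is applied first
        raw = decoders[name](raw)
    return raw
```

Note that /DecodeParms (e.g. PNG predictors for /FlateDecode) would add a post-processing step per filter, omitted here.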
The process of parsing a PDF reveals its non-linear nature. Locating the XRef requires
reading the end of the file 17, potentially following a chain of /Prev links backward
through updates 10, and then jumping to various byte offsets scattered throughout the
file based on the XRef data.15 Parsing a single object, like a Page dictionary, often
necessitates resolving indirect references within it, leading to the parsing of other
objects, such as its /Contents stream or its /Resources dictionary.16 Furthermore,
interpreting a stream object requires parsing its dictionary first to determine its length
and any required decoding filters.15 The introduction of object streams adds another
layer of indirection, where parsing the container stream is a prerequisite to accessing
the individual objects compressed within it.15 This interconnectedness and
dependency imply that a PDF parser cannot operate purely sequentially. It must
handle object loading on demand, potentially triggering further lookups and parsing
steps in a recursive or iterative manner. Careful state management and robustness
against broken links or invalid object definitions are crucial for successfully navigating
the potentially complex graph structure of a PDF document.
Text rendering is governed by a specific subset of the graphics state known as the
text state.21 Key parameters within the text state include the current font and size (set
by Tf), character spacing, word spacing, horizontal scaling, text leading (line spacing),
and the text rendering mode (e.g., fill, stroke, clip).
Text elements are typically enclosed within a text object, delimited by the BT (Begin
Text) and ET (End Text) operators.16 Most operators that affect the text state or draw
text glyphs are only valid between BT and ET.
Several core operators are used for positioning and drawing text:
● Tf (Set font): Takes a font resource name (defined in the page's /Resources) and
a size as operands (e.g., /F1 12 Tf) to set the active font and text size.16
● Text Positioning Operators (Td, TD, Tm, T*): These operators manipulate the
text matrix, which controls the position and orientation of text being drawn. Td
moves the start of the next line relative to the current line start. TD is similar but
also sets the leading. T* moves to the start of the next line using the current
leading. Tm directly sets the text matrix.21
● Tj (Show text): Takes a string operand and draws the corresponding glyphs at
the current text position, advancing the position after each glyph based on its
width.
● TJ (Show text with adjustments): Takes an array operand containing strings and
numeric adjustments. It draws the strings, applying the specified numeric
adjustments (in thousandths of text space units) between them. This allows for
precise control over glyph spacing and kerning (e.g., [(W) -120 (o) -110 (r) -110 (l)
-110 (d)] TJ).
● ' (Move to next line and show text): Equivalent to T* followed by Tj.
● " (Set word/char spacing, move to next line, show text): Equivalent to setting
word and character spacing, then performing '.
Within a text object (BT/ET), glyphs are positioned relative to the origin using the text
matrix (Tm). This matrix is initialized at the start of the text object and updated by
operators like Td, TD, T*, and Tm itself.21 When a text-showing operator like Tj or TJ is
encountered, the position of each glyph is determined relative to the current text
matrix.
For a Tj operator showing a string, the first glyph is placed at the origin defined by the
text matrix. Subsequent glyphs are placed sequentially, with the text matrix being
updated after each glyph based on its width (obtained from the font's metrics) and
any active character or word spacing parameters. The TJ operator provides more
explicit control, allowing numeric adjustments to be inserted between strings or
individual characters within the array operand, directly modifying the spacing relative
to the implicit width-based advancement.
To calculate the final, absolute position of a glyph on the page canvas (necessary for
rendering or layout analysis), its position in text space must be transformed first by
the current text matrix and then by the CTM that was active when the text object
began. The effective transformation is FinalMatrix = TextMatrix * CTM. Tracking both
matrices accurately is crucial.
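The two computations described above, the matrix product and the per-glyph advance, can be sketched as follows. The matrices use the six-element [a b c d e f] row form of cm and Tm, and glyph_advance follows the displacement formula of ISO 32000-1, 9.4.4:

```python
def mat_mul(a, b):
    """Multiply two PDF matrices [a b c d e f], applying `a` first."""
    a0, a1, a2, a3, a4, a5 = a
    b0, b1, b2, b3, b4, b5 = b
    return [a0*b0 + a1*b2,      a0*b1 + a1*b3,
            a2*b0 + a3*b2,      a2*b1 + a3*b3,
            a4*b0 + a5*b2 + b4, a4*b1 + a5*b3 + b5]

def glyph_advance(width_1000, font_size, char_spacing=0.0,
                  word_spacing=0.0, h_scale=1.0, tj_adjust=0.0):
    """Horizontal displacement after showing one glyph:
        tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th
    where w0 is the glyph width in text space (widths are stored in
    1/1000 units). Word spacing applies only to byte 32 in simple
    fonts; enforcing that is the caller's job in this sketch.
    """
    return ((width_1000 / 1000.0 - tj_adjust / 1000.0) * font_size
            + char_spacing + word_spacing) * h_scale
```

mat_mul(text_matrix, ctm) then yields the device-space placement matrix for the next glyph.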
A key aspect of PDF text handling is that glyph positions are often implicit and
cumulative, rather than explicitly defined for every character.21 This stems from PDF's
PostScript heritage, where drawing commands sequentially modify a current point or
state.2 Operators like Td and T* specify movement relative to the current text line's
start or the previous position.21 The standard Tj operator relies entirely on the defined
widths of glyphs in the current font to determine how far to advance the text position
after drawing each character. The TJ operator allows fine-tuning this implicit advance
with explicit numeric adjustments. This cumulative approach is efficient for file size
but makes position extraction complex. A reader cannot simply find Tj operators; it
must simulate the stateful execution of the content stream within BT/ET blocks. This
involves maintaining the current text matrix and CTM, parsing operators sequentially,
accessing font metric information (glyph widths) to calculate advances, applying
spacing parameters, and performing the matrix multiplications (TextMatrix * CTM) to
determine the final coordinates for each glyph rendered by Tj or TJ. Failure to
accurately model this cumulative process will result in incorrect text positioning and
layout.
A font dictionary object contains crucial information about the font, including:
● /Type: Must be /Font.
● /Subtype: Specifies the font technology (e.g., /Type1, /TrueType, /Type3, /Type0
for composite fonts).
● /BaseFont: The PostScript name of the font (e.g., /Times-Roman).
● /Encoding: Defines the mapping from character codes (bytes) in the text strings
(Tj/TJ) to glyph names or character identifiers (CIDs) within the font. This can be a
predefined encoding name (e.g., /WinAnsiEncoding) or a reference to a custom
Encoding dictionary.
● /FirstChar, /LastChar, /Widths: For simple fonts, these define the range of
character codes covered and provide an array of corresponding glyph widths.
● /FontDescriptor: A reference to another dictionary containing detailed font
metrics (e.g., ascent, descent, cap height, italic angle, stem widths, font bounding
box) and potentially a reference to an embedded font program stream (e.g.,
/FontFile, /FontFile2, /FontFile3). (See ISO 32000, Section 9).
This information is critical. For rendering, the reader needs the correct font program
(either embedded or located on the system), the /Encoding to map character codes
to glyphs, and the metrics (/Widths, FontDescriptor data) for accurate positioning and
line spacing.21 For text extraction, especially aiming for structure preservation,
knowing the font name, size (from Tf), and style attributes (often inferred from the
/BaseFont name or /FontDescriptor flags like /ItalicAngle) helps in reconstructing the
visual appearance and identifying logical elements like headings. Precise positioning
for layout analysis relies heavily on accurate glyph widths derived from the /Widths
array or the font program itself.
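Looking up a glyph width for a simple font is a small but essential routine; the missing_width parameter stands in for the descriptor's /MissingWidth and is an assumption of this sketch:

```python
def glyph_width(code, first_char, widths, missing_width=0):
    """Width in 1/1000 text-space units for character `code` of a
    simple font, from its /FirstChar and /Widths array. Codes outside
    the covered range fall back to the descriptor's /MissingWidth.
    """
    idx = code - first_char
    if 0 <= idx < len(widths):
        return widths[idx]
    return missing_width
```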
Once a path (or multiple subpaths) is constructed, it can be rendered using painting
operators:
● S (Stroke): Draws the path outlines using the current stroke color and line style
parameters.
● f or F (Fill): Fills the interior of the path using the current non-stroking color,
applying the non-zero winding number rule to determine the inside region.
● f* (eoFill): Fills the path using the even-odd rule.
● B, B*, b, b*: Various combinations of filling and stroking the path simultaneously.
These operators typically consume coordinate operands or other parameters from the
content stream stack (e.g., 100 200 m 300 200 l S draws a horizontal line).
Crucially, the graphics state, including the CTM, can be saved onto a stack using the q
(save graphics state) operator and restored using the Q (restore graphics state)
operator. This allows transformations or other state changes (like setting a clipping
path) to be applied locally to a specific set of drawing operations without affecting
the surrounding content.
The graphics state in PDF operates in a modal and hierarchical fashion. Operators like
w, J, rg, cm, or gs (set graphics state from ExtGState dictionary resource) modify the
current state, and these settings persist, affecting all subsequent drawing operations
until they are explicitly changed again or the graphics state is popped from the stack
using the Q operator.19 The q operator pushes the entire current graphics state onto a
stack.10 This modality reduces redundancy, as parameters don't need to be repeated
for every command.21 The q/Q mechanism enables localized changes; for example,
one might save the state (q), apply a rotation (cm), draw an object, and then restore
the state (Q) to revert the rotation before drawing subsequent objects.10 Similarly, the
gs operator allows complex sets of parameters (like transparency settings) stored in
an /ExtGState resource dictionary to be applied with a single command.19 This stateful,
stack-based model requires the renderer to meticulously track all components of the
current graphics state – including the CTM, current colors and color spaces, line
styles, text state parameters, clipping path, transparency settings, etc. – and manage
the graphics state stack correctly. Errors in state management are a common source
of rendering bugs, leading to incorrect positions, sizes, colors, or styles.
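A minimal model of the q/Q stack, tracking only a few of the many graphics-state parameters:

```python
import copy

class GraphicsState:
    """Tiny subset of the PDF graphics state; the real state also
    carries color spaces, dash pattern, clip path, text state, etc."""
    def __init__(self):
        self.ctm = [1, 0, 0, 1, 0, 0]
        self.line_width = 1.0
        self.fill_color = (0.0, 0.0, 0.0)

class Renderer:
    def __init__(self):
        self.gs = GraphicsState()
        self.stack = []

    def op_q(self):
        # q: push a deep copy of the entire current state
        self.stack.append(copy.deepcopy(self.gs))

    def op_Q(self):
        # Q: discard the current state, restore the saved one
        self.gs = self.stack.pop()
```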
If a font is not embedded, the PDF reader must attempt to find a suitable substitute
font on the host operating system. This relies on matching the font name specified in
the PDF (e.g., /BaseFont) with available system fonts. This approach is inherently less
reliable. A matching font might not be found, or a font with the same name might have
different character metrics or glyph shapes, leading to incorrect text layout, spacing,
line breaks, or even missing characters (often displayed as '.notdef' glyphs or blanks).
While the specification defines 14 "Standard Type 1 Fonts" (e.g., Times-Roman,
Helvetica, Courier, Symbol, ZapfDingbats) that viewers were historically expected to
provide, modern best practice strongly favors embedding all necessary fonts or font
subsets for maximum portability and fidelity.1
To minimize file size impact, PDF supports font subsetting. Instead of embedding the
entire font file (which can be large), only the data for the glyphs actually used within
the document is included.1 The font name is typically modified (e.g., prefixed with six
random letters and a plus sign) to indicate it's a subset and avoid conflicts with system
fonts.
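Detecting and stripping such a subset tag is straightforward; the six-uppercase-letters-plus-sign convention is the one described above:

```python
import re

SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")

def strip_subset_tag(base_font: str) -> str:
    """Remove the subset prefix (e.g. 'ABCDEF+') from a /BaseFont
    name, if present, leaving the underlying font name."""
    return SUBSET_TAG.sub("", base_font)
```

The stripped name is what a reader would match against system fonts when substitution is unavoidable.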
B. Font Encodings
An encoding defines the crucial mapping between the numerical character codes
used within text strings (operands to Tj and TJ) and the actual glyph descriptions
within the font program. A single byte in a string might map to the glyph 'A' under one
encoding but '€' or an accented character under another.
For reliable text extraction (copy-paste, search, accessibility), the reader needs to
reverse this mapping. The optional /ToUnicode entry in a font dictionary references a
special CMap that provides a mapping from character codes back to standard
Unicode values. This is particularly important for fonts with custom encodings or
symbolic fonts where the character codes don't correspond to standard values. (See
ISO 32000-1, Section 9.10.2).
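Extracting the simplest form of these mappings, beginbfchar sections, can be sketched as follows; this deliberately ignores bfrange sections and multi-code sources, which a complete CMap parser must also handle:

```python
import re

def parse_bfchar(cmap_text: str):
    """Extract code -> Unicode mappings from beginbfchar/endbfchar
    sections of a /ToUnicode CMap. A sketch of the idea only."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            code = int(src, 16)
            # The destination encodes one or more UTF-16BE code units.
            mapping[code] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping
```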
These heuristic methods are complex to implement robustly and can be prone to
errors, especially with unconventional layouts or poorly structured documents.
It achieves this through a Structure Tree Root, referenced from the document
Catalog. This tree organizes page content items (text fragments, images, vector
graphics) into a hierarchy of logical elements, using standard structure types
analogous to HTML tags, such as:
● <P> for paragraphs
● <H1>, <H2>, etc. for headings
● <L> for lists (containing <LI> list items)
● <Table> for tables (containing <TR>, <TH>, <TD>)
● <Figure> for images or graphics
● <Artifact> for purely decorative elements that are not part of the logical content.
However, creating Tagged PDFs is optional for PDF writers, and support among
consuming devices (including AT) has historically been uneven, partly due to
ambiguities in early specifications.1 While PDF 2.0 improved the definitions 1, many
existing PDFs are untagged. Therefore, a general-purpose PDF reader aiming for
structure preservation still needs to implement heuristic layout analysis as a fallback
for untagged documents.
Two other crucial aspects of the graphics state are clipping and transparency:
● Clipping Paths: Any path constructed using operators like m, l, c, re can be
designated as the current clipping path using the W (non-zero winding rule) or W*
(even-odd rule) operators. Subsequent drawing operations will only affect the
areas of the page inside the clipping path. The clipping path is part of the
graphics state and is saved and restored by q and Q.
● Transparency: Introduced in PDF 1.4, the transparency model allows objects to
be drawn partially or fully transparent, blending with the content beneath them.
Key concepts include:
○ Alpha: Objects can have constant stroking (/CA) and non-stroking (/ca) alpha
values, controlling their opacity. These are typically set via the gs operator
referencing an /ExtGState dictionary.
○ Blend Modes (/BM): Define how the color of a transparent object interacts
with the color of the background (e.g., /Normal, /Multiply, /Screen, /Overlay).
Set via /ExtGState.
○ Soft Masks (/SMask): Allow the transparency of an object to be determined
spatially by the luminosity or alpha values of another graphical object (often
an image or a transparency group), enabling effects like feathered edges or
complex masking. Set via /ExtGState.
○ Transparency Groups: A sequence of objects can be grouped together
(using a /Group dictionary with /S /Transparency). The objects within the
group are composited against each other in an isolated context (often an
offscreen buffer), and the final result of the group is then composited onto
the page backdrop. This allows controlling how transparency effects interact
within a set of objects (see ISO 32000-1 Section 11, ISO 32000-2 Section 11).22
Knockout groups (/K true) modify behavior further, causing elements within the
group to composite only against the group's initial backdrop, effectively "knocking
out" the contribution of other elements within the group that lie beneath them.
The rendering order specified in the content stream becomes absolutely critical when
transparency is involved. Changing the sequence of drawing operations can lead to
dramatically different visual results due to the cumulative nature of blending.
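A drastically simplified, single-component model of this compositing, covering only constant alpha and two blend modes (the full formulas in ISO 32000-1, 11.3 also involve the backdrop's alpha and group isolation):

```python
def composite(src, backdrop, alpha, blend="Normal"):
    """Composite one color component (0..1) over a backdrop with a
    constant source alpha. A simplified model for illustration."""
    blended = {"Normal": src, "Multiply": src * backdrop}[blend]
    # Weighted mix of the blended source with the backdrop
    return alpha * blended + (1 - alpha) * backdrop
```

Because each object is composited against whatever is already beneath it, evaluating these formulas in a different order produces different results, which is why drawing order matters.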
A robust reader should not simply crash when encountering such files. It needs error
handling and recovery mechanisms:
● Attempting to rebuild the XRef table by scanning the file for obj/endobj markers if
the startxref or /Prev links are invalid.
● Skipping over objects that cannot be parsed, potentially logging an error.
● Providing sensible default values or fallbacks for missing resources (e.g.,
substituting a default font if a declared font is missing and not embedded).
● Implementing lenient parsing where appropriate, tolerating minor syntax
deviations commonly produced by certain writers, while balancing this against the
risk of misinterpreting the content.
There is an inherent trade-off between strict adherence to the standard and the
practical need to open and render the vast number of imperfect PDFs found in the
real world.
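The first recovery strategy above, rescanning the file for object markers, can be sketched with a regular expression; it is a heuristic, and can be fooled by "N G obj" sequences occurring inside strings or stream data:

```python
import re

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

def rebuild_xref(data: bytes):
    """Recover object offsets from a file with a broken XRef by
    scanning for 'N G obj' markers. Later definitions overwrite
    earlier ones, roughly matching incremental-update semantics."""
    xref = {}
    for m in OBJ_RE.finditer(data):
        num, gen = int(m.group(1)), int(m.group(2))
        xref[(num, gen)] = m.start()
    return xref
```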
B. Complex Transparency and Color Management
As discussed (Insight 9), correctly implementing the full transparency model is
challenging.22 Edge cases involving interactions between multiple transparent layers,
different blend modes, nested transparency groups (isolated vs. knockout), and
complex soft masks can be particularly difficult to get right and computationally
expensive.
Accurate color management is another significant challenge. PDFs can utilize a wide
variety of color spaces: device-dependent spaces (DeviceGray, DeviceRGB,
DeviceCMYK), CIE-based spaces (CalGray, CalRGB, Lab), ICCBased spaces (using
embedded ICC profiles), and special spaces (Indexed, Pattern, Separation, DeviceN).22
To render colors consistently across different display devices and printers, the reader
must:
● Correctly interpret color values according to the active color space.
● Perform color transformations, potentially using ICC profiles or standard
colorimetric calculations, to map colors from the PDF's source color space to the
output device's color profile (gamut).
● Respect the specified rendering intent (e.g., Perceptual, Relative Colorimetric)
which guides how out-of-gamut colors are handled.
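As one example of the device-dependent case, a naive DeviceCMYK-to-DeviceRGB conversion is shown below. This is one common approximation, not a colorimetrically accurate transform; real color management would go through ICC profiles and the rendering intent:

```python
def cmyk_to_rgb(c, m, y, k):
    """Naive DeviceCMYK -> RGB approximation: each RGB component is
    attenuated by its complementary ink plus the black channel."""
    return ((1 - c) * (1 - k),
            (1 - m) * (1 - k),
            (1 - y) * (1 - k))
```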
While a basic reader might focus solely on rendering the static appearance,
supporting interactivity (even just following links or displaying form field appearances)
significantly enhances usability. Full support requires implementing event handling,
form field logic, potentially a JavaScript engine, and signature validation capabilities.
PDFs can also be encrypted to protect their content.1 Standard security handlers use
passwords (user password to open, owner password to restrict permissions like
printing or editing) and encryption algorithms (historically RC4, now predominantly
AES-128 or AES-256 for PDF 2.0 1).14 A reader must detect encrypted files, prompt for
a password if needed (for user password-protected files), and implement the
specified decryption algorithms to access the content. Handling different revisions of
the security handler and potential third-party encryption schemes adds further
complexity.
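For the legacy case, the RC4 cipher itself is compact; note that deriving the per-object file key from the password (an MD5-based procedure in PDF's standard security handler) is a separate step not shown here:

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """RC4 stream cipher, as used by legacy PDF standard security
    handlers (modern files use AES). Encryption and decryption are
    the same operation."""
    # Key-scheduling algorithm (KSA)
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Pseudo-random generation algorithm (PRGA)
    out, i, j = bytearray(), 0, 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)
```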
PDF 1.7 introduced a formal extension mechanism allowing vendors to add custom
data and features beyond the core ISO standard, using registered prefixes (e.g.,
Adobe uses ADBE) within an /Extensions dictionary.13 While a reader is not obligated
to understand or render vendor-specific extensions, it should ideally parse and ignore
unknown extensions gracefully rather than failing.13 The intended public registry for
these extensions appears not to be actively maintained, adding ambiguity.13
The PDF ecosystem is vast, diverse, and constantly evolving. Files encountered may
be decades old, created by a multitude of writing applications of varying quality,
potentially using complex features like advanced transparency or encryption, or even
containing vendor-specific extensions.1 The standard itself evolves, with PDF 2.0
clarifying ambiguities and adding new capabilities.4 Consequently, building a truly
robust, general-purpose PDF reader demands more than a simple implementation of
the core specification. It requires significant investment in defensive programming,
sophisticated error recovery strategies (e.g., for malformed XRefs or objects), graceful
degradation for unsupported or unknown features, and potentially continuous
adaptation to specification updates and common implementation quirks observed in
real-world files. Developers must make conscious decisions about the trade-offs
between strict standards conformance and the pragmatic handling of imperfect but
prevalent documents.
Achieving the goal of page structure preservation necessitates addressing both the
visual layout and the underlying logical organization. Relying solely on rendering
commands is insufficient; understanding paragraph breaks, column flow, and reading
order requires dedicated mechanisms like Tagged PDF parsing or layout analysis
heuristics.
Works cited