PDF Reader From Scratch
Subsequently, ISO published ISO 32000-2:2017, defining PDF 2.0, which introduced
new features and provided critical clarifications and corrections to ambiguities in the
PDF 1.7 specification.12 A dated revision, ISO 32000-2:2020, incorporates further
improvements identified during early adoption.12 Access to PDF 2.0 specifications was
initially restricted but is now available at no cost via the PDF Association, thanks to
industry sponsorship.12 For developers building a reader, particularly one aiming for
broad compatibility and robustness, consulting both ISO 32000-1 (for PDF 1.x files)
and ISO 32000-2 (for PDF 2.0 and clarifications applicable to earlier versions) is
essential.12 These documents provide the authoritative definition of PDF syntax,
objects, operators, and rendering rules.2
PDF files are designed to be read starting from the end. The reader locates the
startxref value, jumps to the specified XRef table, reads the object locations, and uses
the /Root reference in the trailer to find the Document Catalog, which serves as the
entry point to the document's page structure and resources.16
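As a concrete illustration, locating the startxref value can be sketched as follows; find_startxref is a hypothetical helper operating on the raw file bytes, not part of any standard API:

```python
import re

def find_startxref(data: bytes) -> int:
    """Locate the startxref value near the end of a PDF file.

    Scans the last 1 KB of the raw bytes for the startxref keyword
    followed by the byte offset and the %%EOF marker. A sketch: a
    robust reader would tolerate trailing junk and broken EOF lines.
    """
    tail = data[-1024:]
    m = re.search(rb"startxref\s+(\d+)\s+%%EOF", tail)
    if m is None:
        raise ValueError("startxref not found; XRef may need rebuilding")
    return int(m.group(1))
```

The returned offset is then the seek target for reading the XRef table or stream.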
Indirect objects are fundamental to PDF's structure. They are defined using an object
number, a generation number (usually 0 for new objects), and the obj and endobj
keywords (e.g., 4 0 obj... endobj).16 They are referenced using the object number,
generation number, and the R keyword (e.g., 4 0 R).17 This mechanism allows objects,
such as fonts or images used on multiple pages, to be defined once and reused,
reducing file size.19 It also facilitates incremental updates, where modified objects can
be appended to the file along with a new XRef section without rewriting the entire
document.10
Once at the specified offset, the parser reads the XRef information. A traditional XRef
section starts with the keyword xref, followed by one or more subsection headers.
Each header consists of two numbers: the starting object number for that subsection
and the count of entries in it.15 Following the header are the entries themselves, each
exactly 20 bytes long (including the line terminator).15 Each entry corresponds to one
object number and contains:
1. A 10-digit byte offset from the start of the file to the beginning of the object
definition.
2. A space.
3. A 5-digit generation number.
4. A space.
5. A keyword: n if the object is in use, or f if the object is free (deleted).15
The first object (number 0) is always marked free (f) and has a generation number of
65535; it serves as the head of the linked list of free objects.15
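The fixed-format table described above can be parsed with a short routine. The following sketch assumes the section has already been split into decoded text lines; a production parser would work on raw bytes and enforce the exact 20-byte entry length:

```python
def parse_xref_table(lines):
    """Parse a traditional cross-reference section.

    `lines` are the text lines starting at the `xref` keyword.
    Returns {object_number: (offset, generation, in_use)}.
    """
    assert lines[0].strip() == "xref"
    entries, i = {}, 1
    # Each subsection header is "start count"; stop at `trailer`.
    while i < len(lines) and lines[i].strip() and lines[i].split()[0].isdigit():
        start, count = (int(t) for t in lines[i].split())
        i += 1
        for obj_num in range(start, start + count):
            offset, gen, kind = lines[i].split()
            entries[obj_num] = (int(offset), int(gen), kind == "n")
            i += 1
    return entries
```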
Starting with PDF 1.5, an alternative, more compact format called an XRef stream was
introduced. This encodes the cross-reference information within a stream object
itself, offering compression and greater flexibility than the fixed-format table. Parsing
an XRef stream involves decoding the stream content (often using /FlateDecode) and
interpreting its structured data according to rules defined in the specification (ISO
32000-1, Section 7.5.8). A reader must be prepared to handle both traditional XRef
tables and XRef streams.
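Interpreting the decoded entry data of an XRef stream depends on the /W (field widths) and /Index (subsection ranges) entries of its stream dictionary. A minimal sketch, assuming the stream content has already been decompressed:

```python
def parse_xref_stream(data: bytes, w, index_pairs):
    """Interpret decoded XRef-stream data per ISO 32000-1, 7.5.8.

    `w` is the /W array of field widths; `index_pairs` the /Index
    array of [first count ...] pairs (defaults to [0, Size]).
    A width of 0 means the field takes its default value; for the
    first field (the entry type) that default is 1, "in use".
    Returns {obj_num: (type, field2, field3)}.
    """
    out, pos = {}, 0
    for first, count in zip(index_pairs[::2], index_pairs[1::2]):
        for obj in range(first, first + count):
            fields = []
            for width in w:
                fields.append(int.from_bytes(data[pos:pos + width], "big"))
                pos += width
            if w[0] == 0:
                fields[0] = 1  # default entry type
            out[obj] = tuple(fields)
    return out
```

For type-1 entries, field 2 is the byte offset and field 3 the generation number; type-2 entries instead reference an object stream and an index within it.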
The parser then performs a file seek operation to position the file pointer at that byte
offset.15 It reads the object definition, starting with the object number, generation
number, and obj keyword, and ending with the endobj keyword.16 The content between
obj and endobj is then parsed according to the basic PDF object syntax rules
(detecting Booleans, Numbers, Strings, Names, Arrays, Dictionaries, or Streams).14
To optimize performance, especially for objects referenced multiple times (like shared
resource dictionaries or fonts), it is common practice to cache parsed indirect objects
in memory. Subsequent references to the same object can then retrieve the cached
version, avoiding redundant file I/O and parsing.
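Such a cache might look like the following sketch, where parse_object_at is a hypothetical callback that parses the object found at a given byte offset:

```python
class ObjectStore:
    """Cache for parsed indirect objects, keyed by (number, generation).

    `xref` maps (num, gen) -> byte offset; `parse_object_at` is a
    hypothetical parser invoked at most once per object.
    """
    def __init__(self, xref, parse_object_at):
        self.xref = xref
        self.parse_object_at = parse_object_at
        self.cache = {}

    def resolve(self, num, gen=0):
        """Resolve a reference like `4 0 R`, parsing on first use only."""
        key = (num, gen)
        if key not in self.cache:
            self.cache[key] = self.parse_object_at(self.xref[key])
        return self.cache[key]
```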
Crucially, the stream dictionary contains a /Length entry specifying the number of
bytes in the (potentially encoded) stream data.15 It may also contain a /Filter entry,
whose value is either a single Name object or an Array of Name objects identifying the
filter(s) applied.15 If multiple filters are specified, they must be applied in the order
listed to decode the data.15 An optional /DecodeParms entry (a dictionary or an array
of dictionaries) provides parameters specific to each filter, if needed.15
The parser reads the number of raw bytes specified by /Length and then applies the
corresponding decoding algorithm(s) indicated by /Filter. Common filters include:
● /FlateDecode: Decompresses data using the zlib/deflate algorithm (very common
for text and image data).15
● /LZWDecode: Uses Lempel-Ziv-Welch compression.15
● /ASCIIHexDecode: Decodes data represented as ASCII hexadecimal characters.15
● /ASCII85Decode: Decodes ASCII base-85 encoded data, more compact than
hex.15
● /CCITTFaxDecode: Decompresses image data using CCITT Group 3 or Group 4
fax standards.15
● /DCTDecode: Decompresses JPEG baseline image data.15
● /RunLengthDecode: Simple run-length encoding scheme.15
● /JBIG2Decode: For bi-level (black and white) image data, often used in scanned
documents.15
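Chained filters are applied left to right as listed in /Filter. The sketch below implements only two of the filters above using Python's standard library; the others require dedicated codecs:

```python
import binascii
import zlib

def decode_stream(raw: bytes, filters):
    """Apply a /Filter chain in listed order.

    Only /FlateDecode and /ASCIIHexDecode are sketched here; a real
    reader needs the full filter set (LZW, CCITT, DCT, JBIG2, ...).
    """
    decoders = {
        "/FlateDecode": zlib.decompress,
        "/ASCIIHexDecode": lambda b: binascii.unhexlify(
            b.replace(b" ", b"").replace(b"\n", b"").rstrip(b">")),
    }
    for name in filters:  # first-listed filter is applied first
        raw = decoders[name](raw)
    return raw
```

Note that /DecodeParms (e.g. PNG predictors for /FlateDecode) would add a post-processing step per filter, omitted here.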
The process of parsing a PDF reveals its non-linear nature. Locating the XRef requires
reading the end of the file 17, potentially following a chain of /Prev links backward
through updates 10, and then jumping to various byte offsets scattered throughout the
file based on the XRef data.15 Parsing a single object, like a Page dictionary, often
necessitates resolving indirect references within it, leading to the parsing of other
objects, such as its /Contents stream or its /Resources dictionary.16 Furthermore,
interpreting a stream object requires parsing its dictionary first to determine its length
and any required decoding filters.15 The introduction of object streams adds another
layer of indirection, where parsing the container stream is a prerequisite to accessing
the individual objects compressed within it.15 This interconnectedness and
dependency imply that a PDF parser cannot operate purely sequentially. It must
handle object loading on demand, potentially triggering further lookups and parsing
steps in a recursive or iterative manner. Careful state management and robustness
against broken links or invalid object definitions are crucial for successfully navigating
the potentially complex graph structure of a PDF document.
Text rendering is governed by a specific subset of the graphics state known as the
text state.21 Key parameters within the text state include the current font and size (set
by Tf), character spacing, word spacing, horizontal scaling, text leading (line spacing),
and the text rendering mode (e.g., fill, stroke, clip).
Text elements are typically enclosed within a text object, delimited by the BT (Begin
Text) and ET (End Text) operators.16 Most operators that affect the text state or draw
text glyphs are only valid between BT and ET.
Several core operators are used for positioning and drawing text:
● Tf (Set font): Takes a font resource name (defined in the page's /Resources) and
a size as operands (e.g., /F1 12 Tf) to set the active font and text size.16
● Text Positioning Operators (Td, TD, Tm, T*): These operators manipulate the
text matrix, which controls the position and orientation of text being drawn. Td
moves the start of the next line relative to the current line start. TD is similar but
also sets the leading. T* moves to the start of the next line using the current
leading. Tm directly sets the text matrix.21
● Tj (Show text): Takes a string operand and draws the corresponding glyphs at
the current text position, advancing the position after each glyph based on its
width.
● TJ (Show text with adjustments): Takes an array operand containing strings and
numeric adjustments. It draws the strings, applying the specified numeric
adjustments (in thousandths of text space units) between them. This allows for
precise control over glyph spacing and kerning (e.g., [(W) -120 (o) -110 (r) -110 (l)
-110 (d)] TJ).
● ' (Move to next line and show text): Equivalent to T* followed by Tj.
● " (Set word/char spacing, move to next line, show text): Equivalent to setting
word and character spacing, then performing '.
Within a text object (BT/ET), glyphs are positioned relative to the origin using the text
matrix (Tm). This matrix is initialized at the start of the text object and updated by
operators like Td, TD, T*, and Tm itself.21 When a text-showing operator like Tj or TJ is
encountered, the position of each glyph is determined relative to the current text
matrix.
For a Tj operator showing a string, the first glyph is placed at the origin defined by the
text matrix. Subsequent glyphs are placed sequentially, with the text matrix being
updated after each glyph based on its width (obtained from the font's metrics) and
any active character or word spacing parameters. The TJ operator provides more
explicit control, allowing numeric adjustments to be inserted between strings or
individual characters within the array operand, directly modifying the spacing relative
to the implicit width-based advancement.
To calculate the final, absolute position of a glyph on the page canvas (necessary for
rendering or layout analysis), its position in text space must be transformed first by
the current text matrix and then by the CTM that was active when the text object
began. The effective transformation is FinalMatrix = TextMatrix * CTM. Tracking both
matrices accurately is crucial.
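The two computations described above, the matrix product and the per-glyph advance, can be sketched as follows. The matrices use the six-element [a b c d e f] row form of cm and Tm, and glyph_advance follows the displacement formula of ISO 32000-1, 9.4.4:

```python
def mat_mul(a, b):
    """Multiply two PDF matrices [a b c d e f], applying `a` first."""
    a0, a1, a2, a3, a4, a5 = a
    b0, b1, b2, b3, b4, b5 = b
    return [a0*b0 + a1*b2,      a0*b1 + a1*b3,
            a2*b0 + a3*b2,      a2*b1 + a3*b3,
            a4*b0 + a5*b2 + b4, a4*b1 + a5*b3 + b5]

def glyph_advance(width_1000, font_size, char_spacing=0.0,
                  word_spacing=0.0, h_scale=1.0, tj_adjust=0.0):
    """Horizontal displacement after showing one glyph:
        tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th
    where w0 is the glyph width in text space (widths are stored in
    1/1000 units). Word spacing applies only to byte 32 in simple
    fonts; enforcing that is the caller's job in this sketch.
    """
    return ((width_1000 / 1000.0 - tj_adjust / 1000.0) * font_size
            + char_spacing + word_spacing) * h_scale
```

mat_mul(text_matrix, ctm) then yields the device-space placement matrix for the next glyph.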
A key aspect of PDF text handling is that glyph positions are often implicit and
cumulative, rather than explicitly defined for every character.21 This stems from PDF's
PostScript heritage, where drawing commands sequentially modify a current point or
state.2 Operators like Td and T* specify movement relative to the current text line's
start or the previous position.21 The standard Tj operator relies entirely on the defined
widths of glyphs in the current font to determine how far to advance the text position
after drawing each character. The TJ operator allows fine-tuning this implicit advance
with explicit numeric adjustments. This cumulative approach is efficient for file size
but makes position extraction complex. A reader cannot simply find Tj operators; it
must simulate the stateful execution of the content stream within BT/ET blocks. This
involves maintaining the current text matrix and CTM, parsing operators sequentially,
accessing font metric information (glyph widths) to calculate advances, applying
spacing parameters, and performing the matrix multiplications (TextMatrix * CTM) to
determine the final coordinates for each glyph rendered by Tj or TJ. Failure to
accurately model this cumulative process will result in incorrect text positioning and
layout.
A font dictionary object contains crucial information about the font, including:
● /Type: Must be /Font.
● /Subtype: Specifies the font technology (e.g., /Type1, /TrueType, /Type3, /Type0
for composite fonts).
● /BaseFont: The PostScript name of the font (e.g., /Times-Roman).
● /Encoding: Defines the mapping from character codes (bytes) in the text strings
(Tj/TJ) to glyph names or character identifiers (CIDs) within the font. This can be a
predefined encoding name (e.g., /WinAnsiEncoding) or a reference to a custom
Encoding dictionary.
● /FirstChar, /LastChar, /Widths: For simple fonts, these define the range of
character codes covered and provide an array of corresponding glyph widths.
● /FontDescriptor: A reference to another dictionary containing detailed font
metrics (e.g., ascent, descent, cap height, italic angle, stem widths, font bounding
box) and potentially a reference to an embedded font program stream (e.g.,
/FontFile, /FontFile2, /FontFile3). (See ISO 32000, Section 9).
This information is critical. For rendering, the reader needs the correct font program
(either embedded or located on the system), the /Encoding to map character codes
to glyphs, and the metrics (/Widths, FontDescriptor data) for accurate positioning and
line spacing.21 For text extraction, especially aiming for structure preservation,
knowing the font name, size (from Tf), and style attributes (often inferred from the
/BaseFont name or /FontDescriptor flags like /ItalicAngle) helps in reconstructing the
visual appearance and identifying logical elements like headings. Precise positioning
for layout analysis relies heavily on accurate glyph widths derived from the /Widths
array or the font program itself.
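Looking up a glyph width for a simple font is a small but essential routine; the missing_width parameter stands in for the descriptor's /MissingWidth and is an assumption of this sketch:

```python
def glyph_width(code, first_char, widths, missing_width=0):
    """Width in 1/1000 text-space units for character `code` of a
    simple font, from its /FirstChar and /Widths array. Codes outside
    the covered range fall back to the descriptor's /MissingWidth.
    """
    idx = code - first_char
    if 0 <= idx < len(widths):
        return widths[idx]
    return missing_width
```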
Once a path (or multiple subpaths) is constructed, it can be rendered using painting
operators:
● S (Stroke): Draws the path outlines using the current stroke color and line style
parameters.
● f or F (Fill): Fills the interior of the path using the current non-stroking color,
applying the non-zero winding number rule to determine the inside region.
● f* (eoFill): Fills the path using the even-odd rule.
● B, B*, b, b*: Various combinations of filling and stroking the path simultaneously.
These operators typically consume coordinate operands or other parameters from the
content stream stack (e.g., 100 200 m 300 200 l S draws a horizontal line).
Crucially, the graphics state, including the CTM, can be saved onto a stack using the q
(save graphics state) operator and restored using the Q (restore graphics state)
operator. This allows transformations or other state changes (like setting a clipping
path) to be applied locally to a specific set of drawing operations without affecting
the surrounding content.
The graphics state in PDF operates in a modal and hierarchical fashion. Operators like
w, J, rg, cm, or gs (set graphics state from ExtGState dictionary resource) modify the
current state, and these settings persist, affecting all subsequent drawing operations
until they are explicitly changed again or the graphics state is popped from the stack
using the Q operator.19 The q operator pushes the entire current graphics state onto a
stack.10 This modality reduces redundancy, as parameters don't need to be repeated
for every command.21 The q/Q mechanism enables localized changes; for example,
one might save the state (q), apply a rotation (cm), draw an object, and then restore
the state (Q) to revert the rotation before drawing subsequent objects.10 Similarly, the
gs operator allows complex sets of parameters (like transparency settings) stored in
an /ExtGState resource dictionary to be applied with a single command.19 This stateful,
stack-based model requires the renderer to meticulously track all components of the
current graphics state – including the CTM, current colors and color spaces, line
styles, text state parameters, clipping path, transparency settings, etc. – and manage
the graphics state stack correctly. Errors in state management are a common source
of rendering bugs, leading to incorrect positions, sizes, colors, or styles.
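A minimal model of the q/Q stack, tracking only a few of the many graphics-state parameters:

```python
import copy

class GraphicsState:
    """Tiny subset of the PDF graphics state; the real state also
    carries color spaces, dash pattern, clip path, text state, etc."""
    def __init__(self):
        self.ctm = [1, 0, 0, 1, 0, 0]
        self.line_width = 1.0
        self.fill_color = (0.0, 0.0, 0.0)

class Renderer:
    def __init__(self):
        self.gs = GraphicsState()
        self.stack = []

    def op_q(self):
        # q: push a deep copy of the entire current state
        self.stack.append(copy.deepcopy(self.gs))

    def op_Q(self):
        # Q: discard the current state, restore the saved one
        self.gs = self.stack.pop()
```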
If a font is not embedded, the PDF reader must attempt to find a suitable substitute
font on the host operating system. This relies on matching the font name specified in
the PDF (e.g., /BaseFont) with available system fonts. This approach is inherently less
reliable. A matching font might not be found, or a font with the same name might have
different character metrics or glyph shapes, leading to incorrect text layout, spacing,
line breaks, or even missing characters (often displayed as '.notdef' glyphs or blanks).
While the specification defines 14 "Standard Type 1 Fonts" (e.g., Times-Roman,
Helvetica, Courier, Symbol, ZapfDingbats) that viewers were historically expected to
provide, modern best practice strongly favors embedding all necessary fonts or font
subsets for maximum portability and fidelity.1
To minimize file size impact, PDF supports font subsetting. Instead of embedding the
entire font file (which can be large), only the data for the glyphs actually used within
the document is included.1 The font name is typically modified (e.g., prefixed with six
random letters and a plus sign) to indicate it's a subset and avoid conflicts with system
fonts.
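Detecting and stripping such a subset tag is straightforward; the six-uppercase-letters-plus-sign convention is the one described above:

```python
import re

SUBSET_TAG = re.compile(r"^[A-Z]{6}\+")

def strip_subset_tag(base_font: str) -> str:
    """Remove the subset prefix (e.g. 'ABCDEF+') from a /BaseFont
    name, if present, leaving the underlying font name."""
    return SUBSET_TAG.sub("", base_font)
```

The stripped name is what a reader would match against system fonts when substitution is unavoidable.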
B. Font Encodings
An encoding defines the crucial mapping between the numerical character codes
used within text strings (operands to Tj and TJ) and the actual glyph descriptions
within the font program. A single byte in a string might map to the glyph 'A' under one
encoding but '€' or an accented character under another.
For reliable text extraction (copy-paste, search, accessibility), the reader needs to
reverse this mapping. The optional /ToUnicode entry in a font dictionary references a
special CMap that provides a mapping from character codes back to standard
Unicode values. This is particularly important for fonts with custom encodings or
symbolic fonts where the character codes don't correspond to standard values. (See
ISO 32000-1, Section 9.10.2).
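Extracting the simplest form of these mappings, beginbfchar sections, can be sketched as follows; this deliberately ignores bfrange sections and multi-code sources, which a complete CMap parser must also handle:

```python
import re

def parse_bfchar(cmap_text: str):
    """Extract code -> Unicode mappings from beginbfchar/endbfchar
    sections of a /ToUnicode CMap. A sketch of the idea only."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            code = int(src, 16)
            # The destination encodes one or more UTF-16BE code units.
            mapping[code] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping
```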
These heuristic methods are complex to implement robustly and can be prone to
errors, especially with unconventional layouts or poorly structured documents.
It achieves this through a Structure Tree Root, referenced from the document
Catalog. This tree organizes page content items (text fragments, images, vector
graphics) into a hierarchy of logical elements, using standard structure types
analogous to HTML tags, such as:
● <P> for paragraphs
● <H1>, <H2>, etc. for headings
● <L> for lists (containing <LI> list items)
● <Table> for tables (containing <TR>, <TH>, <TD>)
● <Figure> for images or graphics
● <Artifact> for purely decorative elements that are not part of the logical content.
However, creating Tagged PDFs is optional for PDF writers, and support among
consuming devices (including AT) has historically been uneven, partly due to
ambiguities in early specifications.1 While PDF 2.0 improved the definitions 1, many
existing PDFs are untagged. Therefore, a general-purpose PDF reader aiming for
structure preservation still needs to implement heuristic layout analysis as a fallback
for untagged documents.
Two other crucial aspects of the graphics state are clipping and transparency:
● Clipping Paths: Any path constructed using operators like m, l, c, re can be
designated as the current clipping path using the W (non-zero winding rule) or W*
(even-odd rule) operators. Subsequent drawing operations will only affect the
areas of the page inside the clipping path. The clipping path is part of the
graphics state and is saved and restored by q and Q.
● Transparency: Introduced in PDF 1.4, the transparency model allows objects to
be drawn partially or fully transparent, blending with the content beneath them.
Key concepts include:
○ Alpha: Objects can have constant stroking (/CA) and non-stroking (/ca) alpha
values, controlling their opacity. These are typically set via the gs operator
referencing an /ExtGState dictionary.
○ Blend Modes (/BM): Define how the color of a transparent object interacts
with the color of the background (e.g., /Normal, /Multiply, /Screen, /Overlay).
Set via /ExtGState.
○ Soft Masks (/SMask): Allow the transparency of an object to be determined
spatially by the luminosity or alpha values of another graphical object (often
an image or a transparency group), enabling effects like feathered edges or
complex masking. Set via /ExtGState.
○ Transparency Groups: A sequence of objects can be grouped together
(using a /Group dictionary with /S /Transparency). The objects within the
group are composited against each other in an isolated context (often an
offscreen buffer), and the final result of the group is then composited onto
the page backdrop. This allows controlling how transparency effects interact
within a set of objects (see ISO 32000-1 Section 11, ISO 32000-2 Section 11).22
Knockout groups (/K true) modify behavior further, causing elements within the
group to composite only against the group's initial backdrop, effectively "knocking
out" the contribution of other elements within the group that lie beneath them.
The rendering order specified in the content stream becomes absolutely critical when
transparency is involved. Changing the sequence of drawing operations can lead to
dramatically different visual results due to the cumulative nature of blending.
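A drastically simplified, single-component model of this compositing, covering only constant alpha and two blend modes (the full formulas in ISO 32000-1, 11.3 also involve the backdrop's alpha and group isolation):

```python
def composite(src, backdrop, alpha, blend="Normal"):
    """Composite one color component (0..1) over a backdrop with a
    constant source alpha. A simplified model for illustration."""
    blended = {"Normal": src, "Multiply": src * backdrop}[blend]
    # Weighted mix of the blended source with the backdrop
    return alpha * blended + (1 - alpha) * backdrop
```

Because each object is composited against whatever is already beneath it, evaluating these formulas in a different order produces different results, which is why drawing order matters.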
A robust reader should not simply crash when encountering such files. It needs error
handling and recovery mechanisms:
● Attempting to rebuild the XRef table by scanning the file for obj/endobj markers if
the startxref or /Prev links are invalid.
● Skipping over objects that cannot be parsed, potentially logging an error.
● Providing sensible default values or fallbacks for missing resources (e.g.,
substituting a default font if a declared font is missing and not embedded).
● Implementing lenient parsing where appropriate, tolerating minor syntax
deviations commonly produced by certain writers, while balancing this against the
risk of misinterpreting the content.
There is an inherent trade-off between strict adherence to the standard and the
practical need to open and render the vast number of imperfect PDFs found in the
real world.
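The first recovery strategy above, rescanning the file for object markers, can be sketched with a regular expression; it is a heuristic, and can be fooled by "N G obj" sequences occurring inside strings or stream data:

```python
import re

OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")

def rebuild_xref(data: bytes):
    """Recover object offsets from a file with a broken XRef by
    scanning for 'N G obj' markers. Later definitions overwrite
    earlier ones, roughly matching incremental-update semantics."""
    xref = {}
    for m in OBJ_RE.finditer(data):
        num, gen = int(m.group(1)), int(m.group(2))
        xref[(num, gen)] = m.start()
    return xref
```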
B. Complex Transparency and Color Management
As discussed (Insight 9), correctly implementing the full transparency model is
challenging.22 Edge cases involving interactions between multiple transparent layers,
different blend modes, nested transparency groups (isolated vs. knockout), and
complex soft masks can be particularly difficult to get right and computationally
expensive.
Accurate color management is another significant challenge. PDFs can utilize a wide
variety of color spaces: device-dependent spaces (DeviceGray, DeviceRGB,
DeviceCMYK), CIE-based spaces (CalGray, CalRGB, Lab), ICCBased spaces (using
embedded ICC profiles), and special spaces (Indexed, Pattern, Separation, DeviceN).22
To render colors consistently across different display devices and printers, the reader
must:
● Correctly interpret color values according to the active color space.
● Perform color transformations, potentially using ICC profiles or standard
colorimetric calculations, to map colors from the PDF's source color space to the
output device's color profile (gamut).
● Respect the specified rendering intent (e.g., Perceptual, Relative Colorimetric)
which guides how out-of-gamut colors are handled.
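As one example of the device-dependent case, a naive DeviceCMYK-to-DeviceRGB conversion is shown below. This is one common approximation, not a colorimetrically accurate transform; real color management would go through ICC profiles and the rendering intent:

```python
def cmyk_to_rgb(c, m, y, k):
    """Naive DeviceCMYK -> RGB approximation: each RGB component is
    attenuated by its complementary ink plus the black channel."""
    return ((1 - c) * (1 - k),
            (1 - m) * (1 - k),
            (1 - y) * (1 - k))
```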
While a basic reader might focus solely on rendering the static appearance,
supporting interactivity (even just following links or displaying form field appearances)
significantly enhances usability. Full support requires implementing event handling,
form field logic, potentially a JavaScript engine, and signature validation capabilities.
PDFs can also be encrypted to protect their content.1 Standard security handlers use
passwords (user password to open, owner password to restrict permissions like
printing or editing) and encryption algorithms (historically RC4, now predominantly
AES-128 or AES-256 for PDF 2.0 1).14 A reader must detect encrypted files, prompt for
a password if needed (for user password-protected files), and implement the
specified decryption algorithms to access the content. Handling different revisions of
the security handler and potential third-party encryption schemes adds further
complexity.
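For the legacy case, the RC4 cipher itself is compact; note that deriving the per-object file key from the password (an MD5-based procedure in PDF's standard security handler) is a separate step not shown here:

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """RC4 stream cipher, as used by legacy PDF standard security
    handlers (modern files use AES). Encryption and decryption are
    the same operation."""
    # Key-scheduling algorithm (KSA)
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]
    # Pseudo-random generation algorithm (PRGA)
    out, i, j = bytearray(), 0, 0
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)
```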
PDF 1.7 introduced a formal extension mechanism allowing vendors to add custom
data and features beyond the core ISO standard, using registered prefixes (e.g.,
Adobe uses ADBE) within an /Extensions dictionary.13 While a reader is not obligated
to understand or render vendor-specific extensions, it should ideally parse and ignore
unknown extensions gracefully rather than failing.13 The intended public registry for
these extensions appears not to be actively maintained, adding ambiguity.13
The PDF ecosystem is vast, diverse, and constantly evolving. Files encountered may
be decades old, created by a multitude of writing applications of varying quality,
potentially using complex features like advanced transparency or encryption, or even
containing vendor-specific extensions.1 The standard itself evolves, with PDF 2.0
clarifying ambiguities and adding new capabilities.4 Consequently, building a truly
robust, general-purpose PDF reader demands more than a simple implementation of
the core specification. It requires significant investment in defensive programming,
sophisticated error recovery strategies (e.g., for malformed XRefs or objects), graceful
degradation for unsupported or unknown features, and potentially continuous
adaptation to specification updates and common implementation quirks observed in
real-world files. Developers must make conscious decisions about the trade-offs
between strict standards conformance and the pragmatic handling of imperfect but
prevalent documents.
Achieving the goal of page structure preservation necessitates addressing both the
visual layout and the underlying logical organization. Relying solely on rendering
commands is insufficient; understanding paragraph breaks, column flow, and reading
order requires dedicated mechanisms like Tagged PDF parsing or layout analysis
heuristics.
Works cited