Portable Document Format: History and Standardization Technical Foundations Technical Overview
PDF
The Portable Document Format (PDF) is a file format developed by Adobe in the 1990s to present
documents, including text formatting and images, in a manner independent of application software,
hardware, and operating systems.[2][3] Based on the PostScript language, each PDF file encapsulates a
complete description of a fixed-layout flat document, including the text, fonts, vector graphics, raster
images and other information needed to display it. PDF was standardized as ISO 32000 in 2008, and
no longer requires any royalties for its implementation.[4]
PDF files may contain a variety of content besides flat text and graphics, including logical structuring
elements, interactive elements such as annotations and form fields, layers, rich media (including video
content), three-dimensional objects using U3D or PRC, and various other data formats. The PDF
specification also provides for encryption and digital signatures, file attachments and metadata to
enable workflows requiring these features.
Contents
History and standardization
Technical foundations
PostScript
Technical overview
File structure
Imaging model
Vector graphics
Raster images
Text
Fonts
Standard Type 1 Fonts (Standard 14 Fonts)
Encodings
Transparency
Interactive elements
AcroForms

Filename extension: .pdf[note 1]
Internet media type: application/pdf,[1] application/x-pdf, application/x-bzpdf, application/x-gzpdf
Type code: 'PDF '[1] (including a single space)
Uniform Type Identifier (UTI): com.adobe.pdf
Magic number: %PDF
Developed by: Adobe Inc. (1993–2008)
https://en.wikipedia.org/wiki/PDF 1/23
19/8/2020 PDF - Wikipedia
History and standardization
Adobe Systems made the PDF specification available free of charge in 1993. In the early years PDF was popular mainly in desktop
publishing workflows, and competed with a variety of formats such as DjVu, Envoy, Common Ground Digital Paper, Farallon Replica and
even Adobe's own PostScript format.
PDF was a proprietary format controlled by Adobe until it was released as an open standard on July 1, 2008, and published by the
International Organization for Standardization as ISO 32000-1:2008,[5][6] at which time control of the specification passed to an ISO
Committee of volunteer industry experts. In 2008, Adobe published a Public Patent License to ISO 32000-1 granting royalty-free rights for
all patents owned by Adobe that are necessary to make, use, sell, and distribute PDF-compliant implementations.[7]
PDF 1.7, the sixth edition of the PDF specification that became ISO 32000-1, includes some proprietary technologies defined only by
Adobe, such as Adobe XML Forms Architecture (XFA) and JavaScript extension for Acrobat, which are referenced by ISO 32000-1 as
normative and indispensable for the full implementation of the ISO 32000-1 specification. These proprietary technologies are not
standardized and their specification is published only on Adobe's website.[8][9][10][11][12] Many of them are also not supported by popular
third-party implementations of PDF.
On July 28, 2017, ISO 32000-2:2017 (PDF 2.0) was published.[13] ISO 32000-2 does not include any proprietary technologies as
normative references.[14]
Technical foundations
PDF combines three technologies:
A subset of the PostScript page description programming language, for generating the layout and graphics.
A font-embedding/replacement system to allow fonts to travel with the documents.
A structured storage system to bundle these elements and any associated content into a single file, with data compression where
appropriate.
PostScript
PostScript is a page description language run in an interpreter to generate an image, a process requiring many resources. It can handle
graphics and standard features of programming languages such as if and loop commands. PDF is largely based on PostScript but
simplified to remove flow control features like these, while graphics commands equivalent to lineto remain.
Often, the PostScript-like PDF code is generated from a source PostScript file. The graphics commands that are output by the PostScript
code are collected and tokenized. Any files, graphics, or fonts to which the document refers also are collected. Then, everything is
compressed to a single file. Therefore, the entire PostScript world (fonts, layout, measurements) remains intact.
As a document format, PDF has several advantages over PostScript:
PDF contains tokenized and interpreted results of the PostScript source code, for direct correspondence between changes to items in
the PDF page description and changes to the resulting page appearance.
PDF (from version 1.4) supports transparent graphics; PostScript does not.
PostScript is an interpreted programming language with an implicit global state, so instructions accompanying the description of one
page can affect the appearance of any following page. Therefore, all preceding pages in a PostScript document must be processed to
determine the correct appearance of a given page, whereas each page in a PDF document is unaffected by the others. As a result,
PDF viewers allow the user to quickly jump to the final pages of a long document, whereas a PostScript viewer needs to process all
pages sequentially before being able to display the destination page (unless the optional PostScript Document Structuring Conventions
have been carefully followed).
Technical overview
File structure
A PDF file is a 7-bit ASCII file, except for certain elements that may have binary content. A PDF file starts with a header containing the
magic number and the version of the format such as %PDF-1.7. The format is a subset of a COS ("Carousel" Object Structure) format.[15] A
COS tree file consists primarily of objects, of which there are eight types:[16]
Boolean values, representing true or false
Numbers (integer and real)
Strings, enclosed within parentheses ((...)) or written in hexadecimal within single angle brackets (<...>)
Names, starting with a forward slash (/)
Arrays, ordered collections of objects enclosed within square brackets ([...])
Dictionaries, collections of objects indexed by names, enclosed within double angle brackets (<<...>>)
Streams, usually containing large amounts of optionally compressed binary data, preceded by a dictionary and enclosed between the stream and endstream keywords
The null object
Furthermore, there may be comments, introduced with the percent sign (%). Comments may contain 8-bit characters.
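The header described above can be checked with a few lines of code. The following is a minimal sketch in Python, assuming the header sits at the very first byte of the data and uses single-digit major and minor version numbers; the helper name pdf_version is hypothetical and not part of any library.

```python
import re

# Matches the magic number followed by a version such as "1.7".
# Assumption: single-digit major and minor version numbers.
HEADER_RE = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data: bytes):
    """Return (major, minor) from a '%PDF-M.m' header, or None if absent."""
    match = HEADER_RE.match(data)  # match() anchors at the first byte
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# A tiny hand-written sample; the second line is a comment with 8-bit bytes.
sample = b"%PDF-1.7\n%\xe2\xe3\xcf\xd3\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
print(pdf_version(sample))  # (1, 7)
```

Real-world readers are often more lenient, for example by tolerating a few bytes before the header; this sketch does not attempt that.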
Objects may be either direct (embedded in another object) or indirect. Indirect objects are numbered with an object number and a
generation number and defined between the obj and endobj keywords if residing in the document root. Beginning with PDF version 1.5,
indirect objects (except other streams) may also be located in special streams known as object streams (marked /Type /ObjStm). This
technique enables non-stream objects to have standard stream filters applied to them, reduces the size of files that have large numbers of
small indirect objects and is especially useful for Tagged PDF. Object streams do not support specifying an object's generation number
(other than 0).
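To illustrate how indirect objects are delimited, here is a rough Python sketch that scans a byte string for "N G obj ... endobj" blocks and indexes them by object number and generation number. The function name find_indirect_objects is hypothetical. The approach is deliberately naive: it ignores object streams, and it would be confused by a stream whose binary data happens to contain the word endobj.

```python
import re

# "<object number> <generation number> obj <body> endobj"
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def find_indirect_objects(data: bytes):
    """Map (object number, generation number) to the raw object body."""
    return {
        (int(m.group(1)), int(m.group(2))): m.group(3).strip()
        for m in OBJ_RE.finditer(data)
    }


# Hand-written sample body with two indirect objects.
sample = (
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Count 0 /Kids [] >>\nendobj\n"
)
for key, body in find_indirect_objects(sample).items():
    print(key, body)
```

A production parser would instead locate objects through the cross-reference data rather than by scanning.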
An index table, also called the cross-reference table, is typically located near the end of the file and gives the byte offset of each indirect
object from the start of the file.[17] This design allows for efficient random access to the objects in the file, and also allows for small changes
to be made without rewriting the entire file (incremental update). Before PDF version 1.5, the table would always be in a special ASCII
format, be marked with the xref keyword, and follow the main body composed of indirect objects. Version 1.5 introduced optional cross-
reference streams, which have the form of a standard stream object, possibly with filters applied. Such a stream may be used instead of the
ASCII cross-reference table and contains the offsets and other information in binary format. The format is flexible in that it allows for
integer width specification (using the /W array), so that for example a document not exceeding 64 KiB in size may dedicate only 2 bytes for
object offsets.
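The role of the /W array can be sketched as follows. This Python example assumes the cross-reference stream has already been decoded (all filters removed) and that each row holds three big-endian integer fields, as laid out in ISO 32000-1; the function name decode_xref_stream_rows and the sample bytes are made up for illustration.

```python
def decode_xref_stream_rows(data: bytes, widths):
    """Split decoded cross-reference stream data into rows of integers.

    widths is the /W array: the byte width of each of the three fields
    (entry type, then two type-dependent values such as offset and generation).
    """
    row_size = sum(widths)
    if row_size == 0 or len(data) % row_size != 0:
        raise ValueError("data length is not a multiple of the row size")
    rows = []
    for start in range(0, len(data), row_size):
        pos = start
        fields = []
        for width in widths:
            # Fields are stored big-endian; a zero-width field yields 0.
            fields.append(int.from_bytes(data[pos:pos + width], "big"))
            pos += width
        if widths[0] == 0:
            fields[0] = 1  # a zero-width type field defaults to type 1
        rows.append(tuple(fields))
    return rows


# /W [1 2 1]: 1-byte type, 2-byte offset (enough for files under 64 KiB),
# 1-byte generation number.
sample = bytes.fromhex("000000ff" "01000f00" "01012c00")
print(decode_xref_stream_rows(sample, [1, 2, 1]))
# [(0, 0, 255), (1, 15, 0), (1, 300, 0)]
```

With two bytes per offset the largest representable value is 65535, which is why such a layout only suits documents that do not exceed 64 KiB.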
A PDF file ends with a footer containing the startxref keyword, followed by the byte offset of the start of the cross-reference table
(which begins with the xref keyword) or of the cross-reference stream object, and then the %%EOF end-of-file marker.
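A reader typically starts from the end of the file, so the startxref offset is the first thing it needs. The sketch below, in Python, searches the last bytes of the data for the final startxref ... %%EOF pair. The function name find_startxref and the 1024-byte search window are assumptions chosen for illustration, not values taken from the text above.

```python
import re

STARTXREF_RE = re.compile(rb"startxref\s+(\d+)\s+%%EOF")


def find_startxref(data: bytes, tail_size: int = 1024):
    """Return the byte offset given after the last startxref, or None."""
    tail = data[-tail_size:]
    matches = STARTXREF_RE.findall(tail)
    if not matches:
        return None
    # Incremental updates append new footers, so the last one wins.
    return int(matches[-1])


# Hand-written sample: the header is 9 bytes, so the table starts at offset 9.
sample = (
    b"%PDF-1.4\n"
    b"xref\n0 1\n0000000000 65535 f \n"
    b"trailer\n<< /Size 1 >>\n"
    b"startxref\n9\n%%EOF\n"
)
offset = find_startxref(sample)
print(offset, sample[offset:offset + 4])  # 9 b'xref'
```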
If a cross-reference stream is not being used, the footer is preceded by the trailer keyword followed by a dictionary containing
information that would otherwise be contained in the cross-reference stream object's dictionary:
A reference to the root object of the tree structure, also known as the catalog (/Root)
The count of indirect objects in the cross-reference table (/Size)
And other optional information.
PDF files come in two layouts: non-linear (not "optimized") and linear ("optimized"). Non-linear PDF files consume less disk space
than their linear counterparts, though they are slower to access because portions of the data required to assemble pages of the document
are scattered throughout the PDF file. Linear PDF files (also called "optimized" or "web optimized" PDF files) are constructed so that
a Web browser plugin can read them without waiting for the entire file to download, since they are written to disk in a
linear (as in page order) fashion.[18] PDF files may be optimized using Adobe Acrobat software or QPDF.
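One way to guess whether a file is linearized is to look for the linearization parameter dictionary, which ISO 32000-1 (Annex F) places within the first 1024 bytes and marks with a /Linearized key. That key is not mentioned in the text above; it is taken from the specification. The Python sketch below is only a heuristic under that assumption, and the function name looks_linearized is hypothetical. It does not verify that the rest of the file actually follows the linearized layout.

```python
import re

# /Linearized is followed by a version number such as 1 or 1.0.
LINEARIZED_RE = re.compile(rb"/Linearized\s+[\d.]+")


def looks_linearized(data: bytes) -> bool:
    """Heuristic: is there a /Linearized key in the first 1024 bytes?"""
    return LINEARIZED_RE.search(data[:1024]) is not None


linear = b"%PDF-1.5\n1 0 obj\n<< /Linearized 1 /L 1234 >>\nendobj\n"
plain = b"%PDF-1.5\n1 0 obj\n<< /Type /Catalog >>\nendobj\n"
print(looks_linearized(linear), looks_linearized(plain))  # True False
```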
Imaging model
The basic design of how graphics are represented in PDF is very similar to that of PostScript, except for the use of transparency, which was
added in PDF 1.4.