The Panel: Beebe@math - Utah.edu
Nelson H. F. Beebe
Center for Scientific Computing
University of Utah
Department of Mathematics, 322 INSCC
155 S 1400 E RM 233
Salt Lake City, UT 84112-0090
USA
Email: [email protected], [email protected], [email protected], [email protected] (Internet)
URL: http://www.math.utah.edu/~beebe
Telephone: +1 801 581 5254
FAX: +1 801 585 1640, +1 801 581 4148
TUGboat, Volume 22 (2001), No. 3 — Proceedings of the 2001 Annual Meeting 181
These goals have all been met in Adobe’s implementations of PDF
processing software.

Interestingly, TeX DVI files, defined fifteen years earlier, fulfill
all of these requirements, except for font embedding, encryption, and
incremental updating.

PDF advantages

Publishers and print shops like PDF, because such files are less
troublesome to deal with than PostScript files often are. Numerous
magazines and newspapers are now printed locally from master PDF files
shipped electronically, saving the significant expense and delay of
long-distance transportation of printed matter.

Some printer vendors exploit page independence to achieve very high
performance: IBM has a PDF printer with 24 CPUs simultaneously
rendering PDF page images to print at more than 400 pages/minute.

At least one PDF file viewer is freely available for each of the major
platforms, including a hand-held Personal Digital Assistant (PDA), so
the vast majority of computer users can view PDF files without cost.
Besides Adobe’s free Acrobat Reader, and their commercial Capture,
Catalog, Distiller, InDesign, Photoshop, and Illustrator tools, there
are Ghostscript (http://sourceforge.net/projects/ghostscript/),
Ghostview, gv, and xpdf (http://www.foolabs.com/xpdf/) for viewing and
printing, pdf2ps for printing, pdftotext for extracting raw text, and
Ghostscript’s ps2pdf and Frank Siegert’s PStill
(http://www.wizards.de/~frank/pstill.html) for converting PostScript
to PDF.

The availability of multiple independent implementations is critical
for demonstrating the sufficiency of the published PDF specification.
It also promotes market competition, and gives users alternatives when
the inevitable nasty software bug arises.

Apple’s MacOS X operating system uses PDF as the native screen
description format. There were early attempts to use PostScript for
that purpose by Sun, with the Network extensible Window System,
NeWS [8], in the late 1980s, and by NeXT, with Display
PostScript [2, 9], in the early 1990s. Regrettably, processing power
at the time was insufficient to make those efforts successful.

Adobe developed a special simplified generic PDF-producing printer
driver, PDFWriter, for Microsoft Windows and Apple MacOS. This has made
it possible for software vendors on those platforms to add PDF output
capability with relatively little effort. Regrettably, output quality
is sometimes inferior to what Distiller can produce, leading to user
confusion and dissatisfaction. Adobe Acrobat 5 now installs a
PostScript driver instead of PDFWriter.

PDF supports the notion of ‘thumbnails’: small bitmap images of pages
that can be quite helpful in navigating through those documents where
pages have recognizably different appearance. It also has bookmarks
and hypertext links.

PDF viewers also offer magnification, which can be quite helpful in
overcoming low screen resolution, or compensating for vision
impairment.

Newer PDF viewers provide for page rotation, which is essential for
reading documents with tables in landscape orientation.

Adobe offers a free PDF file creation service on the Web at
https://createpdf.adobe.com that can be used to convert files from a
variety of current desktop publishing and bitmap graphics file formats
to PDF.

PDF has been extended to handle forms: documents with boxes to be
filled out and transmitted electronically. The U.S. Internal Revenue
Service provides income tax forms this way.

PDF and TeX

Hàn Thế Thành’s important Ph.D. thesis research that led to pdfTeX has
shown how TeX users can directly enjoy the benefits of PDF. The close
coupling between typesetter and device driver makes some things
possible that would perhaps be impractical in the conventional
TeX → DVI → PostScript → PDF production path.

Elsewhere in these proceedings, Don Story shows how JavaScript can be
used with TeX and PDF to create interactive documents, and Hans
Hagen’s fine work with ConTeXt and PDF is almost magical.

The hyperref package, written by Sebastian Rahtz, Heiko Oberdiek, and
others, modifies LaTeX sectional and cross-referencing commands to emit
TeX \special commands to record hypertext links that some DVIware, and
pdfTeX, can deal with. PDF supports such links, so PDF file viewing is
automatically enhanced with navigational links. The package is
available in the CTAN archives at
ftp://ctan.tug.org/tex-archive/macros/latex/contrib/supported/hyperref/.

PDF and document archiving

In my view, the open specification and wide acceptance of PDF is very
likely to ensure that it can be used for ‘long-term’ document storage,
something
that cannot be said for any of the proprietary desktop publishing
formats.

Nevertheless, because PDF is a page description language, rather than
a document markup language, it is still best to preserve document
input forms, provided those are open and, possibly de-facto, standard.

PDF disadvantages: availability

Despite the praise of the previous sections, PDF is imperfect.

PDF implementations do not always agree with the specification, and
Adobe’s software often precedes the specification by months, or even a
few years, as happened with PostScript Level 3. Third-party software
developers then face the Herculean task of trying to reverse engineer
the specification from experiments with Adobe’s software. The
development of both Ghostscript and pdfTeX has been significantly
delayed by such problems.

Adobe’s initial support of PDF for Apple MacOS, IBM PC DOS and OS/2,
Microsoft Windows, and several flavors of UNIX (Compaq/DEC OSF/1,
GNU/Linux on Intel x86, Hewlett-Packard HP-UX, IBM AIX, and Sun SunOS
and Solaris) was encouraging. After all, a file format can hardly have
the term ‘Portable’ in its name if it is not usable almost everywhere.

Sadly, Adobe’s original commitment to broad support of PDF has been
sharply curtailed. While the free Acrobat Reader component is offered
for a number of platforms and human languages (see
http://www.adobe.com/products/acrobat/alternate.html), the Acrobat
product family with Distiller and Exchange has been completely dropped
on all but MacOS and Windows. This is extremely troublesome, when
Adobe markets PDF as a ubiquitous solution for page description.

The Acrobat releases for UNIX systems have an un-UNIX like command
line, and lack support for path searching to find needed files. They
also ship without any manual pages, a deficiency that I remedied
locally. I donated my work back to Adobe for free and unfettered
future distribution. Since the UNIX product line was dropped, that did
not happen, so I am willing to make that documentation available on
request to licensees of the product. For copyright reasons, I cannot
place it in a public archive.

Were it not for Aladdin Ghostscript, users on other platforms would be
mostly unable to produce PDF files at all. While the Aladdin Free
Public License is quite generous, it does restrict commercial re-use,
which, among other things, means that Aladdin Ghostscript cannot be
included on TUG’s annual TeX Live CD-ROM. Instead, TeX Live has to use
the approximately one-to-two-years-older GNU release of Ghostscript.

It is never a good idea to rely on any software product that has a
sole implementation, or runs only on a single platform. Software is
complex, and even the yet-to-be-written perfect software package can
be crippled by errors in the compiler, or run-time libraries, or
operating system, or even hardware. Scientific experiments are never
considered reliable until they have been independently reproduced.
Software use is, after all, just another kind of experiment, and
experience should have taught us to be highly skeptical of the outcome
of any change to input data, or to program code.

Thanks to the fine work of Karel Skoupý and the NTS team [16], even
TeX now has an independent implementation, although METAFONT still
does not.

PDF disadvantages: complexity

PDF is compact because of data compression, and use of a binary,
rather than ASCII, representation. Although the latter is possible,
and was originally touted as an advantage of PDF [7], in practice,
binary encoding is now almost universally used.

Compression and binary encoding both introduce a serious problem: data
transformations that were formerly simple in uncompressed plain text
now become immensely more complicated. A great many of the problems
posted to the PDF user and developer mailing lists would have
relatively simple solutions with plain text files.

What is needed is a standard tool for dumping PDF into a text format
that can be edited, then converted back to the binary form, much as
Geoffrey Tobin’s extremely useful dv2dt and dt2dv tools
(ftp://ctan.tug.org/tex-archive/dviware/dtl) do for DVI files, and Lee
Hetherington’s and Eddie Kohler’s t1disasm and t1asm utilities
(ftp://ctan.tug.org/tex-archive/fonts/utilities/t1utils) do for Type 1
outline font files. To my knowledge, no such freely-distributable tool
exists for PDF files.

PDF’s numbered, rather than named, object structure means that
modifications generally require complete parsing of PDF, because
objects must be renumbered if any are added or removed. Any future PDF
disassembler/assembler tool must take this into account: it should be
possible to hide this design blemish entirely.
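The renumbering burden that comes with numbered objects can be
illustrated with a minimal sketch. It assumes a toy model in which a
document is just a table mapping object numbers to uncompressed body
text, and indirect references have the textual form "N G R". The names
REF, renumber, and merge are hypothetical, chosen only for this
illustration; a real tool would also have to cope with compressed
streams, string literals that happen to look like references, the
cross-reference table, and the trailer, and would have to join the two
page trees, none of which is attempted here.

```python
import re

# Matches an indirect reference such as "12 0 R":
# object number, generation number, then the keyword R.
REF = re.compile(r"\b(\d+) (\d+) R\b")


def renumber(objects, offset):
    """Return a copy of `objects` with every object number shifted by `offset`.

    `objects` maps object number -> body text. References inside the
    bodies are rewritten as well, which is why every object has to be
    scanned even when only one is added or removed.
    """
    def shift(match):
        return f"{int(match.group(1)) + offset} {match.group(2)} R"

    return {num + offset: REF.sub(shift, body) for num, body in objects.items()}


def merge(first, second):
    """Combine two toy object tables so that no object numbers collide."""
    offset = max(first, default=0)  # highest number already in use
    merged = dict(first)
    merged.update(renumber(second, offset))
    return merged


# Two tiny "documents" that both use object numbers 1 and 2.
doc_a = {1: "<< /Type /Pages /Kids [2 0 R] >>", 2: "<< /Type /Page /Parent 1 0 R >>"}
doc_b = {1: "<< /Type /Pages /Kids [2 0 R] >>", 2: "<< /Type /Page /Parent 1 0 R >>"}
combined = merge(doc_a, doc_b)
```

In this toy model the second document's objects become numbers 3 and 4,
and the references between them are rewritten to match. With named
objects, no such rewriting pass would be needed.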
The PDF file structure makes it impossible to simply concatenate
multiple PDF documents to obtain a single document, something that is
generally problem free with PostScript files.

Until I wrote this article, I knew of no generally-available free
software that can combine PDF files, although there are commercial
products for desktop systems that do so.

Now, with the TeX Live distribution,¹ it is as simple as this:

    texexec --pdfarrange *.pdf --result=all

The resulting all.pdf file will contain all of the PDF files listed on
the command line.

¹ The TeX Live CD-ROM lacked space to include precompiled formats for
ConTeXt, so you first have to build them,

The binary format is also a serious problem for indexing of document
collections, such as by Web search engines, or search tools like
glimpse (see http://webglimpse.net/) or mg [19]. All of these need a
PDF disassembler. Adobe’s Acrobat Catalog product for indexing PDF
file collections is platform specific, and GUI based, making it
useless for many applications.

Except for dvipdfm (ftp://ctan.tug.org/tex-archive/dviware/dvipdfm/),
TeX DVI drivers are incapable of dealing with PDF. pdfTeX can import
PDF figures, but it cannot handle PostScript figures, or support the
wizardry of the pstricks package
(ftp://ctan.tug.org/tex-archive/graphics/pstricks/).

No PDF viewers provide information about the properties (font, color,
texture, metric, ...) of user-selected displayed text.

PDF disadvantages: no logical markup

The PDF format lacks begin/end markers for identifying words, lines,
paragraphs, sections, .... This is a serious design flaw that TeX DVI
and PostScript also share. UNIX troff at least outputs word and line
markers. The reason that these boundaries are important is that some
operations can reliably only be done on the formatted text, that is,
the text that actually appears on the page image. Such operations
include text extraction, cataloging and indexing, spell checking,
grammar checking, and string searching. Attempting to do so on the
input files is problematic: it is unreliable in the presence of macro
expansion (such as in TeX files), and the job must be done differently
for each possible document input format. It would be far better to
perform these actions on the final typeset text in PDF form, so that
the programming job could be done just once for all input formats.²

² The dvispell utility, announced by Daniel Taupin on the

The begin/end marker lack is just a special case of a more general
problem: all current page description languages (DVI, PCL, PDF,
PostScript, ...) completely lose all logical markup that was present
in the input. The PDF discussion lists again provide ample evidence
that what users really need is a page description language in which
all logical markup is preserved, allowing recovery of the input and
reliable translation into any markup system. It is simply not the case
that one can always go back to the original document: often, that
document is no longer available, or is in a proprietary format that is
no longer available or supported, or is not usable on the current
platform.

The recent PDF version 1.3 [3, Section 8.4.3] has some logical
structure facilities, and PDF version 1.4 [4] introduces the notion of
‘Tagged PDF’. These may supply the needed features to preserve logical
markup. One reviewer, however, expressed reservations at their
complexity, and it remains to be seen whether PDF-producing
applications will take advantage of them.

PDF disadvantages: design limitations

Cut-and-paste with Acrobat Reader is deficient: ligatures (fi, fl,
ffi, ffl, ...) are lost, or corrupted, on every MacOS, UNIX, and
Windows platform that I’ve used. xpdf does not have this problem.

While PDF viewers offer page selection for printing, only the
now-dropped Acrobat Exchange viewer had the ability to clip out a
rectangular region of a page and save it as a separate file, with the
‘supercrop’ toolbar item. That feature is poorly documented, hard to
use, imposes an obnoxious minimum crop size, and requires installation
of an additional plugin software component. Borrowing figures and text
snippets from other documents is a common need in document
preparation, so perhaps it is fear of copyright violation that
discourages software developers from including the capability in PDF
viewers.

Although the Acrobat product family is released in numbered versions
for multiple platforms, the viewer features differ between platforms.
For example, Acrobat Reader 4 on Apple MacOS has a page cropping
feature that is absent from the same version on Microsoft Windows, and
the toolbar and menus differ between the two versions. While the
differences are not major, they still require a certain amount of
mental retooling for the human user.

In my view, such differences are simply poor software design and
management. The window system interface, while platform-dependent,
should be a relatively small portion of the PDF viewer code, most of
which has the much more difficult task of dealing with complex PDF and
font file formats. For example, in xpdf version 0.92, less than 10% of
the code deals with the window system (as evidenced by inclusion of
window-system-related header files), out of a total of 175,000 lines
of C++ code (about nine times as much as either TeX or METAFONT have).

The color matching problem is still not satisfactorily solved,
although other page description formats have the same problem. We have
no technological way yet to guarantee that colors that the author used
are very close to what the remote reader or printer gets.

Text searching, and page changing, in all current PDF viewers are
vastly slower than those operations in a good text editor on a
similarly-sized body of text on the same platform, and
regular-expression pattern matching searches are unavailable. PDF
viewer startup times are also far too long. Sometimes, performance
gets worse instead of better: version 0.91 of xpdf introduced a much
more powerful font rendering engine that has dramatically slowed that
viewer. A test of viewing each page of this document showed that the
new version runs two to fourteen times slower, depending on whether
the file server and display are local or remote. Fortunately, the new
rendering can be turned off with a command-line option, restoring
performance to about the same as that of Acrobat Reader and gv.

The original PDF specification anointed 14 fonts as standard,
requiring them to be supported by all viewers, and therefore,
eliminating the need to store them in PDF files. When fonts are
omitted from PDF files, their metrics are still stored, so that when
the required font cannot be found, PDF viewers can substitute other
fonts and obtain correct letter spacing, even though the letter shapes
are wrong.

With the release of Acrobat version 4, the standard font set was
abandoned, and some viewers changed their default fonts, so that
displayed documents now look different. Unfortunately, Distiller does
not give the user sufficient control to ensure that all fonts will be
embedded or subsetted, so users may not be able to ensure the same
appearance everywhere for their PDF files. This particular flaw has
caused users of the U.S. National Science Foundation FastLane grant
proposal process a huge amount of grief, since PDF files that do not
include full embedding and subsetting are rejected.

PDF version 1.4 [4] introduced a transparency feature, something that
is completely absent from the PostScript imaging model of opaque
paint. It is uncertain how such documents will be converted back to
PostScript. So far, the transparency feature is little used, because
most software cannot yet produce it. Its omission from PostScript,
along with support for 3-D coordinates (and 4-D homogeneous
coordinates), are the major flaws in that language that prevent
PostScript from serving as a universal output format for modern
computer graphics.

PDF disadvantages: bugs

In order to simplify, or compress, complex PostScript files that use
language features (see [1, Appendix H.2.4]) that prevent their
inclusion in other documents as Encapsulated PostScript figures, it
can be helpful to convert such files to PDF, and then back to
PostScript.

Unfortunately, Distiller has an automatic page rotation feature that
is beyond user control, even though there is an option for it. I
posted an example to the PDF developers list showing two small
PostScript files differing only by a single comment: one was rotated
by Distiller, and the other was not. This has to be a bug, and it
completely prevents automated PostScript → PDF → PostScript cleanup of
collections of figure files. ps2pdf does not have this problem.

Several PDF producers incorrectly rename subsetted fonts, causing the
UniqueID problem discussed in the TeX Font Panel article elsewhere in
these proceedings.

One audience member reported that the Hewlett-Packard 4550N has
problems with some PDF files that other HP printer models with
PostScript Level 3 support do not have. This may perhaps be traced to
the lag between specification and software: the former should always
come first.

Some PDF viewers incorrectly handle fonts with characters in positions
0 ... 31 or 128 ... 159.

Despite the fact that Adobe’s own co-founder, and chief architect of
PostScript, showed over twenty years ago how to use monitor gray scale
for effective display of fonts [17], Acrobat Reader does a completely
unacceptable job of displaying PDF files that use bitmap fonts (see
http://www.math.utah.edu/~beebe/fonts/outline-vs-bitmap-fonts.html
for further discussion, and visual comparisons). There is no excuse
for this! PostScript and PDF are capable of handling several different
font formats, and