PDF Archiving
PDF as a Standard for Archiving
The Portable Document Format (PDF) as a solution for creating archives for both paper
and electronic documents
TABLE OF CONTENTS
Benefits and Requirements for Electronic Archives
  Today’s Archiving Challenges
  Benefits of Electronic Archives
  Establishing Requirements for Adequacy of Records
  Requirements in the Context of PDF Files
PDF Overview
  History of PDF
  PDF Basics
  PDF as an Archiving Format
File Format and Metadata Standards
  Standards: De Jure, De Facto, and Mandated
  The Role of Metadata in Electronic Archives
The Archiving Process
  The Workflow from Creation to Archive
  Migrating Archives to Ensure Preservation
The Future of Digital Archives
  A Worldwide Initiative
Resources
Benefits and Requirements for Electronic Archives
Today’s Archiving Challenges
If you suddenly couldn’t access the building where your vital records were kept, what would you do?
In today’s business and political climate, corporations and governments around the world are beginning
to pay close attention to their processes and the resulting record archives that they are, or are
not, keeping. In a paper-based world, traditional archiving has meant storage of paper, but what
happens as more and more records are created electronically? How do you preserve both paper-based and
electronic records in a consistent format? How do you eliminate the need for paper records? How do
you preserve the exact look and feel of a document today or 30 years from today? How do you provide
consistency in the integrity of your archives?
The introduction of personal computers into business has drastically changed the archiving
environment. Prior to the 1990s, most offices still had typing pools and word processing groups and kept
records on paper in centralized files. But once computers became the norm for the majority of workers,
the usefulness of the centralized file room disappeared. It became everyone’s responsibility to
create, file, and maintain his or her own documents. As a result, corporations and governments have lost
control over these records.
Benefits of Electronic Archives
The advantages of establishing and continually building an electronic archive for an organization are
numerous. Electronic archives unlock information that was previously difficult to access in paper
form, enable more effective sharing of information, and contribute to knowledge flows. No longer are
archives the domain of those few who truly understand the filing system. With electronic archives,
information can be made available to anyone in the organization by granting access privileges.
Additionally, electronic archives can contribute to extensive cost savings within an organization. The
cost associated with maintaining paper-based archives can be great, and electronic archives can help
to significantly reduce this cost. For instance, a well-known study from PricewaterhouseCoopers LLP
found that for every 12 filing cabinets in an organization one additional employee is required to
maintain them. Further, while professionals spend only 5% to 15% of their time reading information, they
spend up to 50% of their time looking for it.
Establishing Requirements for Adequacy of Records
Electronic archives can provide reliable evidence of past actions and decisions, but to do so they must
be managed so as to retain the integrity and authenticity of the records. Achieving this goal requires
paying attention to the records management program and expanding currently held definitions of
records to encompass not only paper but electronic records and other media as well. Establishing and
maintaining an electronic archive requires policy decisions, procedures, and organization-wide
planning along with a commitment to follow the organizational standards.
Preserving the content, context, and structure of records is not a new concern for records
management. Luciana Duranti of the University of British Columbia (UBC) employed the science of
diplomatics as the theoretical foundation of electronic records research, using the rules of diplomatics to
establish the reliability and authenticity of electronic records. UBC’s research concluded that
organizations need to use the same records policies and procedures regardless of whether the record is created
on paper, kept electronically, or converted to microfiche. By treating all records in the same manner,
the authenticity and integrity of the records are enhanced.
The requirements for the adequacy of records are determined by each organization’s internal
business and legal needs, as well as external regulations or requirements. Thus, the requirements for each
organization will be different. A thorough risk analysis must be performed with the full participation
of the organization’s legal department to determine the technological approach that is right for that
organization. The assessment team should include:
• Auditors and lawyers: Knowledge of the organization’s business structure, procedures, and laws and
policies that apply to the organization’s records
• Records managers and archivists: Knowledge of who accesses the records, why the records are
accessed, and how long the records need to remain accessible
• Record creators and users: Knowledge of the records’ business purpose and operational value
Requirements in the Context of PDF Files
• Authentic: It must be possible to prove that a record is what it purports to be, that it has been created
or sent by the person who claims to have created or sent it, and that it was sent at the time alleged.
This can be accomplished by use of metadata, which is data about the data. In the case of PDF files,
metadata can be programmatically embedded inside of the file, thereby ensuring that it is what it
purports to be. The creation, receipt, and transmission of records need to be controlled to ensure that
record creators are authorized and identified. While this is usually a function of the overall electronic
records management system, there are certain features of PDF files, such as security settings, that
support the establishment of authenticity. Electronic signatures are an additional level of authenticity
that can be applied to PDF files.
• Reliable: It must be possible to trust that the content of a record is an accurate representation of the
transaction to which it attests. It should be created and captured in a timely manner by an individual
who has direct knowledge of the event, or it should be generated automatically by processes routinely
used by the organization to conduct the transaction. This is particularly true for records of electronic
transactions. Using the PDF specification, a system integration firm can automate the capture of
digital records in PDF directly from the source application, whether it is a database, a word
processor, or a spreadsheet program.
• Complete and unaltered: It must be possible to protect a record against unauthorized alteration and
to monitor and track any authorized annotation, addition, or deletion. Records management policies
and procedures should specify what, if any, additions or annotations may be made to a record
after it is created, under what circumstances additions or annotations may be authorized, and who
is authorized to make them. This is typically in the realm of metadata changes, not changes to the
record itself. The records management system where PDF files are stored will typically provide a high
level of security. PDF files can also be secured with password protection and encryption. Additionally,
there are third-party digital signature and public key infrastructure (PKI) solutions for PDF
documents from companies such as Entrust, Inc., and VeriSign, Inc. Their products work within the
Adobe Acrobat® application as plug-ins.
• Usable: It must be possible to locate, retrieve, render, and interpret a record and understand the
sequence of activities in which it was created and used for as long as such evidence is required. The
newest feature of encapsulating XML metadata into the PDF file ensures that record-quality
metadata will be readable and accessible into the future. Additionally, PDF files are available for
full-text search. Many well-known software vendors, such as Verity, Hummingbird, and Convera, have
integrated PDF files into their full-text search engines for many years. This is because the PDF file
specification and software development kit (SDK) are publicly available, and a complete PDF library
used for PDF software development and integration is available for a nominal fee.
• System integrity: It must be possible to implement control measures such as access monitoring, user
verification, authorized destruction, security, and disaster mitigation to ensure the integrity of the
records. The reliability of systems is important to ensuring integrity, and records management
applications excel at this because they are designed with this in mind. Using a well-documented file
format as an archival standard is also important to ensuring individual document integrity. The PDF
specification has been used by scores of vendors to create unique applications that help ensure
integrity. The direct integration of third-party digital signature and PKI solutions is one example.
PDF Overview
History of PDF
In 1992, John Warnock, co-founder of Adobe Systems Incorporated, speaking about the goals of a
development project known as Camelot, said, “There is no universal way to communicate and view this printed
information electronically… What industries badly need is a universal way to communicate documents
across a wide variety of machine configurations, operating systems, and communication networks.”
The only attribute missing from his description was “over time.” The Camelot project developed the
technology known as PDF. PDF leveraged the ability of the PostScript language to render complex text
and graphics and brought this feature to the screen as well as the printer.
PDF Basics
PDF is a publicly available specification, regardless of the fact that Adobe created it and advances the
specification through subsequent releases. Many people confuse PDF, the data format, with Adobe
Acrobat, the software suite that Adobe sells to create, view, and enhance PDF documents. In 1993, the
first PDF specification was published at the same time the first Adobe Acrobat products were
introduced. Since then, updated versions of the PDF specification have continued to be available from Adobe via
the Web. The current version of the PDF specification at the date of this publication is version 1.4 and is
available at http://partners.adobe.com/asn/developer/acrosdk/docs.html. All of the revisions for which
specifications have been published are backward compatible, that is, if your computer can read version
1.4, it can also read version 1.3 and so on. Since Adobe chose to publish the PDF specification, there is
an ever-growing list of creation, viewing, and manipulation tools available from other vendors.
FOR MORE INFORMATION
Two excellent sources for information regarding Adobe Acrobat and third-party vendors can be found
at www.pdfzone.com and at www.pdfplanet.com.
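As an illustration of the backward compatibility described above, the version a file declares can be read from the `%PDF-M.m` comment at the start of the file. The following is a minimal sketch, not an Adobe API: the function names are hypothetical, and searching only the first 1024 bytes is an assumption made for this example.

```python
import re

# Pattern for the version comment that opens a PDF file, e.g. b"%PDF-1.4".
HEADER = re.compile(rb"%PDF-(\d)\.(\d)")


def pdf_version(data: bytes):
    """Return (major, minor) from the %PDF-M.m header, or None if it is absent.

    Assumption for this sketch: the header appears in the first 1024 bytes.
    """
    match = HEADER.search(data[:1024])
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def reader_can_open(reader_version, file_version):
    """Backward compatibility: a reader for 1.4 also reads 1.3, 1.2, and so on."""
    return file_version <= reader_version


# A tiny stand-in for the first bytes of a version 1.3 file.
sample = b"%PDF-1.3\n1 0 obj\n<< >>\nendobj\n"
version = pdf_version(sample)
print(version, reader_can_open((1, 4), version))
```

An archive could record this value as metadata at ingest so that later migrations know which specification revision each file targets.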
The term Portable Document Format, or PDF, was coined to illustrate that a file conforming to this
specification can be viewed and printed on any platform (UNIX®, Mac OS, Microsoft® Windows®, and
several mobile devices as well) with the same fidelity. A PDF document is the same for any of these
platforms. It consists of a sequence of pages, with each page including the text, font specifications,
margins, layout, graphical elements, and background and text colors. With all of this information present,
the PDF file can be imaged accurately for the screen and the printing device. It can also include other
items such as metadata, hyperlinks, and form fields.
In order to ensure the specification can be used by third-party developers, Adobe has provided both
an SDK and the Adobe PDF Library. Entire solutions can be developed outside of the Acrobat product
family, or the Acrobat products can be modified with the development of internal plug-ins. Developers
have even used just the PDF specification to create their own PDF viewers or creators. Every aspect of
the file format and the manner in which it can be created, read, and manipulated is detailed in these
documents. By providing this level of support, Adobe has encouraged the adoption and use of PDF by a
variety of sources.
created. With the addition of XML metadata to the PDF file, we can have both fidelity and
accessibility. Because PDF is a publicly available specification, the information about the file format will always
be in the public domain, making it a very attractive format to select for electronic archives. People
with disabilities can also access the information using assistive technology. For instance, a visually
impaired person might use a screen reader, available from vendors such as Freedom Scientific, Dolphin
Oceanic, and GW Micro, to verbalize the text. This is done through embedded tags in the PDF file
structure. These tags can be created automatically from the originating application or entered as part
of an enhancement process.
3. Make sure no file is submitted with passwords or encryption.
4. Discourage the use of embedded executable code.
5. Standardize the method of linking between files (e.g., use relative links if files are submitted together).
Many organizations that are using electronic archives are implementing procedures that limit the
formats of records they will receive and store. This reduces the number of file format investigations
and support mechanisms that are required. The Dutch National Archives is currently supporting
electronic document archive formats like PDF and XML. The Australian Victorian Electronic Records
Strategy (VERS) uses XML to encapsulate PDF records along with standardized metadata. The U.K. Public
Record Office limits its formats for transfer into the archives to PostScript, TIFF, SGML, and PDF.
De jure standards take a long time to develop and must be approved by every organization that is a
member of the standards organization with interests in the area covered. These standards bodies
generally include industry members, technology developers, engineers, and specifications experts.
Standards must be clear and concise, not left to interpretation. Vague standards can cause new
interoperability problems or continue the same problems that were in existence before the standards were
made. Examples of de jure standards include:
• Z39.50: Z39.50 refers to the International Standard, ISO 23950: “Information Retrieval (Z39.50):
Application Service Definition and Protocol Specification,” and to ANSI/NISO Z39.50.
• MARC 21: MARC is the acronym for Machine-Readable Cataloging. It defines a data format that
emerged from a U.S. Library of Congress-led initiative that was begun 30 years ago. MARC became
USMARC in the 1980s and MARC 21 in the late 1990s. It provides the mechanism by which
computers exchange, use, and interpret bibliographic information, and its data elements make up the
foundation of most library catalogs used today.
• JPEG: JPEG is a standardized image compression mechanism. JPEG stands for Joint Photographic
Experts Group, the original name of the committee that wrote the standard.
De facto standards spring up in response to an immediate industry need. They gain in use and
popularity through market dictates. They are usually maintained by the group or business that originated
them, and they have no community review. These standards tend to be narrower in scope and designed
for one specific purpose. They penetrate the market and become a standard by virtue of the fact that
they solve key industry problems. PostScript and PDF are both examples of de facto standards.
Mandated standards include those that are either regulated or “suggested for compatibility.” One
example of an organization that mandates standards is the U.S. Food and Drug Administration (FDA), for
New Drug Applications (NDAs). These can be submitted electronically, but the FDA wants the document
portion as a PDF file with very specific attributes, such as bookmarks and hyperlinks. In fact, the
FDA publishes guidance documents on the subject of acceptable electronic formats at
www.fda.gov/cder/guidance. Australia’s Victorian Electronic Records Strategy (VERS) project mandates PDF for records
that are subsequently wrapped with XML records metadata. By establishing appropriate policies and
procedures, individual organizations are effectively mandating standards for their own internal use.
Another example of a mandated standard is the use of PDF/X. PDF/X is an ISO standard that was
initially created in the press and advertising communities. In those industries, whether an organization
creates or receives PDF documents, they must be ready for output right away. This is a PDF for graphic
art professionals and has a specific subset of standards that comply with the ISO standard 15930-1:2001.
There is more information about PDF/X at www.pdfx.info.
The Dublin Core Metadata Initiative uses Resource Description Framework (RDF) because RDF allows
metadata schemes to be read by humans as well as parsed by machines and allows multiple objects to
be described without specifying additional detail. The underlying glue, XML, simply requires that all
namespaces be defined. Once they are defined, they can then be used to the extent needed by the
provider of the metadata. Dublin Core metadata elements can be contained within PDF files. The
following example of metadata extracted from a PDF file identifies the Dublin Core namespace
(xmlns='http://purl.org/dc/elements/1.1/') and identifies three pieces of metadata: creator, title, and description.
PDF files contain, but are not limited to, metadata expressed by the document properties. Any changes
made in the Acrobat Document Properties dialog box are reflected in the metadata. Because metadata
is in XML format, it can be extended and modified using third-party products. By examining the
metadata in the PDF files, it is apparent that the PDF specification has embraced the Dublin Core initiative.
<rdf:Description about=''
 xmlns='http://purl.org/dc/elements/1.1/'
 xmlns:dc='http://purl.org/dc/elements/1.1/'>
 <dc:creator>Adobe Systems, Incorporated</dc:creator>
 <dc:title>Adobe Acrobat Help</dc:title>
 <dc:description>Adobe Acrobat</dc:description>
</rdf:Description>
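Because the metadata is ordinary XML, a fragment like the one above can be read with a general-purpose XML parser. The following minimal sketch uses Python's standard `xml.etree.ElementTree` module. The `rdf:RDF` wrapper with its namespace declaration is added here only so the fragment parses on its own, and the function name `dublin_core_fields` is hypothetical.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Assumption: the rdf:RDF wrapper is supplied here so that the rdf prefix is
# declared; in a real file it would come from the surrounding metadata packet.
packet = f"""<rdf:RDF xmlns:rdf='{RDF}'>
<rdf:Description about=''
 xmlns='{DC}'
 xmlns:dc='{DC}'>
 <dc:creator>Adobe Systems, Incorporated</dc:creator>
 <dc:title>Adobe Acrobat Help</dc:title>
 <dc:description>Adobe Acrobat</dc:description>
</rdf:Description>
</rdf:RDF>"""


def dublin_core_fields(xml_text):
    """Collect every element in the Dublin Core namespace as name -> text."""
    root = ET.fromstring(xml_text)
    prefix = "{" + DC + "}"
    fields = {}
    for element in root.iter():
        if element.tag.startswith(prefix):
            fields[element.tag[len(prefix):]] = (element.text or "").strip()
    return fields


fields = dublin_core_fields(packet)
print(fields)
```

Matching on the namespace URI rather than the `dc:` prefix keeps the lookup correct even if another tool writes the same elements under a different prefix.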
Organizations must establish an organizational metadata standard that will specify the type of
information that will describe the identity, authenticity, content, structure, context, and essential
management requirements of records. This standard, descriptive information will enable reliable,
meaningful, and accessible records to be carried forward through time to satisfy business needs and
evidential requirements.
There are a variety of international efforts to establish metadata standards. These efforts can provide
good beginning points for an organization to consider standards in metadata practices.
• Victorian Electronic Records Strategy Project, “VERS Metadata Scheme: Public Record Office
Standard, PROS 99/007, Specification 2,” www.prov.vic.gov.au/vers, August 2002
To establish an electronic archive, an organization needs a central records management system. This
system needs to address records that are “born” electronic, those that will be converted from paper
to electronic (scanned), and those that will never be electronic (due to perceived value or costs). For
records that are born electronic, desktop conversion, automated processes, or server-based processes
can accomplish the conversion to PDF documents. Examples of software that perform these functions
are Adobe Acrobat software and Adobe Acrobat Distiller® Server. To convert paper to electronic files,
Adobe Acrobat Capture can be used to scan and convert paper documents to PDF documents. Using
Adobe Acrobat Capture is a quick way of digitally enabling an organization’s paper archives.
PDF files are very useful as an archive format because the text in a PDF file is accessible to the
full-text search indexing engine available in most records management systems. Thus, the archive can be
searched across its metadata and its full text. If it is necessary to find memos created between certain
dates that contain the words decision and bankruptcy, it can be done. Even the paper documents that
are scanned and converted to PDF documents can be made searchable by use of an optical character
recognition (OCR) engine. This technology identifies the appearance of text on a page and can convert
the scanned image to recognizable text.
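The memo search described above combines a metadata filter (document type and creation date) with a full-text filter (required words). The sketch below shows that combination over a small in-memory list. The `ArchivedRecord` shape, the function name, and the sample records are all hypothetical, and a real records management system would use its own index rather than a linear scan.

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class ArchivedRecord:
    # Hypothetical record shape: a few metadata fields plus the text
    # extracted from the PDF (directly, or through OCR for scanned paper).
    title: str
    doc_type: str
    created: date
    full_text: str


def find_records(records, doc_type, start, end, required_words):
    """Return titles of records matching the metadata and full-text criteria."""
    required = {word.lower() for word in required_words}
    hits = []
    for record in records:
        if record.doc_type != doc_type:
            continue  # metadata filter: wrong kind of document
        if not (start <= record.created <= end):
            continue  # metadata filter: outside the date range
        words = set(re.findall(r"[a-z]+", record.full_text.lower()))
        if required <= words:  # full-text filter: every required word present
            hits.append(record.title)
    return hits


archive = [
    ArchivedRecord("Memo A", "memo", date(2001, 3, 5), "The decision on bankruptcy filing is final."),
    ArchivedRecord("Memo B", "memo", date(2001, 4, 9), "Decision pending on the lease."),
    ArchivedRecord("Memo C", "memo", date(1999, 1, 2), "Bankruptcy decision archived."),
    ArchivedRecord("Report D", "report", date(2001, 3, 6), "Decision and bankruptcy summary."),
]
matches = find_records(archive, "memo", date(2001, 1, 1), date(2001, 12, 31), ["decision", "bankruptcy"])
print(matches)
```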
Emulation is the re-creation of the technical environment required to use older digital objects, for
example, running a DOS program in a Microsoft Windows operating system. Migration is routinely
moving the data to new hardware and software configurations. Each move must be documented and
checked for completeness. Migration is all the more reliable if all electronic records conform to a
limited set of standardized formats.
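One way to document a move and check it for completeness is to compare a manifest of checksums taken before and after the migration. The sketch below is an illustration only: the use of SHA-256, the function names, and the in-memory sample files are assumptions, and the byte-for-byte comparison applies to moves between storage systems, not to conversions that rewrite the file format.

```python
import hashlib


def manifest(files):
    """Map each record name to the SHA-256 digest of its bytes."""
    return {name: hashlib.sha256(data).hexdigest() for name, data in files.items()}


def migration_report(before, after):
    """Compare two manifests and list records that are missing or altered."""
    missing = sorted(set(before) - set(after))
    changed = sorted(
        name for name in before if name in after and before[name] != after[name]
    )
    return {
        "missing": missing,
        "changed": changed,
        "complete": not missing and not changed,
    }


# Hypothetical source and target stores, represented as name -> bytes.
source = {"memo-001.pdf": b"%PDF-1.4 memo one", "memo-002.pdf": b"%PDF-1.4 memo two"}
target = {"memo-001.pdf": b"%PDF-1.4 memo one"}

report = migration_report(manifest(source), manifest(target))
print(report)
```

Keeping the manifest alongside the archive also gives later migrations a baseline to compare against.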
One of the most significant costs associated with the life cycle maintenance of an electronic archive
can be the migration cost of moving a document from one version of software to another. The effort
associated with this migration of the file format can be as simple as opening the document and saving
it as the new format. However, experience has shown that migration is usually not this simple.
Opening documents that were created in earlier software versions can create problems with the page layout,
heading numbering, graphics, and so forth. Sometimes these problems are due to the software, but
they can also be caused by the manner in which the user employed, or attempted to employ, a software
feature. In these situations, a user may have to spend time reformatting the document to make sure
that it looks identical to the original document. If there is not a hard copy of the original document
available and the user does not have a working copy of the previous version of the software, then it
may be impossible to reformat the document to look exactly like the original. The hidden cost in these
migration efforts is the manpower needed to ensure that the record still maintains its integrity. That
is why PDF is being used by many organizations as the format to store electronic records. A PDF file
represents the printed page and does not change when opened, unlike a document saved in a word
processing format, which can change.
Because the PDF specification is publicly available, your organization could even create an archive
checker that could automatically scan incoming PDF files. Such automated checks could be used to
flag malformed PDF files or files that fail to meet archival standards, thus avoiding expensive manual
checks on all incoming files.
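A very small version of such an archive checker might look like the sketch below. It is a heuristic byte scan, not a PDF parser: the function name and the particular checks (header, end-of-file marker, encryption, embedded JavaScript) are choices made for this illustration, and a production checker would need to parse the file structure, since these markers can be hidden inside compressed streams or written in escaped form.

```python
def check_pdf(data: bytes):
    """Return a list of problems found in an incoming file (empty if none).

    Heuristic sketch only: it looks for well-known markers in the raw bytes.
    """
    problems = []
    if b"%PDF-" not in data[:1024]:
        problems.append("missing %PDF- header")
    if b"%%EOF" not in data[-1024:]:
        problems.append("missing %%EOF marker")
    if b"/Encrypt" in data:
        problems.append("encryption dictionary present")
    if b"/JavaScript" in data:
        problems.append("embedded script present")
    return problems


# Tiny stand-ins for incoming files; real PDF files are far larger.
good = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog >>\nendobj\ntrailer\n<< /Root 1 0 R >>\n%%EOF\n"
encrypted = b"%PDF-1.4\ntrailer\n<< /Encrypt 5 0 R /Root 1 0 R >>\n%%EOF\n"

for name, data in [("good", good), ("encrypted", encrypted), ("garbage", b"not a pdf")]:
    print(name, check_pdf(data))
```

Files that return an empty list can go straight into the archive, and only the flagged files need manual review.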
Because the PDF specification is publicly available, archival organizations can be assured that PDF
will be supported for years to come. Anyone, at any time, using any hardware or software, can create
programs to access electronic archives. In fact, there is already a robust community of developers in
existence today who create tools for PDF. This community continues to grow every year.
The combination of worldwide standards, extensive tools, and technology, along with a publicly
available standard, makes the decision to use PDF as the format for electronic archives an easy one. Using
PDF will assure streamlined and continuous access to archives and records for years to come.
Resources
“The Long-Term Preservation of Authentic Electronic Records: Findings of the InterPARES Project”
www.interpares.org/book/index.cfm
Adobe Systems Incorporated • 345 Park Avenue, San Jose, CA 95110-2704 USA • www.adobe.com
Adobe, the Adobe logo, Acrobat, Acrobat Capture, the Adobe PDF logo, Distiller, PostScript, and “Tools for the New Work” are either registered trademarks or
trademarks of Adobe Systems Incorporated in the United States and/or other countries. Mac is a trademark of Apple Computer, Inc., registered in the United
States and other countries. Microsoft and Windows are either registered trademarks or trademarks of Microsoft Corporation in the United States and/or other
countries. UNIX is a registered trademark of The Open Group. All other trademarks are the property of their respective owners.
© 2002 Adobe Systems Incorporated. All rights reserved. Printed in the USA. 95001210 2/03