by Rainer Klute
1. HPSF Internals
1.1. Introduction
A Microsoft Office document is internally organized like a filesystem with directories and
files. Microsoft calls these files streams. A document can have properties attached to it, like
author, title, number of words etc. These metadata are not stored in the main stream of, say, a
Word document, but instead in a dedicated stream with a special format. Usually this
stream's name is \005SummaryInformation, where \005 represents the character with
a decimal value of 5.
A single piece of information in the stream is called a property, for example the document
title. Each property has an integral ID (e.g. 2 for title), a type (telling that the title is a string
of bytes) and a value (what this is should be obvious). A stream containing properties is
called a property set stream.
This document describes the internal structure of a property set stream, i.e. the HPSF. It does
not describe how a Microsoft Office document is organized internally and how to retrieve a
stream from it. See the POIFS documentation for that kind of stuff.
The HPSF is not only used in the Summary Information stream in the top-level document of
a Microsoft Office document. Often there is also a property set stream named
\005DocumentSummaryInformation with additional properties. Embedded
documents may have their own property set streams. You cannot tell by a stream's name
whether it is a property set stream or not. Instead you have to open the stream and look at its
bytes.
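As a rough illustration of "looking at the bytes", the sketch below tests whether a byte array starts the way a property set stream is generally documented to start. The header layout used here is an assumption that this section does not spell out: a 16-bit byte-order mark 0xFFFE at offset 0 and a 16-bit format field 0x0000 at offset 2, both little endian. The class and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class PropertySetSniffer {
    /**
     * Returns true if the first four bytes match the assumed header start:
     * Word 0xFFFE (byte order) followed by Word 0x0000 (format).
     * This is a plausibility check only, not a full validation.
     */
    static boolean looksLikePropertySet(byte[] stream) {
        if (stream == null || stream.length < 4) {
            return false;
        }
        ByteBuffer buf = ByteBuffer.wrap(stream).order(ByteOrder.LITTLE_ENDIAN);
        int byteOrder = buf.getShort(0) & 0xFFFF; // assumed: 0xFFFE
        int format = buf.getShort(2) & 0xFFFF;    // assumed: 0x0000
        return byteOrder == 0xFFFE && format == 0x0000;
    }
}
```

A stream that passes this check may still be malformed further on, so a real reader would go on to validate the rest of the header.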
Copyright © 2002-2007 The Apache Software Foundation All rights reserved.
Apache POI - HPSF Internals
represented by a byte 0x12 followed by another byte 0x34, you are right. This is called the
big endian format. In the little endian format, however, this order is reversed and the
low-value byte comes first: 0x3412.
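The difference between the two byte orders can be tried out with a few lines of Java. This is only a sketch using the standard `java.nio.ByteBuffer` class; the class name `EndianDemo` and the helper `readWord` are invented for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    /** Interprets the first two bytes as an unsigned 16-bit value in the given byte order. */
    static int readWord(byte[] bytes, ByteOrder order) {
        return ByteBuffer.wrap(bytes).order(order).getShort() & 0xFFFF;
    }

    public static void main(String[] args) {
        byte[] bigEndian = {0x12, 0x34};    // high-value byte first
        byte[] littleEndian = {0x34, 0x12}; // low-value byte first
        // Both calls yield the same number, 0x1234.
        System.out.println(Integer.toHexString(readWord(bigEndian, ByteOrder.BIG_ENDIAN)));
        System.out.println(Integer.toHexString(readWord(littleEndian, ByteOrder.LITTLE_ENDIAN)));
    }
}
```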
The following table gives an overview of some important data types:
Name Length Example (Big Endian) Example (Little Endian)
Bytes 1 byte 0x12 0x12
2 Word 0x0000
the section count field in the header says. The Summary Information stream contains a single
section, the Document Summary Information stream contains two.
Type     Contents           Remarks
ClassID  Section format ID  0xF29F85E04FF91068AB9108002B27B3D9 for the single section in the
                            Summary Information stream.
                            0xD5CDD5022E9C101B939708002B2CF9AE for the first section in the
                            Document Summary Information stream.
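A reader could use the two format IDs above to recognize which kind of section it is looking at. The following sketch assumes the 16 ID bytes are already available in the order in which they are written above; how they are laid out inside the stream is not covered here. The class `SectionFormatIds` and its methods are hypothetical names for this example.

```java
class SectionFormatIds {
    // The two well-known section format IDs, written as in the table above.
    static final String SUMMARY_INFORMATION = "F29F85E04FF91068AB9108002B27B3D9";
    static final String DOCUMENT_SUMMARY_INFORMATION = "D5CDD5022E9C101B939708002B2CF9AE";

    /** Renders a byte array as an upper-case hexadecimal string. */
    static String toHex(byte[] id) {
        StringBuilder sb = new StringBuilder();
        for (byte b : id) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    /** Names the section kind for a 16-byte format ID, or "unknown". */
    static String describe(byte[] formatId) {
        String hex = toHex(formatId);
        if (hex.equals(SUMMARY_INFORMATION)) {
            return "Summary Information";
        }
        if (hex.equals(DOCUMENT_SUMMARY_INFORMATION)) {
            return "Document Summary Information (first section)";
        }
        return "unknown";
    }
}
```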
1.6. Section
A section is divided into three parts: the section header (with the section length and the
number of properties in the section), the properties list (with type and offset of each
property), and the properties themselves. Here are the details:
Part            Type   Contents  Remarks
Section header  DWord  Length    The length of the section in bytes.
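Reading the section header could look roughly like the sketch below. It assumes that the header consists of two consecutive little-endian DWords, first the section length and then the number of properties, in the order in which the text names them. The class `SectionHeader` is a hypothetical helper, not part of any library.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class SectionHeader {
    final long length;        // length of the section in bytes
    final long propertyCount; // number of properties in the section

    SectionHeader(long length, long propertyCount) {
        this.length = length;
        this.propertyCount = propertyCount;
    }

    /** Parses the first eight bytes of a section (assumed layout: length, then count). */
    static SectionHeader parse(byte[] section) {
        if (section.length < 8) {
            throw new IllegalArgumentException("section header needs at least 8 bytes");
        }
        ByteBuffer buf = ByteBuffer.wrap(section).order(ByteOrder.LITTLE_ENDIAN);
        long length = buf.getInt(0) & 0xFFFFFFFFL; // mask to treat the DWord as unsigned
        long count = buf.getInt(4) & 0xFFFFFFFFL;
        return new SectionHeader(length, count);
    }
}
```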
details.
infinite wisdom the section format ID is not part of the section. Thus if you have only a
section without the stream it is in, you cannot make any sense of the properties because you
do not know what they mean.
So each section format ID has its own name space of property IDs. Microsoft defined some
"well-known" property IDs for the Summary Information and the Document Summary
Information streams. You can extend them by your own additional IDs. This will be
described below.
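One way to picture these separate name spaces is a lookup table keyed first by section format ID and then by property ID. The sketch below is an illustration only; the class `PropertyNames` and its methods are invented for it. The name "title" for ID 2 comes from the introduction. The name "category" for ID 2 in the Document Summary Information section is taken from Microsoft's well-known property IDs and is not stated in this text.

```java
import java.util.HashMap;
import java.util.Map;

class PropertyNames {
    // Outer key: section format ID as hex string; inner key: property ID.
    private final Map<String, Map<Integer, String>> namesByFormatId = new HashMap<>();

    /** Registers a name for a property ID within one section format ID. */
    void register(String formatId, int propertyId, String name) {
        namesByFormatId.computeIfAbsent(formatId, k -> new HashMap<>()).put(propertyId, name);
    }

    /** Returns the registered name, or null if the ID is unknown in that name space. */
    String lookup(String formatId, int propertyId) {
        Map<Integer, String> names = namesByFormatId.get(formatId);
        return names == null ? null : names.get(propertyId);
    }

    public static void main(String[] args) {
        PropertyNames names = new PropertyNames();
        names.register("F29F85E04FF91068AB9108002B27B3D9", 2, "title");
        names.register("D5CDD5022E9C101B939708002B2CF9AE", 2, "category");
        // The same property ID resolves differently depending on the format ID.
        System.out.println(names.lookup("F29F85E04FF91068AB9108002B27B3D9", 2));
        System.out.println(names.lookup("D5CDD5022E9C101B939708002B2CF9AE", 2));
    }
}
```

Without the format ID as the outer key, the lookup for ID 2 would be ambiguous, which is exactly the problem with a section taken out of its stream.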
68 VT_STREAMED_OBJECT [P] Stream contains an object
object
0x8000 VT_RESERVED
0xFFFF VT_ILLEGAL
0xFFF VT_ILLEGALMASKED
0xFFF VT_TYPEMASK
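The mask values at the end of the table can be applied with a bitwise AND. The sketch below assumes, going by its name, that VT_TYPEMASK is meant to strip any higher bits from a type value and leave the basic type; the constants are copied from the table, while the class `VariantTypes` and the method `baseType` are made up for this example.

```java
class VariantTypes {
    static final int VT_STREAMED_OBJECT = 68;
    static final int VT_RESERVED = 0x8000;
    static final int VT_ILLEGAL = 0xFFFF;
    static final int VT_ILLEGALMASKED = 0xFFF;
    static final int VT_TYPEMASK = 0xFFF;

    /** Keeps only the low 12 bits of a type value (assumed purpose of VT_TYPEMASK). */
    static int baseType(int type) {
        return type & VT_TYPEMASK;
    }

    public static void main(String[] args) {
        // Masking VT_ILLEGAL gives VT_ILLEGALMASKED, which matches the table.
        System.out.println(baseType(VT_ILLEGAL) == VT_ILLEGALMASKED);
        // A type with the reserved bit set is reduced to its basic type.
        System.out.println(baseType(VT_RESERVED | VT_STREAMED_OBJECT) == VT_STREAMED_OBJECT);
    }
}
```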
The dictionary entries follow the header. Each one looks like this:
Name Data type Description
The entries are not aligned, i.e. each one follows its predecessor without any gap or fill
characters.
1.10. References
In order to assemble the HPSF description I used only information publicly available on the
Internet. The references given below have been very helpful. If you have any
amendments or corrections, please let us know! Thank you!
1. In Understanding OLE documents, Ken Kyler gives an introduction to OLE2 documents
and especially to property sets. He names the property names, types, and IDs of the
Summary Information and Document Summary Information streams.
2. The ActiveX Programmer's Reference at http://www.dwam.net/docs/oleref/ seems a little
outdated, but that's what I have found.
3. An overview of the VT_ types is in Variant Type Definitions.
4. What is a FILETIME? The answer can be found at
http://www.vbapi.com/ref/f/filetime.html or
http://www.cs.rpi.edu/courses/fall01/os/FILETIME.html. In short: The FILETIME
structure holds a date and time associated with a file. The structure identifies a 64-bit
integer specifying the number of 100-nanosecond intervals which have passed since
January 1, 1601. This 64-bit value is split into the two dwords stored in the structure.
5. Microsoft provides some public information in the MSDN Library. Use the search
function to try to find what you are looking for, e.g. "codepage" or "document summary
information" etc.
6. This documentation originates from the HPSF description available at
http://www.rainer-klute.de/~klute/Software/poibrowser/doc/HPSF-Description.html.
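The FILETIME description in reference 4 translates into a short conversion routine. The sketch below combines the two dwords into the 64-bit count of 100-nanosecond intervals since January 1, 1601 and converts it to a `java.time.Instant`. It assumes the dwords have already been read as Java `int` values and that the 64-bit value is non-negative; the class `FileTimes` is a hypothetical helper.

```java
import java.time.Instant;

class FileTimes {
    // Seconds between 1601-01-01T00:00:00Z and 1970-01-01T00:00:00Z.
    static final long EPOCH_DIFF_SECONDS = 11_644_473_600L;
    // Number of 100-nanosecond intervals per second.
    static final long TICKS_PER_SECOND = 10_000_000L;

    /** Converts the low and high dwords of a FILETIME to an Instant. */
    static Instant toInstant(int lowDword, int highDword) {
        // Mask the low dword so that it is treated as unsigned.
        long ticks = ((long) highDword << 32) | (lowDword & 0xFFFFFFFFL);
        long seconds = ticks / TICKS_PER_SECOND - EPOCH_DIFF_SECONDS;
        long nanos = (ticks % TICKS_PER_SECOND) * 100L;
        return Instant.ofEpochSecond(seconds, nanos);
    }
}
```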