How To Retrieve Text From A Binary .doc File
DIaLOGIKa/makz/math/wk/divo 4 March 2008
Contents
Introduction
How Text is Stored in a .doc File
    Streams in a .doc File
    The piece table
Algorithm for Retrieving Text
    Loading the FIB
    Loading and Parsing the piece table
Introduction
This document explains how the text content in a binary .doc file can be identified and extracted.
The following descriptions and algorithms are based on the Microsoft Word 2007 Binary File Format
Documentation and our own findings.
Note: For information about the structured storage format, see the “Windows
Compound Binary File Format Specification”.
Another stream, called 0Table or 1Table (a flag in the FIB determines which name is actually used for
this stream, see below), contains information about the fragmentation of the text part. This
information is called the piece table.
Word splits the text into several pieces if different encodings are used for different paragraphs or
text runs. For example:
Hello World
…some more CP1252 text…
αβγ – Greek is nice!
The first piece contains all the characters up to the Greek characters and is encoded
using codepage 1252.
The second piece contains the Greek and remaining characters and is Unicode-encoded.
The following algorithm assembles the text of the document and saves it in a string variable:
Note: The Word file format specification describes this FC/LCB pair as “Offset in table stream of
beginning of information for complex files.” This description is misleading, since
the piece table information is also used for non-complex files.
To load the piece table, we must first determine how Word named the table stream. Bit 0x0200 of the
16 bit word at offset 0x000A of the FIB decides this: if the bit is set, the stream was saved as 1Table,
otherwise as 0Table. In the following FIB, the word at 0x000A is 0x12F0, so the bit is set:
(0000) EC A5 C1 00 7D 80 09 04 00 00 F0 12 BF 00 00 00 ....}...........
(0010) 00 00 00 10 00 00 00 00 00 08 00 00 DE 4C 00 00 .............L..
…
(0190) 00 00 A7 19 00 00 A2 02 00 00 49 1C 00 00 74 00 ..........I...t.
(01A0) 00 00 96 17 00 00 2D 00 00 00 00 00 00 00 00 00 ......-.........
(01B0) 00 00 00 00 00 00 00 00 00 00 55 15 00 00 00 00 ..........U.....
…
0x0000002D: complex information length in 1Table
Note: All the number values shown in this and the following hex dumps are in swapped byte order
(“little endian”).
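The values in the dump above can be read with a few lines of C#. The following is only a minimal sketch, not the original listing. It assumes that the beginning of the WordDocument stream is already available as a byte array, that the code runs on a little-endian machine (BitConverter follows the machine's byte order), and that it is compiled as C# top-level statements (C# 9 or later). The helper names UsesTable1 and ReadClxLocation are made up for illustration, and the offsets 0x01A2 and 0x01A6 of the FC/LCB pair are taken from the positions of the values in the dump.

```csharp
using System;

// Hypothetical helper: true if the table stream is named "1Table".
// Tests bit 0x0200 of the 16 bit word at FIB offset 0x000A.
static bool UsesTable1(byte[] fib)
{
    ushort flags = BitConverter.ToUInt16(fib, 0x000A);
    return (flags & 0x0200) != 0;
}

// Hypothetical helper: reads the FC/LCB pair that locates the piece table
// information. Offsets 0x01A2 / 0x01A6 are assumed from the hex dump.
static void ReadClxLocation(byte[] fib, out int fcClx, out int lcbClx)
{
    fcClx = BitConverter.ToInt32(fib, 0x01A2);
    lcbClx = BitConverter.ToInt32(fib, 0x01A6);
}

// Rebuild only those bytes of the dump that matter here.
byte[] dumpBytes = new byte[0x01C0];
dumpBytes[0x000A] = 0xF0; dumpBytes[0x000B] = 0x12;  // flag word 0x12F0
dumpBytes[0x01A2] = 0x96; dumpBytes[0x01A3] = 0x17;  // offset 0x00001796
dumpBytes[0x01A6] = 0x2D;                            // length 0x0000002D

ReadClxLocation(dumpBytes, out int offset, out int length);
string tableName = UsesTable1(dumpBytes) ? "1Table" : "0Table";
Console.WriteLine($"{tableName}: offset 0x{offset:X8}, length 0x{length:X8}");
// prints "1Table: offset 0x00001796, length 0x0000002D"
```

On a big-endian machine the bytes would have to be reversed before calling BitConverter, or the values assembled by hand with shifts.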
After that we load the data structure holding the piece table into the byte array clx:
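A minimal sketch of this step, under the assumption that the whole table stream (0Table or 1Table) has already been read into memory and that the offset and length come from the FC/LCB pair in the FIB. LoadClx is a hypothetical helper name, and the code is written as C# top-level statements (C# 9 or later).

```csharp
using System;

// Hypothetical helper: copies lcbClx bytes starting at offset fcClx out of
// the table stream. The bounds check guards against a damaged FC/LCB pair.
static byte[] LoadClx(byte[] tableStream, int fcClx, int lcbClx)
{
    if (fcClx < 0 || lcbClx < 0 || fcClx > tableStream.Length - lcbClx)
    {
        throw new ArgumentException("FC/LCB pair points outside the table stream.");
    }

    byte[] clx = new byte[lcbClx];
    Array.Copy(tableStream, fcClx, clx, 0, lcbClx);
    return clx;
}

// Made-up six-byte "stream": take 3 bytes starting at offset 2.
byte[] sampleStream = { 0, 1, 2, 3, 4, 5 };
byte[] slice = LoadClx(sampleStream, 2, 3);
Console.WriteLine(BitConverter.ToString(slice));  // prints "02-03-04"
```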
The clx byte array can contain multiple substructures, only one of which is the piece
table. Each substructure starts with a byte which denotes the type of the substructure, followed by a
value indicating the length of the substructure.
If the substructure describes a piece table, the value of this byte is 2, otherwise 1. The length of the
entry is a 32 bit value for a piece table and a 16 bit value for all other entries.
In order to identify the piece table in the clx byte array, the following algorithm can be used:
int pos = 0;
bool goOn = true;
byte[] pieceTable = null;
while (goOn && pos < clx.Length)
{
    byte typeEntry = clx[pos];
    if (typeEntry == 2)
    {
        // this entry is the piece table
        goOn = false;
        Int32 lcbPieceTable = System.BitConverter.ToInt32(clx, pos + 1);
        pieceTable = new byte[lcbPieceTable];
        Array.Copy(clx, pos + 5, pieceTable, 0, pieceTable.Length);
    }
    else if (typeEntry == 1)
    {
        // skip this entry: type byte, 16 bit length, then the data
        Int16 cb = System.BitConverter.ToInt16(clx, pos + 1);
        pos = pos + 1 + 2 + cb;
    }
    else
    {
        // unknown entry type: stop searching
        goOn = false;
    }
}
The piece table consists of two arrays.
The first array contains n+1 logical character positions (n is the number of pieces). The
entries are the logical start and end positions of the pieces in the text sequence, i.e. the first
piece starts at the position given by the first entry and extends to the position given by the
second entry, the second piece starts at the position given by the second entry,
etc. Logical position x means that this is the x-th character in the document, i.e. this is not the
file character position in the WordDocument stream. The positions are 32 bit values.
The second array contains n piece descriptor structures. Each structure has a length of 8
bytes. Bytes 3 to 6 of these 8 bytes (counting from 1) hold both the physical location of the
piece inside the WordDocument stream and the encoding of its text. This file character
(FC) position is a 32 bit integer value. The second most significant bit is a flag that specifies
the encoding of the piece: if the bit is set, the piece is CP1252-encoded and the remaining bits
hold twice the byte offset of the piece, so the value must be divided by 2; otherwise, the piece
is Unicode-encoded and the FC is a byte pointer.
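The layout of the two arrays can be sketched as follows. This is an illustration under the assumptions stated above (32 bit positions, 8 byte descriptors, FC at bytes 3 to 6 of each descriptor), not the original listing. The helper names CountPieces, ReadPosition and ReadRawFc and all sample values are made up. The code is written as C# top-level statements (C# 9 or later) and assumes a little-endian machine.

```csharp
using System;

// n+1 positions of 4 bytes plus n descriptors of 8 bytes give a table size
// of 4 * (n + 1) + 8 * n = 12 * n + 4 bytes, which is solved for n here.
static int CountPieces(int pieceTableLength)
{
    return (pieceTableLength - 4) / 12;
}

// Entry "index" of the first array. Piece i starts at entry i and
// ends at entry i + 1.
static int ReadPosition(byte[] table, int index)
{
    return BitConverter.ToInt32(table, index * 4);
}

// Raw 32 bit FC value of a piece, still including the encoding flag.
// The descriptors follow the n+1 positions; the FC sits 2 bytes into each.
static uint ReadRawFc(byte[] table, int pieceCount, int piece)
{
    int descriptor = (pieceCount + 1) * 4 + piece * 8;
    return BitConverter.ToUInt32(table, descriptor + 2);
}

// Made-up piece table with two pieces.
byte[] sampleTable = new byte[2 * 12 + 4];
BitConverter.GetBytes(0).CopyTo(sampleTable, 0);    // first array, entry 0
BitConverter.GetBytes(11).CopyTo(sampleTable, 4);   // first array, entry 1
BitConverter.GetBytes(20).CopyTo(sampleTable, 8);   // first array, entry 2
BitConverter.GetBytes(0x40000800u).CopyTo(sampleTable, 12 + 2);  // piece 0
BitConverter.GetBytes(0x00000420u).CopyTo(sampleTable, 20 + 2);  // piece 1

int pieces = CountPieces(sampleTable.Length);
for (int p = 0; p < pieces; p++)
{
    Console.WriteLine("piece {0}: positions {1}..{2}, raw FC 0x{3:X8}",
        p, ReadPosition(sampleTable, p), ReadPosition(sampleTable, p + 1),
        ReadRawFc(sampleTable, pieces, p));
}
```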
The interpretation of the encoding flag and the calculation of the FC pointer are as follows:
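A minimal sketch of this step, not the original listing. DecodeFc is a hypothetical helper name. The division by 2 for CP1252 pieces follows the published description of the Word binary format and should be treated as an assumption to verify against real files. The code is written as C# top-level statements (C# 9 or later).

```csharp
using System;

// Hypothetical helper: splits the raw 32 bit FC value into the encoding
// flag and the byte offset of the piece in the WordDocument stream.
static int DecodeFc(uint rawFc, out bool isCp1252)
{
    const uint EncodingFlag = 0x40000000;  // second most significant bit
    isCp1252 = (rawFc & EncodingFlag) != 0;

    if (isCp1252)
    {
        // Clear the flag, then halve: with the flag set, the stored
        // value is assumed to be twice the byte offset.
        return (int)((rawFc & ~EncodingFlag) / 2);
    }

    // Unicode piece: the value already is the byte offset.
    return (int)rawFc;
}

// Made-up raw values, one of each kind.
int firstOffset = DecodeFc(0x40000800u, out bool firstIsCp1252);
int secondOffset = DecodeFc(0x00000420u, out bool secondIsCp1252);
Console.WriteLine("0x{0:X}, CP1252: {1}", firstOffset, firstIsCp1252);    // prints "0x400, CP1252: True"
Console.WriteLine("0x{0:X}, CP1252: {1}", secondOffset, secondIsCp1252);  // prints "0x420, CP1252: False"
```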
Now, the text which is described by that piece can be appended to our text string as follows:
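A minimal sketch of this step, not the original listing. It assumes that the WordDocument stream is in memory, that the byte offset and encoding of the piece are already known, and that the character count of the piece is the difference between its end and start positions. AppendPiece is a hypothetical helper name. CP1252 is decoded with a small lookup table instead of Encoding.GetEncoding(1252), because that code page is not available on every .NET runtime without extra setup. The code is written as C# top-level statements (C# 9 or later).

```csharp
using System;
using System.Text;

// Hypothetical helper: decodes one piece and appends it to the text.
static void AppendPiece(StringBuilder text, byte[] wordDocument,
                        int fc, int charCount, bool isCp1252)
{
    if (!isCp1252)
    {
        // Unicode piece: two bytes per character, little endian.
        text.Append(Encoding.Unicode.GetString(wordDocument, fc, charCount * 2));
        return;
    }

    // CP1252 piece: one byte per character. Bytes 0x80..0x9F need a
    // lookup; every other byte equals its Unicode code point.
    const string High =
        "\u20AC\u0081\u201A\u0192\u201E\u2026\u2020\u2021" +
        "\u02C6\u2030\u0160\u2039\u0152\u008D\u017D\u008F" +
        "\u0090\u2018\u2019\u201C\u201D\u2022\u2013\u2014" +
        "\u02DC\u2122\u0161\u203A\u0153\u009D\u017E\u0178";

    for (int i = 0; i < charCount; i++)
    {
        byte b = wordDocument[fc + i];
        text.Append(b >= 0x80 && b <= 0x9F ? High[b - 0x80] : (char)b);
    }
}

// Made-up stream: "Hi" plus an en dash (0x96) in CP1252, followed by the
// Greek letters alpha, beta, gamma in Unicode.
byte[] sampleBytes = { 0x48, 0x69, 0x96, 0xB1, 0x03, 0xB2, 0x03, 0xB3, 0x03 };
StringBuilder result = new StringBuilder();
AppendPiece(result, sampleBytes, 0, 3, true);   // CP1252 piece, 3 characters
AppendPiece(result, sampleBytes, 3, 3, false);  // Unicode piece, 3 characters
Console.WriteLine(result.Length);  // prints "6"
```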
Iterating over all the pieces in the piece table will finally append all the text in the document to our
text string.
Note: It should be borne in mind that the text of header and footer lines, footnotes, endnotes, etc. is
also stored inside the text part of the WordDocument stream; consequently, our text string will also
contain the text of these elements.