How to Retrieve Text from a Binary .doc File
DIaLOGIKa/makz/math/wk/divo 4 March 2008

Contents
Introduction
How Text is Stored in a .doc File
Streams in a .doc File
The piece table
Algorithm for Retrieving Text
Loading the FIB
Loading and Parsing the piece table

Introduction
This document explains how the text content in a binary .doc file can be identified and extracted.

The following descriptions and algorithms are based on the Microsoft Word 2007 Binary File Format
Documentation and our own findings.

How Text is Stored in a .doc File

The binary .doc files use the structured storage format (aka compound binary file format) to save
several streams in one file. The text and its attributes are stored in two of these streams.

Note: For information about the structured storage format, see the “Windows
Compound Binary File Format Specification”.

Streams in a .doc File

A .doc file contains a stream called WordDocument consisting of a header part and a text part. The
header, called the File Information Block or FIB, contains information about the document and pointers
into the text part and into other streams. The text part contains all text of the document (including
footnotes, header and footer lines, etc.), but not necessarily consecutively, i.e. it might be
fragmented.

Another stream, called 0Table or 1Table (a flag in the FIB determines which name is actually used for
this stream, see below), contains information about the fragmentation of the text part. This
information is called the piece table.

The piece table

The piece table is a data structure that describes the logical sequence of characters in the document:
The text part in the WordDocument stream can be divided into several subparts or pieces. For each
piece, the piece table records the encoding and the logical place in the text.

Word splits the text into several pieces if different encodings are used for different paragraphs or
text runs, for example:

Hello World
…some more CP1252 text…
αβγ – Greek is nice!

The text in this example would be divided into two pieces:

The first piece contains all the characters up to the Greek characters and is encoded
using the codepage 1252.
The second piece contains the Greek and remaining characters and has Unicode encoding.

Note: We assume that Word switches between CP1252 and Unicode encoding for
optimization reasons. If the CP1252 text part is shorter than a certain (unknown) threshold, all
characters are Unicode-encoded.

Algorithm for Retrieving Text

To retrieve text from a document, the piece table has to be loaded and parsed and the pieces
assembled in the right sequence.

The following algorithm assembles the text of the document and saves it in a string variable:

string text = "";

Loading the FIB

The FIB has a fixed length of 1472 bytes (Word 2003 and higher; earlier versions might have a smaller
FIB) and starts at the first byte in the WordDocument stream of the .doc file. For the following
examples, we assume that the streams of a .doc file can be treated as C# streams. (We actually use
our own assembly, called StructuredStorageReader, to access the structured storage file format.
This assembly provides a stream-like object.)

Stream wordDocumentStream = new Stream("WordDocument");

byte[] fib = new byte[1472];
wordDocumentStream.Read(fib, 0, 1472);

Loading and Parsing the piece table

The FIB contains two variables (FC/LCB pair) that specify the beginning of the data structure holding
the piece table (FC = file character position) and the length (LCB = long count of bytes). This
information is placed in the table stream, i.e. the FC is a pointer to an address in that stream. Both
values are 32 bit integer values and are stored at offset 0x01A2 and offset 0x01A6, respectively.

Note: The Word file format specification describes this FC/LCB pair as “Offset in table stream of
beginning of information for complex files.” It should be borne in mind that this is not correct since
the piece table information is also used for non-complex files.

UInt32 fcClx = System.BitConverter.ToUInt32(fib, 0x01A2);

UInt32 lcbClx = System.BitConverter.ToUInt32(fib, 0x01A6);

To load the piece table we must determine how the table stream was named by Word. The FIB
contains a flag which decides if the stream was saved as 0Table or as 1Table. Bit 0x0200 in word
0x000A of the FIB determines how the table stream is named:

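// the flag sits in a 16 bit word, so both bytes at offset 0x000A must be read
bool flag1Table = ( (System.BitConverter.ToUInt16(fib, 0x000A) & 0x0200) == 0x0200);

string tableStreamName = "0Table";
if(flag1Table)
    tableStreamName = "1Table";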

Excerpt from a hex dump of the FIB:

(0000) EC A5 C1 00 7D 80 09 04 00 00 F0 12 BF 00 00 00 ....}...........
(0010) 00 00 00 10 00 00 00 00 00 08 00 00 DE 4C 00 00 .............L..
…
(0190) 00 00 A7 19 00 00 A2 02 00 00 49 1C 00 00 74 00 ..........I...t.
(01A0) 00 00 96 17 00 00 2D 00 00 00 00 00 00 00 00 00 ......-.........
(01B0) 00 00 00 00 00 00 00 00 00 00 55 15 00 00 00 00 ..........U.....
…

Offset 0x000A (F0 12): 0x12F0 AND 0x0200: 1Table is used
Offset 0x01A2 (96 17 00 00): 0x00001796: complex information position in 1Table
Offset 0x01A6 (2D 00 00 00): 0x0000002D: complex information length in 1Table

Note: All the number values shown in this and the following hex dumps are in swapped byte order
(“little endian”).
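
As a worked check of the offsets discussed above, the following minimal sketch fills only the bytes visible in the hex dump excerpt into an otherwise zero-filled 1472-byte array and reads the three FIB fields back. The names FibFieldsSketch, BuildSampleFib and Read are hypothetical and not part of our assembly; BitConverter is assumed to run on a little-endian host, as in the other snippets of this document.

```csharp
using System;

static class FibFieldsSketch
{
    // Zero-filled stand-in for a FIB; only the bytes visible in the hex dump
    // excerpt are filled in, everything else is a placeholder.
    public static byte[] BuildSampleFib()
    {
        byte[] fib = new byte[1472];
        fib[0x000A] = 0xF0; fib[0x000B] = 0x12;   // flag word 0x12F0
        fib[0x01A2] = 0x96; fib[0x01A3] = 0x17;   // position 0x00001796
        fib[0x01A6] = 0x2D;                       // length 0x0000002D
        return fib;
    }

    public static (uint FcClx, uint LcbClx, string TableStreamName) Read(byte[] fib)
    {
        uint fcClx = BitConverter.ToUInt32(fib, 0x01A2);
        uint lcbClx = BitConverter.ToUInt32(fib, 0x01A6);
        // The flag lives in a 16 bit word, so both bytes must be read.
        bool flag1Table = (BitConverter.ToUInt16(fib, 0x000A) & 0x0200) != 0;
        return (fcClx, lcbClx, flag1Table ? "1Table" : "0Table");
    }

    static void Main()
    {
        var fields = Read(BuildSampleFib());
        // prints "1Table: position 0x1796, length 0x2D"
        Console.WriteLine($"{fields.TableStreamName}: position 0x{fields.FcClx:X}, length 0x{fields.LcbClx:X}");
    }
}
```

Reading the flag as a 16 bit word matters here: the byte at 0x000A alone is 0xF0, and a single byte can never have bit 0x0200 set.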

After that we load the data structure holding the piece table into the byte array clx:

Stream tableStream = new Stream(tableStreamName);

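byte[] clx = new byte[lcbClx];
// Read() takes an offset into the buffer, so the stream position is set first
tableStream.Seek(fcClx, SeekOrigin.Begin);
tableStream.Read(clx, 0, (int)lcbClx);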

The clx byte array can contain multiple substructures and only one of these substructures is the piece
table. Each substructure starts with a byte which denotes the type of the substructure, followed by a
value indicating the length of the substructure.

If the substructure describes a piece table the value of this byte is 2, otherwise 1. The length of the
entry is a 32 bit value for a piece table and, according to the Microsoft file format documentation, a
16 bit value for all other entries.

In order to identify the piece table in the clx byte array, the following algorithm can be used:

int pos = 0;
bool goOn = true;
//declared outside the loop because the piece table is needed afterwards
Int32 lcbPieceTable = 0;
byte[] pieceTable = null;
while (goOn && pos < clx.Length)
{
    byte typeEntry = clx[pos];

    if (typeEntry == 2)
    {
        //this entry is the piece table
        goOn = false;
        lcbPieceTable = System.BitConverter.ToInt32(clx, pos + 1);
        pieceTable = new byte[lcbPieceTable];
        Array.Copy(clx, pos + 5, pieceTable, 0, pieceTable.Length);
    }
    else if (typeEntry == 1)
    {
        //skip this entry: type byte, 16 bit length, data
        pos = pos + 1 + 2 + System.BitConverter.ToUInt16(clx, pos + 1);
    }
    else
    {
        goOn = false;
    }
}
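
The search can also be wrapped in a function that checks its bounds before every read, so that a truncated clx array yields null instead of an exception. This is only a sketch: ClxSketch and FindPieceTable are hypothetical names, the sample bytes in Main are made up, and the 16 bit length of type 1 entries is an assumption taken from the Microsoft file format documentation.

```csharp
using System;

static class ClxSketch
{
    // Returns a copy of the piece table bytes, or null if the clx data is
    // malformed or contains no piece table entry.
    public static byte[] FindPieceTable(byte[] clx)
    {
        int pos = 0;
        while (pos < clx.Length)
        {
            byte typeEntry = clx[pos];
            if (typeEntry == 2)
            {
                if (clx.Length - pos < 5) return null;   // length field missing
                int length = BitConverter.ToInt32(clx, pos + 1);
                if (length < 0 || length > clx.Length - pos - 5) return null;
                byte[] pieceTable = new byte[length];
                Array.Copy(clx, pos + 5, pieceTable, 0, length);
                return pieceTable;
            }
            if (typeEntry != 1 || clx.Length - pos < 3) return null;
            // Assumption: type byte, 16 bit length, then that many data bytes.
            pos += 1 + 2 + BitConverter.ToUInt16(clx, pos + 1);
        }
        return null;
    }

    static void Main()
    {
        byte[] clx =
        {
            0x01, 0x03, 0x00, 0xAA, 0xBB, 0xCC,   // made-up type 1 entry, 3 data bytes
            0x02, 0x04, 0x00, 0x00, 0x00,         // piece table entry, length 4
            0x00, 0x00, 0x00, 0x00                // piece table content (no pieces)
        };
        byte[] pieceTable = FindPieceTable(clx);
        // prints "piece table bytes: 4"
        Console.WriteLine(pieceTable == null ? "not found" : "piece table bytes: " + pieceTable.Length);
    }
}
```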

The piece table itself contains two arrays:

The first array contains n+1 logical character positions (n is the number of pieces). The
entries are the logical start and end positions of the pieces in the text sequence, i.e. the first
piece starts at the position in entry 1 and extends to the position in entry 2, the second piece
starts at the position in entry 2, etc. Logical position x means that this is the x-th character in
the document, i.e. this is not the file character position in the WordDocument stream. The
positions are 32 bit values.
The second array contains n piece descriptor structures. Each structure has a length of 8
bytes. The physical location of the piece inside of the WordDocument stream and the
encoding of the text can be found in these 8 bytes from byte 3 to byte 6 (counted from 1, i.e. at
offsets 2 to 5). This file character (FC) position is a 32 bit integer value. The second most
significant bit is a flag that specifies the encoding of the piece: if the bit is set, the piece is
CP1252-encoded and the remaining value is twice the byte offset of the piece, so it has to be
divided by 2; otherwise, the piece is Unicode-encoded and the FC is a byte pointer.

Hex dump from a 1Table stream containing a piece table with 3 pieces:

…
(1780) 59 03 01 00 01 00 00 00 00 00 00 00 00 00 00 00 Y...............
(1790) 00 00 00 00 00 00 02 28 00 00 00 00 00 00 00 00 .......(........
(17A0) 3C 00 00 00 3D 00 00 6F 3D 00 00 70 00 00 10 00 <...=..o=..p....
(17B0) 40 00 00 70 00 00 44 00 00 00 00 70 00 00 4C 00 @..p..D....p..L.
(17C0) 00 00 00 FF FF 01 00 00 00 07 00 55 00 6E 00 6B ...........U.n.k
(17D0) 00 6E 00 6F 00 77 00 6E 00 FF FF 01 00 08 00 00 .n.o.w.n........
…

Offset 0x1796 (02): piece table ID
Offset 0x1797 (28 00 00 00): length of piece table 0x00000028
Offset 0x179B (00 00 00 00): logical start of piece 1
Offset 0x179F (00 3C 00 00): 0x00003C00: logical end of piece 1 and start of piece 2
Offset 0x17AD (00 10 00 40): 0x40001000: CP1252 encoding (0x4) and position 0x1000/2

int pieceCount = (lcbPieceTable - 4) / 12;

for (int i = 0; i < pieceCount; i++)
{
    //get the position
    Int32 cpStart = System.BitConverter.ToInt32(pieceTable, i * 4);
    Int32 cpEnd = System.BitConverter.ToInt32(pieceTable, (i+1) * 4);

    //get the descriptor
    byte[] pieceDescriptor = new byte[8];
    int offsetPieceDescriptor = ((pieceCount +1)*4) + (i*8);
    Array.Copy(pieceTable, offsetPieceDescriptor, pieceDescriptor, 0, 8);

    //the steps described in the following are carried out here for each piece
}
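
To see the two arrays side by side, the following sketch applies the same offset arithmetic to the 40 piece table bytes from the 1Table hex dump above (0x179B to 0x17C2). PieceTableSketch and Split are hypothetical names, and a little-endian host is assumed for BitConverter.

```csharp
using System;

static class PieceTableSketch
{
    // The 40 piece table bytes from the 1Table dump (0x179B to 0x17C2).
    public static readonly byte[] SamplePieceTable =
    {
        0x00, 0x00, 0x00, 0x00,  0x00, 0x3C, 0x00, 0x00,   // positions 1 and 2
        0x00, 0x3D, 0x00, 0x00,  0x6F, 0x3D, 0x00, 0x00,   // positions 3 and 4
        0x70, 0x00, 0x00, 0x10, 0x00, 0x40, 0x00, 0x00,    // descriptor 1
        0x70, 0x00, 0x00, 0x44, 0x00, 0x00, 0x00, 0x00,    // descriptor 2
        0x70, 0x00, 0x00, 0x4C, 0x00, 0x00, 0x00, 0x00     // descriptor 3
    };

    // Pairs each piece's logical range with the raw 32 bit FC value
    // stored at offset 2 of its descriptor.
    public static (int CpStart, int CpEnd, uint FcValue)[] Split(byte[] pieceTable)
    {
        int pieceCount = (pieceTable.Length - 4) / 12;
        var pieces = new (int, int, uint)[pieceCount];
        for (int i = 0; i < pieceCount; i++)
        {
            int cpStart = BitConverter.ToInt32(pieceTable, i * 4);
            int cpEnd = BitConverter.ToInt32(pieceTable, (i + 1) * 4);
            int offsetPieceDescriptor = (pieceCount + 1) * 4 + i * 8;
            uint fcValue = BitConverter.ToUInt32(pieceTable, offsetPieceDescriptor + 2);
            pieces[i] = (cpStart, cpEnd, fcValue);
        }
        return pieces;
    }

    static void Main()
    {
        foreach (var p in Split(SamplePieceTable))
            Console.WriteLine($"cp 0x{p.CpStart:X}..0x{p.CpEnd:X}, raw FC 0x{p.FcValue:X8}");
    }
}
```

For the sample bytes this yields three pieces: 0x0 to 0x3C00 with raw FC 0x40001000, 0x3C00 to 0x3D00 with raw FC 0x00004400, and 0x3D00 to 0x3D6F with raw FC 0x00004C00. The piece count follows from the table length: (40 - 4) / 12 = 3.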

The interpretation of the encoding flag and the calculation of the FC pointer are as follows:

UInt32 fcValue = System.BitConverter.ToUInt32(pieceDescriptor, 2);

bool isANSI = ( (fcValue & 0x40000000) == 0x40000000);
Int32 fc = (Int32)(fcValue & 0xBFFFFFFF);

Encoding encoding = Encoding.GetEncoding(1252);

Int32 cb = cpEnd - cpStart;
if (isANSI)
{
    //for CP1252 pieces the stored value is twice the byte offset
    fc /= 2;
}
else
{
    encoding = Encoding.Unicode;
    cb *= 2;
}
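
As a worked example, the sketch below turns a raw FC value and a logical range into a byte offset and a byte count, using the first two pieces of the 1Table dump. FcSketch and Locate are hypothetical names. Halving the position for CP1252 pieces follows the dump annotation "position 0x1000/2".

```csharp
using System;

static class FcSketch
{
    // Returns where a piece starts in the WordDocument stream, how many
    // bytes it occupies, and whether it is CP1252-encoded.
    public static (int Offset, int ByteCount, bool IsAnsi) Locate(uint fcValue, int cpStart, int cpEnd)
    {
        bool isAnsi = (fcValue & 0x40000000) != 0;
        int fc = (int)(fcValue & 0xBFFFFFFF);   // clear the encoding flag
        int charCount = cpEnd - cpStart;
        return isAnsi
            ? (fc / 2, charCount, true)         // one byte per character
            : (fc, charCount * 2, false);       // two bytes per character
    }

    static void Main()
    {
        var first = Locate(0x40001000, 0x0000, 0x3C00);
        var second = Locate(0x00004400, 0x3C00, 0x3D00);
        Console.WriteLine($"piece 1: offset 0x{first.Offset:X}, {first.ByteCount} bytes, ANSI: {first.IsAnsi}");
        Console.WriteLine($"piece 2: offset 0x{second.Offset:X}, {second.ByteCount} bytes, ANSI: {second.IsAnsi}");
    }
}
```

For the dump values the first piece occupies 0x3C00 bytes from offset 0x800, so it ends at 0x4400, which is exactly where the second (Unicode) piece begins.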

Now, the text which is described by that piece can be appended to our text string as follows:

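byte[] bytesOfText = new byte[cb];
//Read() takes an offset into the buffer, so the stream position is set first
wordDocumentStream.Seek(fc, SeekOrigin.Begin);
wordDocumentStream.Read(bytesOfText, 0, bytesOfText.Length);
text += encoding.GetString(bytesOfText);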

Iterating over all the pieces in the piece table will finally append all the text in the document to our
text string.

Note: It should be borne in mind that the text of header and footer lines, footnotes, endnotes, etc. is
also stored inside the text part of the WordDocument stream; consequently, our text string will also
contain the text of these elements.
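
The steps can be put together in a compact end-to-end sketch. It works on plain byte arrays instead of streams and skips the FIB and clx handling; the sample data in Demo is invented (one CP1252 piece and one Unicode piece). PieceTextSketch, ExtractText and Demo are hypothetical names, and the code assumes .NET 5 or later (where CodePagesEncodingProvider ships with the runtime) and a little-endian host.

```csharp
using System;
using System.Text;

static class PieceTextSketch
{
    public static string ExtractText(byte[] wordDocument, byte[] pieceTable)
    {
        // Code page 1252 is not available by default on .NET 5 and later,
        // so the code page provider is registered first.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Encoding ansi = Encoding.GetEncoding(1252);

        int pieceCount = (pieceTable.Length - 4) / 12;
        var text = new StringBuilder();
        for (int i = 0; i < pieceCount; i++)
        {
            int cpStart = BitConverter.ToInt32(pieceTable, i * 4);
            int cpEnd = BitConverter.ToInt32(pieceTable, (i + 1) * 4);
            int offsetPieceDescriptor = (pieceCount + 1) * 4 + i * 8;
            uint fcValue = BitConverter.ToUInt32(pieceTable, offsetPieceDescriptor + 2);

            bool isAnsi = (fcValue & 0x40000000) != 0;
            int fc = (int)(fcValue & 0xBFFFFFFF);
            int charCount = cpEnd - cpStart;
            text.Append(isAnsi
                ? ansi.GetString(wordDocument, fc / 2, charCount)
                : Encoding.Unicode.GetString(wordDocument, fc, charCount * 2));
        }
        return text.ToString();
    }

    // Builds invented sample data: "Hello " as CP1252 at offset 0x10 and
    // three Greek letters as Unicode at offset 0x20.
    public static string Demo()
    {
        byte[] doc = new byte[0x40];
        Encoding.ASCII.GetBytes("Hello ").CopyTo(doc, 0x10);
        Encoding.Unicode.GetBytes("\u03B1\u03B2\u03B3").CopyTo(doc, 0x20);

        byte[] table = new byte[4 + 2 * 12];                       // 3 positions, 2 descriptors
        BitConverter.GetBytes(0).CopyTo(table, 0);
        BitConverter.GetBytes(6).CopyTo(table, 4);
        BitConverter.GetBytes(9).CopyTo(table, 8);
        BitConverter.GetBytes(0x40000000u | (0x10u * 2)).CopyTo(table, 12 + 2);
        BitConverter.GetBytes(0x20u).CopyTo(table, 20 + 2);
        return ExtractText(doc, table);
    }

    static void Main() => Console.WriteLine(Demo());
}
```

A StringBuilder is used instead of repeated string concatenation because a document can consist of many pieces. Demo returns "Hello " followed by the three Greek letters alpha, beta and gamma; how they are displayed depends on the console encoding.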
