
Michael J. Folk
Bill Zoellick

File Structures
Second Edition

Digitized by the Internet Archive in 2011
http://www.archive.org/details/filestructuresOOfolk

File Structures
SECOND EDITION

MICHAEL J. FOLK
University of Illinois

BILL ZOELLICK
Avalanche Development Company

Addison-Wesley Publishing Company, Inc.
Reading, Massachusetts   Menlo Park, California   New York
Don Mills, Ontario   Wokingham, England   Amsterdam   Bonn
Sydney   Singapore   Tokyo   Madrid   San Juan   Milan   Paris

Sponsoring Editor           Peter Shepard
Production Administrator    Juliet Silveri
Copyeditor                  Patricia Daly
Text Designer               Melinda Grosser for silk
Cover Designer              Peter Blaiwas
Technical Art Consultant    Dick Morton
Illustrator                 Scot Graphics
Manufacturing Supervisor    Roy Logan

Photographs on pages 126 and 187 courtesy of S. Sukumar. Figure 10.7 on page 470 courtesy of International Business Machines Corporation.

Library of Congress Cataloging-in-Publication Data

Folk, Michael J.
    File structures / Michael J. Folk, Bill Zoellick. -- 2nd ed.
        p. cm.
    Includes bibliographical references and index.
    ISBN 0-201-55713-4
    1. File organization (Computer science)   I. Zoellick, Bill.   II. Title.
    QA76.9.F5F65  1992
    005.74 dc20                                               91-16314
                                                                   CIP

The programs and applications presented in this book have been included for their instructional value. They have been tested with care but are not guaranteed for any particular purpose. The publisher does not offer any warranties or representations, nor does it accept any liabilities with respect to the programs or applications.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and Addison-Wesley was aware of a trademark claim, the designations have been printed in initial caps or all caps.

Copyright © 1992 by Addison-Wesley Publishing Company, Inc.

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the publisher. Printed in the United States of America.

1 2 3 4 5 6 7 8 9 10-DO-95 94 93 92 91

To Pauline and Rachel
and
To Karen, Joshua, and Peter

Preface

We wrote the first edition to promote file structure literacy. Literacy implies familiarity with the tools used to organize files. It also means knowing the story of how the different tools have evolved. Knowing the story is the basis for using the tools appropriately.

The first edition told the story of file structures up to about 1980. This second edition continues the story, examining developments such as extendible hashing and optical disc storage that have moved from being a research topic at the start of the last decade to mature technology by its end.

While the history of file structures provides the key organizing principle for much of this text, we also find ourselves compelled, particularly in this second edition, to attend to developments in computing hardware and system software. In the last twenty years computers have evolved from being expensive monoliths, maintained by a priesthood of specialists, to being appliances as ubiquitous as toasters. No longer do we need to confront a corps of analysts to get information in and out of a computer. We do it ourselves. Today, more often than yesterday, programmers design and build their own file structures.

This text shows you how to design and build efficient file structures. All you need is a good programming language, a good operating system, and the conceptual tools that enable you to think through alternative file structure designs that apply to the task at hand. The first six chapters of this book give you the basic tools to design simple file structures from the ground up. We provide examples of program code and, if you are a UNIX user, we show you, whenever possible, how to use this operating system to help with much of the work. Building on the first six chapters of foundation work, the last five chapters introduce you to the most important high-level file structure designs, including sequential access, B-trees and B+ trees, and hashing and extendible hashing.

The last ten years of development in software design are reason enough for this second edition, but we have also used this edition to discuss the decreased cost and increased availability of computer storage hardware. For instance, one of the most dramatic changes in computer configurations over the past decade is the increase in the amount of available RAM on computers of all sizes. In 1986, when we completed the first edition of this book, it was rare that a personal computer had more than 640 Kbytes of RAM. Now, even for many mundane applications, four Mbytes is common, and sometimes even mandatory. A decade ago, a sophisticated mainframe system that was used extensively for sorting large files typically had two to four Mbytes of primary memory; now 32 to 64 Mbytes is common on workstations, and there are some computers with several gigabytes of RAM.

When more RAM is available, we can approach file structures problems differently. For example, most earlier file structure texts deal with the sorting of large files assuming that RAM is scarce when sorting; sorting is always done on tape. Now that RAM is much cheaper and more readily available, sorting on disk is much more viable than it is on tape; indeed, sorting on disk is not only viable, it is usually preferable. One reason for this second edition is that it reflects this change and others that arise from changes in computer hardware.

Using the Book as a College Text

The first edition has been used extensively as a text for many different kinds of students in many different kinds of universities. Because the book is quite readable, students typically are expected to read the entire book over the course of a semester. The text covers the basics; class lectures can expand and supplement the material presented in the text. The lecturer is free to explore more complex topics and applications, relying on the text to supply the fundamentals.

A word of caution: It is easy to spend too much time on the low-level issues presented in the first six chapters. Move quickly through this material. The relatively large number of pages devoted to these matters is not a reflection of the percentage of the course that should be spent on them. The intent, instead, is to provide thorough coverage in the text so that the instructor can simply assign these chapters as background reading, saving precious lecture time for more important topics.

It is important to get students involved in writing file processing programs early in the semester. Consider starting with a file reading and writing assignment that is due after the first week of class. The inclusion in the text of sample programs in both C and Pascal makes it easier to work in this hands-on style. We recommend that, by the time the students encounter the B-tree chapter, they should have already written programs that access a data set through a simple index structure. Since the students then already have first-hand experience with the fundamental organizational issues, it is possible for lectures to focus on the conceptual issues involved in B-tree design.

Finally, we suggest that instructors adhere to a close approximation of the sequence of topics used in the book, especially through the first six chapters. We have already stressed that we wrote the book so that it can be read from cover to cover. It is not a reference work. Instead, we develop ideas as we proceed from chapter to chapter. Skipping around in the book makes it difficult for students to follow this development.

A Book for Computing Professionals

Both authors used to teach, but we now design and write programs for a living. We wrote and revised this book with our colleagues in mind. The style is conversational; the intent is to provide a book that you can read over a number of evenings, coming away with a good sense of how to approach file structure design problems. If you are already familiar with basic file structure design concepts, skim through the first six chapters and begin reading about cosequential access methods in Chapter 7. Subsequent chapters introduce you to B-trees, B+ trees, hashing, and extendible hashing. These are key design tools for any practicing programmer who is building file structures. We have tried to present them in a way that is both thorough and readable.

If you are not already a serious UNIX user, the first seven chapters will give you a feel for why UNIX is a powerful environment in which to work with files. Similarly, several of the chapters provide an introduction to the use of UNIX with the material in the text. Also, if you need to build and access file structures similar to the ones in the text, you may be able to use these programs as a source code toolkit that you can adapt to your needs.

Finally, we know that an increasing number of computing professionals are confronted with the need to understand and use CD-ROM. Appendix A not only provides an example of how the design principles introduced in this text are applied to this important medium, but it also gives you a good introduction to the medium itself.

Acknowledgements

There are a number of people we would like to thank for help in preparing this second edition. Peter Shepard, our editor at Addison-Wesley, initiated the idea of a new edition, kept after us to get it done, and saw the production through to completion. We thank our reviewers: James Canning, Jan Carroll, Suzanne Dietrich, Terry Johnson, Theodore Norman, Gregory Riccardi, and Cliff Shaffer. We also thank Deebak Khanna for comments and suggestions for improving the code.

Since the publication of the first edition, we have received a great deal of feedback from readers. Their suggestions and contributions have had a major effect on this second edition, and in fact are largely responsible for our completely rewriting several of the chapters.

Colleagues with whom we work have also contributed to the second edition, many without knowing they were doing so. We are grateful to them for information, explanations, and ideas that have improved our own understanding of many of the topics covered in the book. These colleagues include Chin Chau Low, Tim Krauskopf, Joseph Hardin, Quincey Koziol, Carlos Donohue, S. Sukumar, Mike Page, and Lee Fife.

Thanks are still outstanding to people who contributed to the initial edition: Marilyn Aiken, Art Crotzer, Mark Dalton, Don Fisher, Huey Liu, Gail Meinert, and Jim Van Doren.

We thank J. S. Bach, whose magnificent contribution of music to work by makes this work possible.

Most important of all, we thank Pauline, Rachel, Karen, Joshua, and Peter for putting up with fathers and husbands who get up too early to write, are tired all day, and stay up too late at night to write some more. It's the price of fame.

Boulder, Colorado          B.Z.
Urbana, Illinois           M.F.

Contents

Introduction to File Structures
   1.1   The Heart of File Structure Design
   1.2   A Short History of File Structure Design
   1.3   A Conceptual Toolkit: File Structure Literacy
   Summary   Key Terms

Fundamental File Processing Operations
   2.1   Physical Files and Logical Files
   2.2   Opening Files
   2.3   Closing Files
   2.4   Reading and Writing
         2.4.1  Read and Write Functions
         2.4.2  A Program to Display the Contents of a File
         2.4.3  Detecting End-of-File
   2.5   Seeking
         2.5.1  Seeking in C
         2.5.2  Seeking in Pascal
   2.6   Special Characters in Files
   2.7   The UNIX Directory Structure
   2.8   Physical and Logical Files in UNIX
         2.8.1  Physical Devices as UNIX Files
         2.8.2  The Console, the Keyboard, and Standard Error
         2.8.3  I/O Redirection and Pipes
   2.9   File-related Header Files
   2.10  UNIX File System Commands
   Summary   Key Terms   Further Readings   Exercises

Secondary Storage and System Software
   3.1   Disks
         3.1.1  The Organization of Disks
         3.1.2  Estimating Capacities and Space Needs
         3.1.3  Organizing Tracks by Sector
         3.1.4  Organizing Tracks by Block
         3.1.5  Nondata Overhead
         3.1.6  The Cost of a Disk Access
         3.1.7  Effect of Block Size on Performance: A UNIX Example
         3.1.8  Disk as Bottleneck
   3.2   Magnetic Tape
         3.2.1  Organization of Data on Tapes
         3.2.2  Estimating Tape Length Requirements
         3.2.3  Estimating Data Transmission Times
         3.2.4  Tape Applications
   3.3   Disk versus Tape
   3.4   Storage as a Hierarchy
   3.5   A Journey of a Byte
         3.5.1  The File Manager
         3.5.2  The I/O Buffer
         3.5.3  The Byte Leaves RAM: The I/O Processor and Disk Controller
   3.6   Buffer Management
         3.6.1  Buffer Bottlenecks
         3.6.2  Buffering Strategies
   3.7   I/O in UNIX
         3.7.1  The Kernel
         3.7.2  Linking File Names to Files
         3.7.3  Normal Files, Special Files, and Sockets
         3.7.4  Block I/O
         3.7.5  Device Drivers
         3.7.6  The Kernel and Filesystems
         3.7.7  Magnetic Tape and UNIX
   Summary   Key Terms   Further Readings   Exercises

Fundamental File Structure Concepts
   4.1   Field and Record Organization
         4.1.1  A Stream File
         4.1.2  Field Structures
         4.1.3  Reading a Stream of Fields
         4.1.4  Record Structures
         4.1.5  A Record Structure That Uses a Length Indicator
         4.1.6  Mixing Numbers and Characters: Use of a File Dump
   4.2   Record Access
         4.2.1  Record Keys
         4.2.2  A Sequential Search
         4.2.3  UNIX Tools for Sequential Processing
         4.2.4  Direct Access
   4.3   More about Record Structures
         4.3.1  Choosing a Record Structure and Record Length
         4.3.2  Header Records
   4.4   File Access and File Organization
   4.5   Beyond Record Structures
         4.5.1  Abstract Data Models
         4.5.2  Headers and Self-Describing Files
         4.5.3  Metadata
         4.5.4  Color Raster Images
         4.5.5  Mixing Object Types in One File
         4.5.6  Object-oriented File Access
         4.5.7  Extensibility
   4.6   Portability and Standardization
         4.6.1  Factors Affecting Portability
         4.6.2  Achieving Portability
   Summary   Key Terms   Further Readings   Exercises   C Programs   Pascal Programs

Organizing Files for Performance
   5.1   Data Compression
         5.1.1  Using a Different Notation
         5.1.2  Suppressing Repeating Sequences
         5.1.3  Assigning Variable-length Codes
         5.1.4  Irreversible Compression Techniques
         5.1.5  Compression in UNIX
   5.2   Reclaiming Space in Files
         5.2.1  Record Deletion and Storage Compaction
         5.2.2  Deleting Fixed-length Records for Reclaiming Space Dynamically
         5.2.3  Deleting Variable-length Records
         5.2.4  Storage Fragmentation
         5.2.5  Placement Strategies
   5.3   Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching
         5.3.1  Finding Things in Simple Field and Record Files
         5.3.2  Search by Guessing: Binary Search
         5.3.3  Binary Search versus Sequential Search
         5.3.4  Sorting a Disk File in RAM
         5.3.5  The Limitations of Binary Searching and Internal Sorting
   5.4   Keysorting
         5.4.1  Description of the Method
         5.4.2  Limitations of the Keysort Method
         5.4.3  Another Solution: Why Bother to Write the File Back?
         5.4.4  Pinned Records
   Summary   Key Terms   Further Readings   Exercises

Indexing
   6.1   What Is an Index?
   6.2   A Simple Index with an Entry-Sequenced File
   6.3   Basic Operations on an Indexed, Entry-Sequenced File
   6.4   Indexes That Are Too Large to Hold in Memory
   6.5   Indexing to Provide Access by Multiple Keys
   6.6   Retrieval Using Combinations of Secondary Keys
   6.7   Improving the Secondary Index Structure: Inverted Lists
         6.7.1  A First Attempt at a Solution
         6.7.2  A Better Solution: Linking the List of References
   6.8   Selective Indexes
   6.9   Binding
   Summary   Key Terms   Further Readings   Exercises

Cosequential Processing and the Sorting of Large Files
   7.1   A Model for Implementing Cosequential Processes
         7.1.1  Matching Names in Two Lists
         7.1.2  Merging Two Lists
         7.1.3  Summary of the Cosequential Processing Model
   7.2   Application of the Model to a General Ledger Program
         7.2.1  The Problem
         7.2.2  Application of the Model to the Ledger Program
   7.3   Extension of the Model to Include Multiway Merging
         7.3.1  A K-way Merge Algorithm
         7.3.2  A Selection Tree for Merging Large Numbers of Lists
   7.4   A Second Look at Sorting in RAM
         7.4.1  Overlapping Processing and I/O: Heapsort
         7.4.2  Building the Heap while Reading in the File
         7.4.3  Sorting while Writing out to the File
   7.5   Merging as a Way of Sorting Large Files on Disk
         7.5.1  How Much Time Does a Merge Sort Take?
         7.5.2  Sorting a File That Is Ten Times Larger
         7.5.3  The Cost of Increasing the File Size
         7.5.4  Hardware-based Improvements
         7.5.5  Decreasing the Number of Seeks Using Multiple-step Merges
         7.5.6  Increasing Run Lengths Using Replacement Selection
         7.5.7  Replacement Selection Plus Multistep Merging
         7.5.8  Using Two Disk Drives with Replacement Selection
         7.5.9  More Drives? More Processors?
         7.5.10 Effects of Multiprogramming
         7.5.11 A Conceptual Toolkit for External Sorting
   7.6   Sorting Files on Tape
         7.6.1  The Balanced Merge
         7.6.2  The K-way Balanced Merge
         7.6.3  Multiphase Merges
         7.6.4  Tapes versus Disks for External Sorting
   7.7   Sort-Merge Packages
   7.8   Sorting and Cosequential Processing in UNIX
         7.8.1  Sorting and Merging in UNIX
         7.8.2  Cosequential Processing Utilities in UNIX
   Summary   Key Terms   Further Readings   Exercises

B-Trees and Other Tree-structured File Organizations
   8.1   Introduction: The Invention of the B-Tree
   8.2   Statement of the Problem
   8.3   Binary Search Trees as a Solution
   8.4   AVL Trees
   8.5   Paged Binary Trees
   8.6   The Problem with the Top-down Construction of Paged Trees
   8.7   B-Trees: Working up from the Bottom
   8.8   Splitting and Promoting
   8.9   Algorithms for B-Tree Searching and Insertion
   8.10  B-Tree Nomenclature
   8.11  Formal Definition of B-Tree Properties
   8.12  Worst-case Search Depth
   8.13  Deletion, Redistribution, and Concatenation
         8.13.1 Redistribution
   8.14  Redistribution during Insertion: A Way to Improve Storage Utilization
   8.15  B* Trees
   8.16  Buffering of Pages: Virtual B-Trees
         8.16.1 LRU Replacement
         8.16.2 Replacement Based on Page Height
         8.16.3 Importance of Virtual B-Trees
   8.17  Placement of Information Associated with the Key
   8.18  Variable-length Records and Keys
   Summary   Key Terms   Further Readings   Exercises
   C Programs to Insert Keys into a B-Tree   Pascal Programs to Insert Keys into a B-Tree

The B+ Tree Family and Indexed Sequential File Access
   9.1   Indexed Sequential Access
   9.2   Maintaining a Sequence Set
         9.2.1  The Use of Blocks
         9.2.2  Choice of Block Size
   9.3   Adding a Simple Index to the Sequence Set
   9.4   The Content of the Index: Separators Instead of Keys
   9.5   The Simple Prefix B+ Tree
   9.6   Simple Prefix B+ Tree Maintenance
         9.6.1  Changes Localized to Single Blocks in the Sequence Set
         9.6.2  Changes Involving Multiple Blocks in the Sequence Set
   9.7   Index Set Block Size
   9.8   Internal Structure of Index Set Blocks: A Variable-order B-Tree
   9.9   Loading a Simple Prefix B+ Tree
   9.10  B+ Trees
   9.11  B-Trees, B+ Trees, and Simple Prefix B+ Trees in Perspective
   Summary   Key Terms   Further Readings   Exercises

Hashing
   10.1  Introduction
         10.1.1 What Is Hashing?
         10.1.2 Collisions
   10.2  A Simple Hashing Algorithm
   10.3  Hashing Functions and Record Distributions
         10.3.1 Distributing Records among Addresses
         10.3.2 Some Other Hashing Methods
         10.3.3 Predicting the Distribution of Records
         10.3.4 Predicting Collisions for a Full File
   10.4  How Much Extra Memory Should Be Used?
         10.4.1 Packing Density
         10.4.2 Predicting Collisions for Different Packing Densities
   10.5  Collision Resolution by Progressive Overflow
         10.5.1 How Progressive Overflow Works
         10.5.2 Search Length
   10.6  Storing More Than One Record per Address: Buckets
         10.6.1 Effects of Buckets on Performance
         10.6.2 Implementation Issues
   10.7  Making Deletions
         10.7.1 Tombstones for Handling Deletions
         10.7.2 Implications of Tombstones for Insertions
         10.7.3 Effects of Deletions and Additions on Performance
   10.8  Other Collision Resolution Techniques
         10.8.1 Double Hashing
         10.8.2 Chained Progressive Overflow
         10.8.3 Chaining with a Separate Overflow Area
         10.8.4 Scatter Tables: Indexing Revisited
   10.9  Patterns of Record Access
   Summary   Key Terms   Further Readings   Exercises

Extendible Hashing
   11.1  Introduction
   11.2  How Extendible Hashing Works
         11.2.1 Tries
         11.2.2 Turning the Trie into a Directory
         11.2.3 Splitting to Handle Overflow
   11.3  Implementation
         11.3.1 Creating the Addresses
         11.3.2 Implementing the Top-level Operations
         11.3.3 Bucket and Directory Operations
         11.3.4 Implementation Summary
   11.4  Deletion
         11.4.1 Overview of the Deletion Process
         11.4.2 A Procedure for Finding Buddy Buckets
         11.4.3 Collapsing the Directory
         11.4.4 Implementing the Deletion Operations
         11.4.5 Summary of the Deletion Operation
   11.5  Extendible Hashing Performance
         11.5.1 Space Utilization for Buckets
         11.5.2 Space Utilization for the Directory
   11.6  Alternative Approaches
         11.6.1 Dynamic Hashing
         11.6.2 Linear Hashing
         11.6.3 Approaches to Controlling Splitting
   Summary   Key Terms   Further Readings   Exercises

Appendix A: File Structures on CD-ROM
   A.1   Using this Appendix
   A.2   Introduction to CD-ROM
         A.2.1  A Short History of CD-ROM
         A.2.2  CD-ROM as a File Structure Problem
   A.3   Physical Organization of CD-ROM
         A.3.1  Reading Pits and Lands
         A.3.2  CLV Instead of CAV
         A.3.3  Addressing
         A.3.4  Structure of a Sector
   A.4   CD-ROM Strengths and Weaknesses
         A.4.1  Seek Performance
         A.4.2  Data Transfer Rate
         A.4.3  Storage Capacity
         A.4.4  Read-Only Access
         A.4.5  Asymmetric Writing and Reading
   A.5   Tree Structures on CD-ROM
         A.5.1  Design Exercises
         A.5.2  Block Size
         A.5.3  Special Loading Procedures and Other Considerations
         A.5.4  Virtual Trees and Buffering Blocks
         A.5.5  Trees as Secondary Indexes on CD-ROM
   A.6   Hashed Files on CD-ROM
         A.6.1  Design Exercises
         A.6.2  Bucket Size
         A.6.3  How the Size of CD-ROM Helps
         A.6.4  Advantages of CD-ROM's Read-Only Status
   A.7   The CD-ROM File System
         A.7.1  The Problem
         A.7.2  Design Exercise
         A.7.3  A Hybrid Design
   Summary

Appendix B: ASCII Table

Appendix C: String Functions in Pascal (tools.prc): Functions and Procedures Used to Operate on strng

Appendix D: Comparing Disk Drives

Bibliography

Index

Introduction to File Structures

CHAPTER OBJECTIVES

- Introduce the primary design issues that characterize file structure design.
- Survey the history of file structure design, since tracing the developments in file structures teaches us much about how to design our own file structures.
- Introduce the notions of file structure literacy and of a conceptual toolkit for file structure design.

CHAPTER OUTLINE

1.1  The Heart of File Structure Design
1.2  A Short History of File Structure Design
1.3  A Conceptual Toolkit: File Structure Literacy

1.1  The Heart of File Structure Design

Disks are slow. They are also technological marvels, packing hundreds of megabytes on disks that can fit into a notebook computer. Only a few years ago, disks with that kind of capacity looked like small washing machines. However, relative to the other parts of a computer, disks are slow.

How slow? The time it takes to get information back from even relatively slow electronic random access memory (RAM) is about 120 nanoseconds, or 120 billionths of a second. Getting the same information from a typical disk might take 30 milliseconds, or 30 thousandths of a second. To understand the size of this difference, we need an analogy. Assume that RAM access is like finding something in the index of this book. Let's say that this local, book-in-hand access takes 20 seconds. Assume that disk access is like sending to a library for the information you cannot find here in this book. Given that our "RAM access" takes 20 seconds, how long does the "disk access" to the library take, keeping the ratio the same as that of a real RAM access and disk access? The disk access is a quarter of a million times longer than the RAM access. This means that getting information back from the library takes 5,000,000 seconds, or almost 58 days. Disks are very slow compared to RAM.
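The arithmetic behind the analogy is easy to check. The following small C program is our own illustration, not part of the text; it recomputes the ratio and the library-trip time from the figures quoted above: a 120-nanosecond RAM access, a 30-millisecond disk access, and the assumed 20-second look in the book's index.

    /* analogy.c -- recompute the RAM-versus-disk analogy used above.
    ** The disk/RAM ratio is 30 ms / 120 ns = 250,000, so a 20-second
    ** "book lookup" scales to about 5,000,000 seconds at the library.
    */
    #include <stdio.h>

    int main( )
    {
        double ram_access  = 120e-9;   /* 120 nanoseconds                 */
        double disk_access = 30e-3;    /* 30 milliseconds                 */
        double book_lookup = 20.0;     /* seconds, the assumed RAM analog */

        double ratio        = disk_access / ram_access;
        double library_trip = ratio * book_lookup;

        printf("disk access / RAM access = %.0f\n", ratio);
        printf("library trip = %.0f seconds, about %.0f days\n",
               library_trip, library_trip / 86400.0);
        return 0;
    }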

On the other hand, disks provide enormous capacity at much less cost than RAM. They also keep the information stored on them when they are turned off. The tension between a disk's relatively slow access time and its enormous, nonvolatile capacity is the driving force behind file structure design. Good file structure design will give us access to all the capacity without making our applications spend a lot of time waiting for the disk. This book shows you how to develop such file designs.

1.2  A Short History of File Structure Design

Put another way, our goal is to show you how to think creatively about file structure design problems. Part of our approach to doing this is based on history: After introducing basic principles of design in the first part of this book, we devote the last part to studying some of the key developments in file design over the last 30 years. The problems that researchers struggle with reflect the same issues that you confront in addressing any substantial file design problem. Working through the approaches used to address major file design issues shows you a lot about how to approach new design problems.

The general goals of research and development in file structures can be drawn directly from our library analogy:

- Ideally, we would like to get the information we need with one access to the disk. In terms of our analogy, we do not want to issue a series of 58-day requests before we get what we want.
- If it is impossible to get what we need in one access, we want structures that allow us to find the target information with as few accesses as possible. For example, you may remember from your studies of data structures that a binary search allows us to find a particular record among 50,000 other records with no more than 16 comparisons. But having to look 16 places on a disk before finding what we want takes too much time. We need file structures that allow us to find what we need with only two or three trips to the disk.
- We want our file structures to group information so we are likely to get everything we need with only one trip to the disk. If we need a client's name, address, phone number, and account balance, we would prefer to get all that information at once, rather than having to look in several places for it.

It is relatively easy to come up with file structure designs that meet these goals when we have files that never change. Designing file structures that maintain these qualities as files change, growing and shrinking as information is added and deleted, is much more difficult.

Early work with files presumed that files were on tape, since most files were. Access was sequential, and the cost of access grew in direct proportion to the size of the file. As files grew intolerably large for unaided sequential access and as storage devices like disk drives became available, indexes were added to files. The indexes made it possible to keep a list of keys and pointers in a smaller file that could be searched more quickly; given the key and pointer, the user had direct access to the large, primary file.

Unfortunately, simple indexes had some of the same, sequential flavor as the data files themselves, and as the indexes grew they too became difficult to manage, especially for dynamic files in which the set of keys changes. Then, in the early 1960s, the idea of applying tree structures emerged as a potential solution. Unfortunately, trees can grow very unevenly as records are added and deleted, resulting in long searches requiring many disk accesses to find a record.

In 1963 researchers developed the AVL tree, an elegant, self-adjusting binary tree structure for data in RAM. Other researchers began to look for ways to apply AVL trees, or something like them, to files. The problem was that even with a balanced binary tree, dozens of accesses are required to find a record in even moderate-sized files. A way was needed to keep a tree balanced when each node of the tree was not a single record, as in a binary tree, but a file block containing dozens, perhaps even hundreds, of records.

It took nearly 10 more years of design work before a solution emerged in the form of the B-tree. Part of the reason that finding a solution took so long was that the approach required for file structures was very different from the approach that worked in RAM. Whereas AVL trees grow from the top down as records are added, B-trees grow from the bottom up.

B-trees provided excellent access performance, but there was a cost: No longer could a file be accessed sequentially with efficiency. Fortunately, this problem was solved almost immediately by adding a linked list structure at the bottom level of the B-tree. The combination of a B-tree and a sequential linked list is called a B+ tree.

Over the following 10 years B-trees and B+ trees became the basis for many commercial file systems, since they provide access times that grow in proportion to log_k N, where N is the number of entries in the file and k is the number of entries indexed in a single block of the B-tree structure. In practical terms, this means that B-trees can guarantee that you can find one file entry among millions of others with only three or four trips to the disk. Further, B-trees guarantee that as you add and delete entries, performance stays about the same.
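To see what that logarithmic growth means in concrete terms, here is a small sketch of ours, not one of the authors' examples; the fan-out of 512 entries per block is an assumed figure. It estimates the number of disk accesses as the ceiling of log_k N:

    /* btree_depth.c -- estimate the number of block reads needed to find
    ** one entry among N when every block of the tree indexes k entries.
    ** The count grows as ceil(log N / log k), the base-k logarithm of N.
    */
    #include <stdio.h>
    #include <math.h>

    int main( )
    {
        double N = 1000000.0;    /* entries in the file (assumed)  */
        double k = 512.0;        /* entries per block (assumed)    */

        int levels = (int) ceil(log(N) / log(k));

        printf("about %d accesses to find 1 entry among %.0f\n", levels, N);
        return 0;
    }

With these assumed figures the program prints 3, which matches the "three or four trips to the disk" described above.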

Being able to get information back with just three or four accesses is pretty good. But how about our goal of being able to get what we want with a single request? An approach called hashing is a good way to do that with files that do not change size greatly over time. From early on, hashed indexes were used to provide fast access to files. However, until recently, hashing did not work well with volatile, dynamic files that changed greatly in size. After the development of B-trees, researchers turned to work on systems for extendible, dynamic hashing that could retrieve information with one or, at most, two disk accesses no matter how big the file becomes. We close this book with a careful look at this work, which took place from the late 1970s through the first part of the 1980s.

1.3  A Conceptual Toolkit: File Structure Literacy

As we move through the developments in file structures over the last three decades, watching file structure design evolve as it addresses dynamic files first sequentially, then through tree structures, and finally through direct access, we see that the same design problems and design tools keep emerging. We decrease the number of disk accesses by collecting data into buffers, blocks, or buckets; we manage the growth of these collections by splitting them, which requires that we find a way to increase our address or index space, and so on. Progress takes the form of finding new ways to combine these basic tools of file design.

We think of these tools as conceptual tools. They are ways of framing and addressing a design problem. Our own work in file structures has shown us that by understanding the tools thoroughly, and by studying how the tools have been combined to produce such diverse approaches as B-trees and extendible hashing, we develop mastery and flexibility in our own use of the tools. In other words, we acquire literacy with regard to file structures. This text is designed to help readers acquire file structure literacy. Chapters 1 through 6 introduce the basic tools; Chapters 7 through 11 introduce readers to the highlights of the past several decades of file structure design, showing how the basic tools are used to handle efficient sequential access, B-trees, B+ trees, hashed indexes, and extendible, dynamic hashed files.

SUMMARY

The key design problem that shapes file structure design is the relatively large amount of time that is required to get information from disk. All file structure designs focus on minimizing disk accesses and maximizing the likelihood that the information the user will want is already in RAM.

This text begins by introducing the basic concepts and issues associated with file structures. The last half of the book tracks the development of file structure design as it has evolved over the last 30 years. The key problem addressed throughout this evolution is finding ways to minimize disk accesses for files that keep changing in content and size. Tracking these developments takes us first through work on sequential file access, then through developments in tree-structured access, and finally to relatively recent work on direct access to information in files.

Our experience has been that the study of the principal research and design contributions to file structures, focusing on how the design work uses the same tools in new ways, provides a solid foundation for thinking creatively about new problems in file structure design.

KEY TERMS

AVL tree. A self-adjusting binary tree structure that can guarantee good access times for data in RAM.

B-tree. A tree structure that provides fast access to data stored in files. Unlike binary trees, in which the branching factor from a node of the tree is two, the descendents from a node of a B-tree can be a much larger number. We introduce B-trees in Chapter 8.

B+ tree. A variation on the B-tree structure that provides sequential access to the data as well as fast indexed access. We discuss B+ trees at length in Chapter 9.

Extendible hashing. An approach to hashing that works well with files that undergo substantial changes in size over time.

File structures. The organization of data on secondary storage devices such as disks.

Hashing. An access mechanism that transforms the search key into a storage address, thereby providing very fast access to stored data.

Sequential access. Access that takes records in order, looking at the first, then the next, and so on.

Fundamental File Processing Operations

CHAPTER OBJECTIVES

- Describe the process of linking a logical file within a program to an actual physical file or device.
- Describe the procedures used to create, open, and close files.
- Describe the procedures used for reading from and writing to files.
- Introduce the concept of position within a file and describe procedures for seeking different positions.
- Provide an introduction to the organization of the UNIX file system.
- Present the UNIX view of a file, and describe UNIX file operations and commands based on this view.

CHAPTER OUTLINE

2.1  Physical Files and Logical Files
2.2  Opening Files
2.3  Closing Files
2.4  Reading and Writing
     2.4.1  Read and Write Functions
     2.4.2  A Program to Display the Contents of a File
     2.4.3  Detecting End-of-File
2.5  Seeking
     2.5.1  Seeking in C
     2.5.2  Seeking in Pascal
2.6  Special Characters in Files
2.7  The UNIX Directory Structure
2.8  Physical and Logical Files in UNIX
     2.8.1  Physical Devices as UNIX Files
     2.8.2  The Console, the Keyboard, and Standard Error
     2.8.3  I/O Redirection and Pipes
2.9  File-related Header Files
2.10 UNIX File System Commands

2.1  Physical Files and Logical Files

When we talk about a file on a disk or tape, we refer to a particular collection of bytes stored there. A file, when the word is used in this sense, physically exists. A disk drive might contain hundreds, even thousands, of these physical files.

From the standpoint of an application program, the notion of a file is different. To the program, a file is somewhat like a telephone line connected to a telephone network. The program can receive bytes through this phone line, or send bytes down it, but knows nothing about where these bytes actually come from or where they go. The program knows only about its own end of the phone line. Moreover, even though there may be thousands of physical files on a disk, a single program is usually limited to the use of only about 20 files.

The application program relies on the operating system to take care of the details of the telephone switching system, as illustrated in Fig. 2.1. It could be that bytes coming down the line into the program originate from an actual physical file, or they might come from the keyboard or some other input device. Similarly, the bytes that the program sends down the line might end up in a file, or they could appear on the terminal screen. Although the program often doesn't know where bytes are coming from or where they are going, it does know which line it is using. This line is usually referred to as the logical file to distinguish this view from the physical files on the disk or tape.

Before the program can open a file for use, the operating system must receive instructions about making a hookup between a logical file (e.g., a phone line) and some physical file or device. When using operating systems such as IBM's OS/MVS, these instructions are provided through job control language (JCL). On minicomputers and microcomputers, more modern operating systems such as UNIX, MS-DOS, and VMS provide the instructions within the program. For example, in Turbo Pascal the association between a logical file called inp_file and a physical file called myfile.dat is made with the following statement:

    assign(inp_file, 'myfile.dat');

This statement asks the operating system to find the physical file named myfile.dat and then make the hookup by assigning a logical file (phone line) to it. The number identifying the particular phone line that is assigned is returned through the FILE variable inp_file, which is the file's logical name. This logical name is what we use to refer to the file inside the program. Again, the telephone analogy applies: My office phone is connected to six telephone lines. When I receive a call I get an intercom message such as, "You have a call on line three." The receptionist does not say, "You have a call from 918-123-4567." I need to have the call identified logically, not physically.

2.2  Opening Files

Once we have a logical file identifier hooked up to a physical file or device, we need to declare what we intend to do with the file. In general, we have two options: (1) open an existing file or (2) create a new file, deleting any existing contents in the physical file. Opening a file makes it ready for use by the program. We are positioned at the beginning of the file and are ready to start reading or writing. The file contents are not disturbed by the open statement. Creating a file also opens the file in the sense that it is ready for use after creation. Since a newly created file has no contents, writing is initially the only use that makes sense.

†Different Pascal compilers vary widely with regard to I/O procedures, since standard Pascal contains little in the way of I/O definition. Throughout this book we use the term Pascal when discussing features common to most Pascal implementations. When we refer to the features of a specific implementation, such as Turbo Pascal, we say so.

FIGURE 2.1  The program relies on the operating system to make connections between logical files and physical files and devices. (The figure shows a program, limited to approximately 20 phone lines, connected through the operating system switchboard, which can make connections to thousands of physical files or I/O devices such as a printer.)

In Pascal the reset() statement is used to open existing files and the rewrite() statement is used to create new ones. For example, to open a file in Turbo Pascal we might use a sequence of statements such as:

    assign(inp_file, 'myfile.dat');
    reset(inp_file);

Note that we use the logical file name, not the physical one, in the reset() statement. To create a file in Turbo Pascal, the statements might read:

    assign(out_file, 'myfile.dat');
    rewrite(out_file);

We can open an existing file or create a new one in C through the UNIX system function open(). This function takes two required arguments and a third argument that is optional:

    fd = open(filename, flags, [pmode]);

The return value fd and the arguments filename, flags, and pmode have the following meanings:

fd        The file descriptor. Using our earlier analogy, this is the phone line (logical file identifier) used to refer to the file within the program. It is an integer. If there is an error in the attempt to open the file, this value is negative.

filename  A character string containing the physical file name. (Later we discuss pathnames that include directory information about the file's location. This argument can be a pathname.)

flags     The flags argument is an integer that controls the operation of the open function, determining whether it opens an existing file for reading or writing. It can also be used to indicate that you want to create a new file, or open an existing file but delete its contents. The value of flags is set by performing a bit-wise OR of the following values, among others.†

          O_APPEND   Append every write operation to the end of the file.
          O_CREAT    Create and open a file for writing. This has no effect if the file already exists.
          O_EXCL     Return an error if O_CREAT is specified and the file exists.
          O_RDONLY   Open a file for reading only.
          O_RDWR     Open a file for reading and writing.
          O_TRUNC    If the file exists, truncate it to a length of zero, destroying its contents.
          O_WRONLY   Open a file for writing only.

          Some of these flags cannot be used in combination with one another. Consult your documentation for details, as well as for other options.

pmode     If O_CREAT is specified, pmode is required. This integer argument specifies the protection mode for the file. In UNIX, the pmode is a three-digit octal number that indicates how the file can be used by the owner (first digit), by members of the owner's group (second digit), and by everyone else (third digit). The first bit of each octal digit indicates read permission, the second write permission, and the third execute permission. So, if pmode is the octal number 0751, the file's owner has read, write, and execute permission for the file; the owner's group would have read and execute permission; and everyone else has only execute permission:

              rwe   rwe   rwe
              111   101   001        pmode = 0751
             owner  group world

†These values are defined in an "include" file packaged with your UNIX system or C compiler. The name of the include file is often fcntl.h or file.h, but it can vary from system to system.
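The octal-digit encoding described above is easy to unpack mechanically. The short program below is our own sketch, not one of the book's examples: it splits a pmode such as 0751 into its three digits and prints the read, write, and execute bits in the same owner/group/world layout.

    /* pmode.c -- a sketch that unpacks a three-digit octal protection
    ** mode (0751 in the example above) into the owner, group, and world
    ** permission bits: 4 = read, 2 = write, 1 = execute.
    */
    #include <stdio.h>

    void show_digit(int digit)
    {
        printf("%c%c%c   ", (digit & 4) ? 'r' : '-',
                            (digit & 2) ? 'w' : '-',
                            (digit & 1) ? 'e' : '-');
    }

    int main( )
    {
        int pmode = 0751;              /* the value used in the text   */

        printf("owner group world\n");
        show_digit((pmode >> 6) & 7);  /* first octal digit: owner     */
        show_digit((pmode >> 3) & 7);  /* second octal digit: group    */
        show_digit(pmode & 7);         /* third octal digit: world     */
        printf("\n");
        return 0;
    }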

Given this description of the open() function, we can develop some examples to show how it can be used to open and create files in C. The following function call opens an existing file for reading and writing, or creates a new one if necessary. If the file exists it is opened without change; reading or writing would start at the file's first byte.

    fd = open(filename, O_RDWR | O_CREAT, 0751);

The following call creates a new file for reading and writing. If there is already a file with the name specified in filename, its contents are truncated.

    fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0751);

Finally, here is a call that will create a new file with the name specified in filename only if there is not already a file with this name. If a file with this name exists, it is not opened and the function returns a negative value to indicate an error.

    fd = open(filename, O_RDWR | O_CREAT | O_EXCL, 0751);
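Since a failed open() is reported only through that negative return value, a call is normally followed by a test of fd before any reading or writing is attempted. The program below is our own minimal sketch of that pattern, reusing the myfile.dat name from the earlier examples:

    /* openchk.c -- a minimal sketch of testing the descriptor returned
    ** by open( ); a negative value means the open failed and the file
    ** must not be used.
    */
    #include <stdio.h>
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main( )
    {
        int fd = open("myfile.dat", O_RDWR | O_CREAT, 0751);

        if (fd < 0) {
            printf("cannot open myfile.dat\n");
            exit(1);
        }
        /* ... read from or write to the file here ... */
        close(fd);
        return 0;
    }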

File protection is tied more to the host operating system than to a specific language. For example, implementations of Pascal running on systems that support file protection, such as VAX/VMS, often include extensions to standard Pascal that let you associate a protection status with a file when you create it.

2.3  Closing Files

In terms of our telephone line analogy, closing a file is like hanging up the phone. When you hang up the phone, the phone line is available for taking or placing another call; when you close a file, the logical file name or file descriptor is available for use with another file. Closing a file that has been used for output also ensures that everything has been written to the file. As you will learn in a later chapter, it is more efficient to move data to and from secondary storage in blocks than it is to move data one byte at a time. Consequently, the operating system does not immediately send off the bytes we write, but saves them up in a buffer for transfer as a block of data. Closing a file makes sure that the buffer for that file has been flushed of data and that everything we have written has actually been sent to the file.

Files are usually closed automatically by the operating system when a program terminates normally. Consequently, the explicit use of a CLOSE statement within a program is needed only as protection against data loss in the event of program interruption and to free up logical filenames for reuse. Some languages, including standard Pascal, do not even provide a CLOSE statement. However, explicit file closing is possible in the C language, VAX Pascal, PL/I, and most other languages used for serious file processing work.

Now that you know how to connect and disconnect programs to and from physical files and how to open the files, you are ready to start sending and receiving data.

2.4  Reading and Writing

Reading and writing are fundamental to file processing; they are the actions that make file processing an input/output (I/O) operation. The actual form of the read and write statements used in different languages varies. Some languages provide very high-level access to reading and writing and automatically take care of details for the programmer. Other languages provide access at a much lower level. Our use of Pascal and C allows us to explore some of these differences.†

†To accentuate these differences and provide a look at I/O operations at something closer to the systems level, we use the read() and write() system calls in C rather than higher-level functions such as fgetc(), fgets(), and so on.

2.4.1  Read and Write Functions

We begin here with reading and writing at a relatively low level. It is useful to have a kind of systems-level understanding of what happens when we send and receive information to and from a file.

A low-level read call requires three pieces of information, expressed here as arguments to a generic READ() function:

    READ(Source_file, Destination_addr, Size)

Source_file       The READ() call must know from where it is to read. We specify the source by logical file name (phone line) through which data is received. (Remember, before we do any reading, we must have already opened the file, so the connection between a logical file and a specific physical file or device already exists.)

Destination_addr  READ() must know where to place the information it reads from the input file. In this generic function we specify the destination by giving the address of the first memory block where we want to store the data.

Size              Finally, READ() must know how much information to bring in from the file. Here the argument is supplied as a byte count.

A WRITE statement is similar; the only difference is that the data moves in the other direction:

    WRITE(Destination_file, Source_addr, Size)

Destination_file  The logical file name we use for sending the data.

Source_addr       WRITE() must know where to find the information that it will send. We provide this specification as the first address of the memory block where the data is stored.

Size              The number of bytes to be written must be supplied.
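Because the third argument is simply a byte count, the same pattern moves whole records just as easily as single characters. The short program below is our own sketch, not one of the book's figures, and the person record layout in it is invented; it writes a fixed-length record with the C counterpart of WRITE() and reads it back with read():

    /* recrw.c -- a sketch of the three-argument read/write pattern
    ** applied to a whole fixed-length record: we pass the logical file,
    ** the address of the record in memory, and its size in bytes.
    */
    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>

    struct person {              /* an invented fixed-length record */
        char name[64];
        char phone[20];
    };

    int main( )
    {
        struct person rec, copy;
        int fd;

        strcpy(rec.name,  "Jane Example");
        strcpy(rec.phone, "555-0100");

        fd = open("person.dat", O_RDWR | O_CREAT | O_TRUNC, 0751);
        write(fd, &rec, sizeof(rec));     /* destination, address, size */
        close(fd);

        fd = open("person.dat", O_RDONLY);
        read(fd, &copy, sizeof(copy));    /* source, address, size      */
        close(fd);

        printf("read back: %s  %s\n", copy.name, copy.phone);
        return 0;
    }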

2.4.2  A Program to Display the Contents of a File

Let's do some reading and writing to see how these functions are used. This simple file processing program, which we call LIST, opens a file for input and reads it, character by character, sending each character to the screen after it is read from the file. LIST includes the following steps:

1. Display a prompt for the name of the input file.
2. Read the user's response from the keyboard into a variable called filename.
3. Open the file for input.
4. While there are still characters to be read from the input file,
   a. read a character from the file;
   b. write the character to the terminal screen.
5. Close the input file.

Figures 2.2 and 2.3 are, respectively, C and Pascal language implementations of this program. It is instructive to look at the differences between these implementations.

FIGURE 2.2  The LIST program in C.

    /* list.c -- program to read characters from a file
    ** and write them to the terminal screen
    */
    #include <stdio.h>
    #include <fcntl.h>

    main( )
    {
        int  fd;                    /* file descriptor */
        char c;
        char filename[20];

        printf("Enter the name of the file: ");    /* Step 1  */
        gets(filename);                            /* Step 2  */
        fd = open(filename, O_RDONLY);             /* Step 3  */
        while (read(fd, &c, 1) != 0)               /* Step 4a */
            write(STDOUT, &c, 1);                  /* Step 4b */
        close(fd);                                 /* Step 5  */
    }

Steps 1 and 2 of the program involve writing and reading, but in each of the implementations this is accomplished through the usual functions for handling the screen and keyboard. Step 4a, where we read from the input file, is the first instance of actual file I/O. Note that the read() call in the C language parallels the low-level, generic READ() statement we described earlier; in truth, we used the read() system call in C as the model for our low-level READ(). The function's first argument gives the file descriptor (C's version of a logical file name) as the source for the input, the second argument gives the address of a character variable used as the destination for the data, and the third argument specifies that only one byte will be read.

The arguments for the Pascal read() call communicate the same information at a higher level. Once again, the first argument is the logical file name for the input source. The second argument gives the name of a character variable used as a destination; given the name, Pascal can find the address. Because of Pascal's strong emphasis on variable types, the third argument of the generic READ() function is not required. Pascal assumes that since we are reading data into a variable of type char, we must want to read only one byte.

After a character is read, we write it out to the screen in Step 4b. Once again the differences between C and Pascal indicate the range of approaches to I/O used in different languages. Everything must be stated explicitly in the C write() call. Using the special, assigned file descriptor of STDOUT to identify the terminal screen as the destination for our writing,

    write(STDOUT, &c, 1);

means: "Write to the screen the contents from memory starting at the address &c. Write only one byte." Beginning C programmers should pay special attention to the use of the & symbol in the write() call here; this is a very low-level call, requiring that the programmer provide the starting address in RAM of the bytes to be transferred.

STDOUT, which stands for "standard output," is an integer value defined in the file stdio.h, which has been included at the top of the program. The actual value of STDOUT that is set in stdio.h is, by convention, always 1. The concept of standard output and its counterpart "standard input" are covered later in the section "Physical and Logical Files in UNIX."

FIGURE 2.3  The LIST program in Pascal.

    PROGRAM list (INPUT, OUTPUT);
    { reads input from a file and writes it to the terminal screen }

    VAR
       c        : char;
       infile   : file of char;                   { logical file name  }
       filename : packed array [1..20] of char;   { physical file name }

    BEGIN {main}
       write('Enter the name of the file: ');     { Step 1  }
       readln(filename);                          { Step 2  }
       reset(infile, filename);                   { Step 3  }
       while not (eof(infile)) DO
       BEGIN
          read(infile, c);                        { Step 4a }
          write(c)                                { Step 4b }
       END;
       close(infile)                              { Step 5  }
    END.

Again the Pascal code operates at a higher level.† When no logical file name is specified in a write() statement, Pascal assumes that we are writing to the terminal screen. Since the variable c is of type char, Pascal assumes we are writing a single byte. The statement becomes simply

    write(c)

As in the read() statement, Pascal takes care of finding the address of the bytes; the programmer need specify only the name of the variable that is associated with that address.
2.4.3 Detecting End-of-File

The programs in Figs. 2.2 and 2.3 have to know when to end the while loop and stop reading characters. Pascal and C signal the end-of-file condition differently, illustrating two of the most commonly used approaches to end-of-file detection.

Pascal supplies a Boolean function, eof( ), which can be used to test for end-of-file. As we read from a file, the operating system keeps track of our location in the file with a read/write pointer. This is necessary so when the next byte is read, the system knows where to get it. The eof( ) function queries the system to see whether the read/write pointer has moved past the last element in the file. If it has, eof( ) returns true; otherwise it returns false. As Fig. 2.3 illustrates, we use the eof( ) call before trying to read the next byte. For an empty file, eof( ) immediately returns true and no bytes are read.

In the C language, the read( ) call returns the number of bytes read. If read( ) returns a value of zero, then the program has reached the end of the file. So, rather than using an eof( ) function, we construct the while loop to run as long as the read( ) call finds something to read.
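The same idea can be made a little more explicit by capturing the value that read( ) returns. The following minimal sketch is not from the text; it assumes an already-open file descriptor fd, as in the LIST program. On UNIX systems a return of 0 signals end-of-file, while a negative return signals an error.

    int  bytes_read;
    char c;

    /* keep reading until read( ) reports that no bytes were transferred */
    while ((bytes_read = read(fd, &c, 1)) > 0)
        write(STDOUT, &c, 1);

    /* bytes_read is now 0 at end-of-file, or negative if an error occurred */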

2.5 Seeking

In the preceding sample programs we read through the file sequentially, reading one byte after another until we reach the end of the file. Every time a byte is read, the operating system moves the read/write pointer ahead, and we are ready to read the next byte.

Sometimes we want to read or write without taking the time to go through every byte sequentially. Perhaps we know that the next piece of information we need is 10,000 bytes away, and so we want to jump there to begin reading. Or perhaps we need to jump to the end of the file so we can add new information there. To satisfy these needs we must be able to control the movement of the read/write pointer.

The action of moving directly to a certain position in a file is often called seeking. A seek requires at least two pieces of information, expressed here as arguments to the generic pseudocode function SEEK( ):

    SEEK(Source_file, Offset)

    Source_file   The logical file name in which the seek will occur.
    Offset        The number of positions in the file the pointer is to be
                  moved from the start of the file.

Now, if we want to move directly from the origin to the 373rd position in a file called data, we don't have to move sequentially through the first 372 positions first. Instead, we can say

    SEEK(data, 373)

2.5.1 Seeking in C

One of the features of UNIX that has been incorporated into many implementations of the C language is the ability to view a file as a potentially very large array of bytes that just happens to be kept on secondary storage. In an array of bytes in RAM, we can move to any particular byte through the use of a subscript. The C language seek function, called lseek( ), provides a similar capability for files. It lets us set the read/write pointer to any byte in a file.

The lseek( ) function has the following form:

    pos = lseek(fd, byte_offset, origin)

where the variables have the following meanings:

    pos           A long integer value returned by lseek( ) equal to the
                  position (in bytes) of the read/write pointer after it has
                  been moved.
    fd            The file descriptor of the file to which the lseek( ) is to
                  be applied.
    byte_offset   The number of bytes to move from some origin in the file.
                  The byte offset must be specified as a long integer, hence
                  the name lseek for long seek. When appropriate, the
                  byte_offset can be negative.
    origin        A value that specifies the starting position from which the
                  byte_offset is to be taken. The origin can have the value
                  0, 1, or 2:†
                      0   lseek( ) from the beginning of the file;
                      1   lseek( ) from the current position;
                      2   lseek( ) from the end of the file.

†Although the values 0, 1, and 2 are almost always used here, they are not guaranteed to work for all implementations. Consult your documentation.

The following program fragment shows how you could use lseek( ) to move to a position that is 373 bytes into a file.

    long pos;
    int  fd;
    . . .
    pos = lseek(fd, 373L, 0);
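As an additional illustration, not taken from the text, the origin argument can be combined with the returned position to discover how large a file is and then return to its beginning. This is a minimal sketch, assuming fd is a file descriptor that has already been opened with open( ):

    long file_size, pos;

    file_size = lseek(fd, 0L, 2);   /* move to the end of the file; the value   */
                                    /* returned is the file's size in bytes     */
    pos = lseek(fd, 0L, 0);         /* move back to the beginning of the file   */

Because lseek( ) only moves the read/write pointer, no data is transferred by either call; the next read( ) or write( ) simply takes place at the new position.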

2.5.2 Seeking in Pascal

The view of a file as presented in Pascal differs from the C view in at least two important respects:

- In C, a file is a sequence of bytes, so addressing within a file is on a byte-by-byte basis. In Pascal, a file is a sequence of "records" of some particular type. A record can be a simple scalar such as a character or integer, or it may be a more complex structure. Addressing within a file in Pascal is in terms of these records. For example, if a file is made up of 100-byte records, and we want to refer to the fourth record, we would do so in Pascal simply by referencing record number 4. In C, where the view is solely and always in terms of bytes, we would have to address the fourth record as byte address 400.
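To make the byte-oriented arithmetic concrete, here is a small illustrative C fragment, not from the text, that uses lseek( ) to position the read/write pointer at the start of a fixed-length record. It numbers records from zero, so record k begins at byte k * 100; the names RECORD_SIZE and k are introduced only for this sketch.

    #define RECORD_SIZE 100L

    long byte_address;
    int  k = 10;                         /* record number, counting from zero    */

    byte_address = k * RECORD_SIZE;      /* record k begins at byte k * 100      */
    lseek(fd, byte_address, 0);          /* position the pointer; the next       */
                                         /* read( ) starts at this byte address  */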

- Standard Pascal actually does not provide for seeking. The model for I/O for standard Pascal is magnetic tape, which must be read sequentially. In standard Pascal, adding data to the end of a file requires reading the entire file from beginning to end, writing out the data from the input file to a second, output file, and then adding the new data to the end of the output file. However, many implementations of Pascal such as VAX Pascal and Turbo Pascal have extended the standard and do support seeking.

There is an extension to Pascal proposed by the Joint ANSI/IEEE Pascal Standards Committee (1984) that may be included in the Pascal standard in the future. It includes the following procedures and functions that permit seeking:

SeekWrite(f,n)   A procedure that positions the file f on the element with index n and places the file in write mode, so the selected and following elements may be modified.

SeekRead(f,n)    A procedure that positions the file f on the element with index n and places the file in read mode, so the selected and following elements may be examined. If SeekRead( ) attempts to position beyond the end of the file, then the file is positioned at the end of the file.

Position(f)      A function that returns the index value representing the position of the current file element.

EndPosition(f)   A function that returns the index value representing the position of the last file element.

Many Pascal implementations, recognizing the need to provide seeking capabilities, had already implemented seeking functions before these proposals were set forth. Consequently, the mechanisms for handling seeking vary widely among implementations.

2.6 Special Characters in Files

As you create the file structures described in this text, you may encounter some difficulty with extra, unexpected characters that turn up in your files, with characters that disappear, and with numeric counts that are inserted into your files. Here are some examples of the kinds of things you might encounter:

- On many small computers you may find that a Control-Z (ASCII value of 26) is appended at the end of your files. Some applications use this to indicate end-of-file even if you have not placed it there. This is most likely to happen on MS-DOS systems.

- Some systems adopt a convention of indicating end-of-line in a text file† as a pair of characters consisting of a carriage return (CR: ASCII value of 13) and a line feed (LF: ASCII value of 10). Sometimes I/O procedures written for such systems automatically expand single CR characters or LF characters into CR-LF pairs. This unrequested addition of characters can cause a great deal of difficulty. Again, you are most likely to encounter this phenomenon on MS-DOS systems.

- Users of larger systems, such as VMS, may find that they have just the opposite problem. Certain file formats under VMS remove carriage return characters from your file without asking you, replacing them with a count of the characters in what the system has perceived as a line of text.

These are just a few examples of the kinds of uninvited modifications that record management systems or I/O support packages might make to your files. You will find that they are usually associated with the concepts of a line of text or end of a file. In general, these modifications to your files are an attempt to make your life easier by doing things for you automatically. This might, in fact, work out for users who want to do nothing more than store some text in a file. Unfortunately, however, programmers building sophisticated file structures must sometimes spend a lot of time finding ways to disable this automatic assistance so they can have complete control over what they are building. Forewarned is forearmed; readers who encounter these kinds of difficulties as they build the file structures described in this text can take some comfort from the knowledge that the experience they gain in disabling automatic assistance will serve them well, over and over, in the future.

†When we use the term text file in this text, we are referring to a file consisting entirely of characters from a specific standard character set, such as ASCII or EBCDIC. Unless otherwise specified, the ASCII character set will be assumed. An appendix contains a table that describes the ASCII character set.

2.7 The UNIX Directory Structure

No matter what computer system you have, even if it is a small PC, chances are there are hundreds or even thousands of files you have access to. To provide convenient access to such large numbers of files, your computer has some method for organizing its files. In UNIX this is called the filesystem.

The UNIX filesystem is a tree-structured organization of directories, with the root of the tree signified by the character '/'. All directories, including the root, can contain two kinds of files: regular files with programs and data, and directories (Fig. 2.4). Since devices such as tape drives are also treated like files in UNIX, directories can also contain references to devices, as shown in the dev directory in Fig. 2.4. The name of a file stored in a UNIX directory corresponds to what we call its physical name.

Since every file in a UNIX system is part of the filesystem that begins with root, any file can be uniquely identified by giving its absolute pathname. For instance, the true, unambiguous name of the file "addr" in Fig. 2.4 is /usr6/mydir/addr. (Note that the '/' is used both to indicate the root directory and to separate directory names from the file name.)

FIGURE 2.4 Sample UNIX directory structure. (The tree, rooted at /, includes directories holding entries such as adb, cc, yacc, console, kbd, TAPE, libc.a, and libm.a.)

When you issue commands to a UNIX system, you do so within some directory, which is called your current directory. A pathname for a file that does not begin with a '/' describes the location of a file relative to the current directory. Hence, if your current directory in Fig. 2.4 is mydir, addr uniquely identifies the file /usr6/mydir/addr.

The special filename "." stands for the current directory, and ".." stands for the parent of the current directory. Hence, if your current directory is /usr6/mydir/DF, "../addr" refers to the file /usr6/mydir/addr.

2.8 Physical and Logical Files in UNIX

2.8.1 Physical Devices as UNIX Files

One of the most powerful ideas in UNIX is reflected in its notion of what a file is. In UNIX, a file is a sequence of bytes, without any implication of how or where the bytes are stored or where they originate. This simple conceptual view of a file makes it possible in UNIX to do with very few operations what might require many times the operations on a different operating system. For example, it is easy to think of a magnetic disk as the source of a file, because we are used to the idea of storing such things on disks. But in UNIX, devices like the keyboard and the console are also files; in Fig. 2.4 they are /dev/kbd and /dev/console, respectively. The keyboard produces a sequence of bytes that are sent to the computer when keys are pressed; the console accepts a sequence of bytes and displays their corresponding symbols on a screen.

How can we say that the UNIX concept of a file is simple when it allows so many different physical things to be called files? Doesn't this make the situation more complicated, not simpler? The trick in UNIX is that no matter what physical representation a file may take, the logical view of a UNIX file is the same. In its simplest form, a UNIX file is represented by an integer: the file descriptor. This integer is an index to an array of more complete information about the file. A keyboard, a disk file, and a magnetic tape are all represented by integers. Once the integer that describes a file is identified, a program can access that file. If it knows the logical name of a file, a program can access that file without knowing whether the file comes from a disk, a tape, or a telephone.

2.8.2 The Console, the Keyboard, and Standard Error

We see an example of the duality between devices and files in the LIST program in Fig. 2.2:

    printf("Enter the name of the file: ");    /* Step 1  */
    gets(filename);                            /* Step 2  */
    fd = open(filename, O_RDONLY);             /* Step 3  */
    while (read(fd, &c, 1) != 0)               /* Step 4a */
        write(STDOUT, &c, 1);                  /* Step 4b */

The logical file is some small integer value returned by the open( ) call. We assign this integer to the variable fd in Step 3. In Step 4b, we use the integer STDOUT, defined earlier in the program, to identify the console as the file to be written to.

There are two other file descriptors that are special in UNIX: The keyboard is called STDIN (standard input) and the error file is called STDERR (standard error). Hence, STDIN is the keyboard on your terminal. The statement

    read(STDIN, &c, 1);

reads a single character from your terminal. STDERR is an error file which, like STDOUT, is usually just your console. When your compiler detects an error, it generally writes the error message to this file, which means normally that the error message turns up on your screen. As with STDIN, the values STDOUT and STDERR are usually defined in stdio.h.

Steps 1 and 2 of the LIST program also involve reading and writing from STDIN or STDOUT. Since an enormous amount of I/O involves these devices, most programming languages have special functions to perform console input and output; in LIST, the C functions printf and gets are used. Ultimately, however, printf and gets send their data through STDOUT and STDIN, respectively. But these statements hide important elements of the I/O process. For our purposes, the second set of read and write statements is more interesting and instructive.
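Because STDIN, STDOUT, and STDERR are ordinary file descriptors, the same low-level read( ) and write( ) calls used for disk files work for them as well. The short sketch below is illustrative rather than taken from the text; it assumes the descriptor names carry their conventional values of 0, 1, and 2, as in the text, and the message string is an assumption of the example. It echoes each character typed at the keyboard and then reports a problem on the error file.

    char c;
    char msg[] = "LIST: could not open the input file\n";

    /* copy characters from the keyboard to the screen */
    while (read(STDIN, &c, 1) > 0)
        write(STDOUT, &c, 1);

    /* send a message to the error file, which normally appears on the screen */
    write(STDERR, msg, sizeof(msg) - 1);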

2.8.3 I/O Redirection and Pipes

Suppose you would like to change the LIST program so it writes its output to a regular file, rather than to STDOUT. Or suppose you wanted to use the output of LIST as input to another program. Because it is common to want to do both of these, UNIX provides convenient shortcuts for switching between standard I/O (STDIN and STDOUT) and regular file I/O. These shortcuts are called I/O redirection and pipes.†

I/O redirection lets you specify at execution time alternate files for input or output. The notations for input and output redirection are

    < file    (redirect STDIN to "file")
    > file    (redirect STDOUT to "file")

For example, if the executable LIST program is called "list," we redirect the output from STDOUT to a file called "myfile" by entering the line

    list > myfile

What if, instead of storing the output from the list program in a file, you wanted to use it immediately in another program to sort the results? UNIX pipes let you do this. The notation for a UNIX pipe is '|'. Hence,

    program1 | program2

means take any STDOUT output from program1 and use it in place of any STDIN input to program2. Since UNIX has a special program called sort, which takes its input from STDIN, you can sort the output from the list program, without using an intermediate file, by entering

    list | sort

Since sort writes its output to STDOUT, the sorted listing appears on your terminal screen unless you use additional pipes or redirection to send it elsewhere.

†Strictly speaking, I/O redirection and pipes are part of a UNIX shell, which is the command interpreter that sits on top of the core UNIX operating system, the kernel. For the purpose of this discussion, this distinction is not important.
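A program gets the benefit of redirection and pipes simply by reading from STDIN and writing to STDOUT; the shell rearranges where those descriptors point before the program starts. The following small filter is an illustrative sketch rather than an example from the text. It copies its standard input to its standard output and so can be used on either side of a pipe or with < and >; the program name copyio and the header used for the declarations are assumptions of the example.

    /* copyio.c: copy standard input to standard output, one byte at a time.  */
    /* Usage with redirection and pipes:                                       */
    /*     copyio < instuff > outstuff                                         */
    /*     list | copyio | sort                                                */

    #include <unistd.h>    /* declares read( ) and write( ) on modern UNIX systems */

    #define STDIN  0       /* by convention, descriptor 0 is standard input    */
    #define STDOUT 1       /* and descriptor 1 is standard output              */

    int main()
    {
        char c;

        while (read(STDIN, &c, 1) > 0)
            write(STDOUT, &c, 1);
        return 0;
    }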

2.9 File-related Header Files

UNIX, like all operating systems, has special names and values that you must use when performing file operations. For example, some C functions return a special value indicating end-of-file (EOF) when you try to read beyond the end of a file. Recall the flags that you use in an open( ) call to indicate whether you want read-only, write-only, or read/write access. Unless we know just where to look, it is often not easy to find where these values are defined. UNIX handles the problem by putting such definitions in special header files, which can be found in special directories such as /usr/include.

Three header files relevant to the material in this chapter are stdio.h, fcntl.h, and file.h. EOF, for instance, is defined on many UNIX systems in /usr/include/stdio.h, as are the file pointers STDIN, STDOUT, and STDERR. And the flags O_RDONLY, O_WRONLY, and O_RDWR can usually be found in /usr/include/sys/file.h or possibly one of the files that it includes.

It would be instructive for you to browse through these files, as well as others that pique your curiosity.
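As a small illustration of how these definitions are used, the sketch below (not from the text) pulls in the headers and then relies on the names they supply. The file name datafile is an assumption of the example, and the exact header that supplies the O_ flags can vary from one UNIX system to another.

    #include <stdio.h>     /* supplies EOF and the standard I/O declarations        */
    #include <fcntl.h>     /* on many systems supplies O_RDONLY, O_WRONLY, O_RDWR   */

    int main()
    {
        int   fd;
        int   ch;
        FILE *fp;

        fd = open("datafile", O_RDONLY);    /* low-level open( ) using a flag from fcntl.h */

        fp = fopen("datafile", "r");        /* higher-level stdio access to the same file  */
        if (fp == NULL)
            return 1;
        while ((ch = getc(fp)) != EOF)      /* EOF is the special value defined in stdio.h */
            putc(ch, stdout);

        return 0;
    }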

2.10 UNIX Filesystem Commands

UNIX provides many commands for manipulating files. We list a few that are relevant to the material in this chapter. Most of them have many options, but the simplest uses of most should be obvious. Consult a UNIX manual for more information on how to use them.

    cat filenames           Print the contents of the named text files.
    tail filename           Print the last 10 lines of the text file.
    cp file1 file2          Copy file1 to file2.
    mv file1 file2          Move (rename) file1 to file2.
    rm filenames            Remove (delete) the named files.
    chmod mode filename     Change the protection mode on the named files.
    ls                      List the contents of the directory.
    mkdir name              Create a directory with the given name.
    rmdir name              Remove the named directory.

SUMMARY

This chapter introduces the fundamental operations of file systems: CREATE( ), OPEN( ), CLOSE( ), READ( ), WRITE( ), and SEEK( ). Each of these operations involves the creation or use of a link between a physical file stored on a secondary device and a logical file that represents a program's more abstract view of the same file. When the program describes an operation using the logical file name, the equivalent physical operation gets performed on the corresponding physical file.

The six operations appear in programming languages in many different forms. Sometimes they are built-in commands, sometimes they are functions, and sometimes they are direct calls to an operating system. Not all languages provide the user with all six operations. The operation SEEK( ), for instance, is not available in standard Pascal.

Before we can use a physical file, we must link it to a logical file. In some programming environments we do this with a statement (e.g., assign in Turbo Pascal) or with instructions outside of the program (e.g., job control language [JCL] instructions). In other languages the link between the physical file and a logical file is made with OPEN( ) or CREATE( ).

The operations CREATE( ) and OPEN( ) make files ready for reading or writing. CREATE( ) causes a new physical file to be created. OPEN( ) operates on an already existing physical file, usually setting the read/write pointer to the beginning of the file. The CLOSE( ) operation breaks the link between a logical file and its corresponding physical file. It also makes sure that the file buffer is flushed so everything that was written is actually sent to the file.

The I/O operations READ( ) and WRITE( ), when viewed at a low, systems level, require three items of information:

- The logical name of the file to be read from or written to;
- An address of a memory area to be used for the "inside of the computer" part of the exchange; and
- An indication of how much data is to be read or written.

These three fundamental elements of the exchange are illustrated in Fig. 2.5.

FIGURE 2.5 The exchange between memory and external device. (The figure identifies the file, the memory area, and the amount of data to transfer.)

READ( ) and WRITE( ) are sufficient for moving sequentially through a file to any desired position, but this form of access is often very inefficient. Some languages provide seek operations that let a program move directly to a certain position in a file. C provides direct access by means of the lseek( ) operation. The lseek( ) operation lets us view a file as a kind of large array, giving us a great deal of freedom in deciding how to organize a file. Standard Pascal does not support direct access, but many dialects of Pascal do.

One other useful file operation involves knowing when the end of a file has been reached. End-of-file detection is handled in different ways by different languages.

Much effort goes into shielding programmers from having to deal with the physical characteristics of files, but inevitably there are little details about the physical organization of files that programmers must know. When we try to have our program operate on files at a very low level (as we do a great deal in this text), we must be on the lookout for little surprises inserted in our file by the operating system or applications.

The UNIX file system, called the filesystem, organizes files in a tree structure, with all files and subdirectories expressible by their pathnames. It is possible to navigate around the filesystem as you work with UNIX files. UNIX views both physical devices and traditional disk files as files, so, for example, a keyboard (STDIN), a console (STDOUT), and a tape drive all are considered files. This simple conceptual view of files makes it possible in UNIX to do with a very few operations what might require many times the operations on a different operating system.

I/O redirection and pipes are convenient shortcuts provided in UNIX for transferring file data between files and standard I/O. Header files in UNIX, such as stdio.h, contain special names and values that you must use when performing file operations. It is important to be aware of the most common of these in use on your system.

The following section lists a sampling of UNIX commands for manipulating files.

KEY TERMS

Access mode. Type of file access allowed. The variety of access modes permitted varies from operating system to operating system.

Buffering. When input or output is saved up rather than sent off to its destination immediately, we say that it is buffered. In later chapters, we find that we can dramatically improve the performance of programs that read and write data if we buffer the I/O.

Byte offset. The distance, measured in bytes, from the beginning of the file. The very first byte in the file has an offset of 0, the second byte has an offset of 1, and so on.

CLOSE( ). A function or system call that breaks the link between a logical file name and the corresponding physical file name.

CREATE( ). A function or system call that causes a file to be created on secondary storage and may also bind a logical name to the file's physical name (see OPEN( )). A call to CREATE( ) also results in the generation of information used by the system to manage the file, such as time of creation, physical location, and access privileges for anticipated users of the file.

End-of-file (EOF). An indicator within a file that the end of the file has occurred, a function that tells if the end of a file has been encountered (e.g., eof( ) in Pascal), or a system-specific value that is returned by file-processing functions indicating that the end of a file has been encountered in the process of carrying out the function (e.g., EOF in UNIX).

File descriptor. A small, non-negative integer value returned by a UNIX open( ) or creat( ) call that is used as a logical name for the file in later UNIX system calls.

Filesystem. The name used in UNIX to describe a collection of files and directories organized into a tree-structured hierarchy.

Header file. A file in a UNIX environment that contains definitions and declarations commonly shared among many other files and applications. In C, header files are included in other files by means of the "#include" statement (see Fig. 2.2). The header files stdio.h, file.h, and fcntl.h described in this chapter contain important declarations and definitions used in file processing.

I/O redirection. The redirection of a stream of input or output from its normal place. For instance, the operator '>' can be used to redirect to a file output that would normally be sent to the console.

Logical file. The file as seen by the program. The use of logical files allows a program to describe operations to be performed on a file without knowing what actual physical file will be used. The program may then be used to process any one of a number of different files that share the same structure.

OPEN( ). A function or system call that makes a file ready for use. It may also bind a logical file name to a physical file. Its arguments include the logical file name and the physical file name and may also include information on how the file is expected to be accessed.

Pathname. A character string that describes the location of a file or directory. If the pathname starts with a '/', then it gives the absolute pathname, the complete path from the root directory to the file. Otherwise it gives the relative pathname, the path relative to the current working directory.

Physical file. A file that actually exists on secondary storage. It is the file as known by the computer operating system and that appears in its file directory.

Pipe. A UNIX operator specified by the symbol '|' that carries data from one process to another. The originating process specifies that the data is to go to STDOUT, and the receiving process expects the data from STDIN. For example, to send the standard output from a program makedata to the standard input of a program called usedata, use the command "makedata | usedata".

Protection mode. An indication of how a file can be accessed by various classes of users. In UNIX, the protection mode is a three-digit octal number that indicates how the file can be read, written to, and executed by the owner, by members of the owner's group, and by everyone else.

READ( ). A function or system call used to obtain input from a file or device. When viewed at the lowest level, it requires three arguments: (1) a Source_file logical name corresponding to an open file; (2) the Destination_address for the bytes that are to be read; and (3) the Size or amount of data to be read.

SEEK( ). A function or system call that sets the read/write pointer to a specified position in the file. Languages that provide seeking functions allow programs to access specific elements of a file directly, rather than having to read through a file from the beginning (sequentially) each time a specific item is desired. In C, the lseek( ) system call provides this capability. Standard Pascal does not have a seeking capability, but many nonstandard dialects of Pascal do.

Standard I/O. The source and destination conventionally used for input and output. In UNIX, there are three types of standard I/O: standard input (STDIN), standard output (STDOUT), and standard error (STDERR). By default STDIN is the keyboard, and STDOUT and STDERR are the console screen. I/O redirection and pipes provide ways to override these defaults.

WRITE( ). A function or system call used to provide output capabilities. When viewed at the lowest level, it requires three arguments: (1) a Destination_file name corresponding to an open file; (2) the Source_address of the bytes that are to be written; and (3) the Size or amount of the data to be written.

EXERCISES

1. Look up operations equivalent to OPEN( ), CLOSE( ), CREATE( ), READ( ), WRITE( ), and SEEK( ) in other high-level languages, such as PL/I, COBOL, and Fortran. Compare them with the C or Pascal versions.

2. If you use C:
   a) Make a list of the different ways to perform the file operations CREATE( ), OPEN( ), CLOSE( ), READ( ), and WRITE( ). Why is there more than one way to do each operation?
   b) How would you use lseek( ) to find the current position in a file?
   c) Show how to change the permissions on a file myfile so the owner has read and write permissions, group members have execute permission, and others have no permission.
   d) What is the difference between pmode and O_RDWR? What pmodes and O_RDWR are available on your system?
   e) In some typical C environments, such as UNIX and MS-DOS, all of the following represent ways to move data from one place to another:

          scanf( )     fgetc( )     read( )
          fscanf( )    gets( )      cat (or type)
          getc( )      fgets( )     main(argc, argv)

      Describe as many of these as you can, and indicate how they might be useful. Which belong to the C language, and which belong to the operating system?

3. If you use Pascal:
   a) What ways are provided in your version of Pascal to perform the file operations OPEN( ), CREATE( ), CLOSE( ), READ( ), and WRITE( )? If there is more than one way to do a certain operation, tell why. If an operation is missing, how are its functions carried out?
   b) Implement a SEEK( ) function in your Pascal, if it does not already have one.

4. A couple of years ago a company we know of bought a new COBOL compiler. One difference between the new compiler and the old one was that the new compiler did not automatically close files when execution of a program terminated, whereas the old compiler did. What sorts of problems did this cause when some of the old software was executed after having been recompiled with the new compiler?

5. Look at the two LIST programs in the text. Each has a while loop. In Pascal, the sequence of steps in the loop is test, read, write. In C, it is read, test, write. Why the difference? What would happen in Pascal if we used the C loop construction? What would happen in C if we used the Pascal loop construction?

6. In Fig. 2.4:
   a. Give the full pathname for a file in directory DF.
   b. Suppose your current directory is bin. Show how to copy the file libdf.a to the directory DF without changing your current directory.

7. What is the difference between STDOUT and STDERR? Find how to direct error messages from compilation on your system to STDERR.

8. Look up the UNIX command wc. Execute the following in a UNIX environment, and explain why it gives the number of files in the directory:

       ls | wc -l

9. Find stdio.h on your system, and find what value is used to indicate end-of-file. Also examine file.h or fcntl.h and describe in general what its contents are for.

Programming Exercises

10. Make the LIST program we provide in this chapter work with your compiler on your operating system.

11. Write a program to create a file and store a string in it. Write another program to open the file and read the string.

12. Try setting the protection mode on a file to read-only, then opening the file with an access mode of read/write. What happens?

13. Implement the UNIX command tail -n, where n is the number of lines from the end of the file to be copied to STDOUT.

14. Change the program LIST so it reads from STDIN, rather than a file, and writes to a file, rather than STDOUT. Show how to execute the new version of the program in a UNIX environment, given that the input is actually in a file called instuff. (You can also do this in most MS-DOS environments.)

15. Write a program to read a series of names, one per line, from standard input, and write out those names spelled in reverse order to standard output. Use I/O redirection and pipes to do the following:
    a. Input a series of names that are typed in from the keyboard and write them out, reversed, to a file called file1.
    b. Read the names in from file1; then write them out, re-reversed, to a file called file2.
    c. Read the names in from file2, reverse them again, and then sort the resulting list of reversed words using sort.

FURTHER READINGS

Introductory textbooks on C and Pascal tend to treat the fundamental file operations only briefly, if at all. This is particularly true with regard to C, since there are higher-level standard I/O functions in C, such as the read operations fgets( ) and fgetc( ). Some books on C and/or UNIX that do provide treatment of the fundamental file operations are Bourne (1984), Kernighan and Pike (1984), and Kernighan and Ritchie (1978, 1988). These books also provide discussions of higher-level I/O functions that we omitted from our text.

As for UNIX specifically, as of this writing there are two dominant flavors of UNIX: UNIX System V from AT&T, the originators of UNIX, and 4.3BSD (Berkeley Software Distribution) UNIX from the University of California at Berkeley. The two versions are close enough that learning about either will give you a good understanding of UNIX generally. However, as you begin to use UNIX, you will need reference material on the specific version that you are using. There are many accessible texts on both versions, including Morgan and McGilton (1987) on System V, and Wang (1988) on 4.3BSD. Less readable but absolutely essential to a serious UNIX user is the 4.3BSD UNIX Programmer's Reference Manual (U.C. Berkeley, 1986) or the System V Interface Definition (AT&T, 1986).

For Pascal, these operations vary so greatly from one implementation to another that it is probably best to consult user's manuals and literature relating to your specific implementation. Cooper (1983) covers the ISO standard Pascal, as well as some extensions. Jensen and Wirth (1974) is the definition of Pascal on which all others are based. Wirth (1975) discusses some difficulties with standard Pascal and file operations in the section, "An Important Concept and a Persistent Source of Problems: Files."

Secondary Storage and System Software

CHAPTER OBJECTIVES

- Describe the organization of typical disk drives, including basic units of organization and their relationships.
- Identify and describe the factors affecting disk access time, and describe methods for estimating access times and space requirements.
- Describe magnetic tapes, identify some tape applications, and investigate the implications of block size on space requirements and transmission speeds.
- Identify fundamental differences between media and criteria that can be used to match the right medium to an application.
- Describe in general terms the events that occur when data is transmitted between a program and a secondary storage device.
- Introduce concepts and techniques of buffer management.
- Illustrate many of the concepts introduced in the chapter, especially system software concepts, in the context of UNIX.

CHAPTER OUTLINE

3.1 Disks
    3.1.1 The Organization of Disks
    3.1.2 Estimating Capacities and Space Needs
    3.1.3 Organizing Tracks by Sector
    3.1.4 Organizing Tracks by Block
    3.1.5 Nondata Overhead
    3.1.6 The Cost of a Disk Access
    3.1.7 Effect of Block Size on Performance: A UNIX Example
    3.1.8 Disk as Bottleneck
3.2 Magnetic Tape
    3.2.1 Organization of Data on Tapes
    3.2.2 Estimating Tape Length Requirements
    3.2.3 Estimating Data Transmission Times
    3.2.4 Tape Applications
3.3 Disk versus Tape
3.4 Storage as a Hierarchy
3.5 A Journey of a Byte
    3.5.1 The File Manager
    3.5.2 The I/O Buffer
    3.5.3 The Byte Leaves RAM: The I/O Processor and Disk Controller
3.6 Buffer Management
    3.6.1 Buffer Bottlenecks
    3.6.2 Buffering Strategies
3.7 I/O in UNIX
    3.7.1 The Kernel
    3.7.2 Linking File Names to Files
    3.7.3 Normal Files, Special Files, and Sockets
    3.7.4 Block I/O
    3.7.5 Device Drivers
    3.7.6 The Kernel and File Systems
    3.7.7 Magnetic Tape and UNIX

Good design is always responsive to the constraints of the medium and to the environment. This is as true for file structure design as it is for designs in wood and stone. Given the ability to create, open, and close files, and to seek, read, and write, we can perform the fundamental operations of file construction. Now we need to look at the nature and limitations of the devices and systems used to store and retrieve files, preparing ourselves for file design.

If files were stored just in RAM, there would be no separate discipline called file structures. The general study of data structures would give us all the tools we would need to build file applications. But secondary storage devices are very different from RAM. One difference, as already noted, is that accesses to secondary storage take much more time than do accesses to RAM. An even more important difference, measured in terms of design impact, is that not all accesses are equal. Good file structure design uses knowledge of disk and tape performance to arrange data in ways that minimize access costs.

In this chapter we examine the characteristics of secondary storage devices, focusing on the constraints that shape our design work in the chapters that follow. We begin with a look at the major media used in the storage and processing of files, magnetic disks, and tapes. We follow this with an overview of the range of other devices and media used for secondary storage. Next, by following the journey of a byte, we take a brief look at the many pieces of hardware and software that become involved when a byte is sent by a program to a file on a disk. Finally, we take a closer look at one of the most important aspects of file management: buffering.

3.1 Disks
Compared to the time it takes to access an item in RAM, disk accesses are always expensive. However, not all disk accesses are equally expensive. The reason for this has to do with the way a disk drive works. Disk drives† belong to a class of devices known as direct access storage devices (DASDs) because they make it possible to access data directly. DASDs are contrasted with serial devices, the other major class of secondary storage devices. Serial devices use media such as magnetic tape that permit only serial access: a particular data item cannot be read or written until all of the data preceding it on the tape have been read or written in order.

Magnetic disks come in many forms. So-called hard disks offer high capacity and low cost per bit. Hard disks are the most common disk used in everyday file processing. Floppy disks are inexpensive, but they are slow and hold relatively little data. Floppies are good for backing up individual files or other floppies and for transporting small amounts of data. Removable disk packs are hard disks that can be mounted on the same drive at different times, providing a convenient form of backup storage that also makes it possible to access data directly.

Nonmagnetic disk media, especially optical discs, are becoming increasingly important for secondary storage. (See Appendix A for a full treatment of optical disc storage and its applications.)

†When we use the terms disks or disk drives, we are referring to magnetic disk media.

3.1.1 The Organization of Disks

The information stored on a disk is stored on the surface of one or more platters (Fig. 3.1). The arrangement is such that the information is stored in successive tracks on the surface of the disk (Fig. 3.2). Each track is often divided into a number of sectors. A sector is the smallest addressable portion of a disk. When a READ( ) statement calls for a particular byte from a disk file, the computer operating system finds the correct surface, track, and sector, reads the entire sector into a special area in RAM called a buffer, and then finds the requested byte within that buffer.

FIGURE 3.1 Schematic illustration of disk drive. (The labeled parts are the boom, the read/write heads, the spindle, and the platters.)

If a disk drive uses a number of platters, it may be called a disk pack. The tracks that are directly above and below one another form a cylinder (Fig. 3.3). The significance of the cylinder is that all of the information on a single cylinder can be accessed without moving the arm that holds the read/write heads. Moving this arm is called seeking. This arm movement is usually the slowest part of reading information from a disk.

FIGURE 3.2 Surface of disk showing tracks and sectors.

FIGURE 3.3 Schematic illustration of disk drive viewed as a set of seven cylinders.

3.1.2 Estimating Capacities and Space Needs

Disks range in width from 2 to about 14 inches. They range in storage capacity from less than 400,000 bytes to billions of bytes. In a typical disk pack, the top and bottom platter each contribute one surface, and all other platters contribute two surfaces to the pack, so the number of tracks per cylinder is a function of the number of platters.

The amount of data that can be held on a track depends on how densely bits can be stored on the disk surface. (This in turn depends on the quality of the recording medium and the size of the read/write heads.) An inexpensive, low-density disk can hold about 4 kilobytes on a track, and 35 tracks on a surface. A top-of-the-line disk can hold about 50 kilobytes on a track, and more than 1,000 tracks on a surface. Table D.1 in Appendix D shows how a variety of disk drives compare in terms of capacity, performance, and cost.

Since a cylinder consists of a group of tracks, a track consists of a group of sectors, and a sector consists of a group of bytes, it is easy to compute track, cylinder, and drive capacities:

    Track capacity    = number of sectors per track x bytes per sector
    Cylinder capacity = number of tracks per cylinder x track capacity
    Drive capacity    = number of cylinders x cylinder capacity.

If we know the number of bytes in a file, we can use these relationships to compute the amount of disk space the file is likely to require. Suppose, for instance, that we want to store a file with 20,000 fixed-length data records on a "typical" 300-megabyte small computer disk with the following characteristics:

    Number of bytes per sector    =   512
    Number of sectors per track   =    40
    Number of tracks per cylinder =    11
    Number of cylinders           = 1,331.

How many cylinders does the file require if each data record requires 256 bytes? Since each sector can hold two records, the file requires

    20,000 / 2 = 10,000 sectors.

One cylinder can hold

    40 x 11 = 440 sectors,

so the number of cylinders required is approximately

    10,000 / 440 = 22.7 cylinders.

Of course, it may be that a disk drive with 22.7 cylinders of available space does not have 22.7 physically contiguous cylinders available. In this likely case, the file might in fact have to be spread out over dozens, perhaps even hundreds, of cylinders.
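These capacity relationships are easy to capture in a few lines of code. The fragment below is an illustrative sketch, not part of the text; it simply reproduces the arithmetic of the example above for the stated drive characteristics.

    /* cylinders.c: estimate how many cylinders a file of fixed-length       */
    /* records needs, using the drive characteristics from the example.      */
    #include <stdio.h>

    int main()
    {
        long bytes_per_sector     = 512;
        long sectors_per_track    = 40;
        long tracks_per_cylinder  = 11;
        long record_count         = 20000;
        long record_size          = 256;

        long records_per_sector   = bytes_per_sector / record_size;      /* 2      */
        long sectors_needed       = record_count / records_per_sector;   /* 10,000 */
        long sectors_per_cylinder = sectors_per_track * tracks_per_cylinder; /* 440 */

        double cylinders_needed = (double) sectors_needed / sectors_per_cylinder;

        printf("Sectors needed:   %ld\n", sectors_needed);
        printf("Cylinders needed: %.1f\n", cylinders_needed);            /* about 22.7 */
        return 0;
    }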

3.1.3 Organizing Tracks by Sector

There are two basic ways to organize data on a disk: by sector and by user-defined block. So far, we have only mentioned sector organizations. In this section we examine sector organizations more closely. In the following section we look at block organizations.

The Physical Placement of Sectors  There are several views that one can have of the organization of sectors on a track. The simplest view, one that suffices for most users most of the time, is that sectors are adjacent, fixed-sized segments of a track that happen to hold a file (Fig. 3.4a). This is often a perfectly adequate way to view a file logically, but it may not be a good way to store sectors physically.

When you want to read a series of sectors that are all in the same track, one right after the other, you often cannot read adjacent sectors. That is because, after reading the data, it takes the disk controller a certain amount of time to process the received information before it is ready to accept more. So, if logically adjacent sectors were placed on the disk so they were also physically adjacent, we would miss the start of the following sector while we were processing the one we had just read in. Consequently, we would be able to read only one sector per revolution of the disk.

I/O system designers usually approach this problem by interleaving the sectors, leaving an interval of several physical sectors between logically adjacent sectors. Suppose our disk had an interleaving factor of 5. The assignment of logical sector content to the 32 physical sectors in a track is illustrated in Fig. 3.4(b). If you study this figure, you can see that it takes five revolutions to read the entire 32 sectors of a track. That is a big improvement over 32 revolutions.

FIGURE 3.4 Two views of the organization of sectors on a 32-sector track.


Over the last year or two, controller speeds have improved so high-performance disks can now offer 1:1 interleaving. This means that successive sectors actually are physically adjacent, making it possible to read an entire track in a single revolution of the disk.

Clusters  A third view of sector organization, also designed to improve performance, is the view maintained by that part of a computer's operating system that we call the file manager. When a program accesses a file, it is the file manager's job to map the logical parts of the file to their corresponding physical locations. It does this by viewing the file as a series of clusters of sectors. A cluster is a fixed number of contiguous sectors.† Once a given cluster has been found on a disk, all sectors in that cluster can be accessed without requiring an additional seek.

To view a file as a series of clusters and still maintain the sectored view, the file manager ties logical sectors to the physical clusters that they belong to by using a file allocation table (FAT). The FAT contains a list of all the clusters in a file, ordered according to the logical order of the sectors they contain. With each cluster entry in the FAT is an entry giving the physical location of the cluster (Fig. 3.5).

FIGURE 3.5 The file manager determines which cluster in the file has the sector that is to be accessed. (The figure shows the part of the file allocation table (FAT) pertaining to one file, pairing each cluster number with its physical location.)

On many systems, the system administrator can decide how many sectors there should be in a cluster. For instance, in the standard physical disk structure used by VAX systems, the system administrator sets the cluster size to be used on a disk when the disk is initialized. The default value is three 512-byte sectors per cluster, but the cluster size may be set to any value between 1 and 65,535 sectors. Since clusters represent physically contiguous groups of sectors, larger clusters guarantee the ability to read more sectors without seeking, so the use of large clusters can lead to substantial performance gains when a file is being processed sequentially.

†It is not always physically contiguous; the degree of physical contiguity is determined by the interleaving factor.
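The mapping that the file manager maintains can be pictured with a small data structure. The sketch below is purely illustrative and is not how any particular file manager is implemented; the structure and function names are assumptions of the example. It pairs each cluster of a file, in logical order, with a physical cluster address, which is the essential content of a FAT entry.

    #define SECTORS_PER_CLUSTER 3            /* an assumed cluster size             */

    struct fat_entry {
        long cluster_number;                 /* position of the cluster in the file */
        long physical_location;              /* where that cluster lives on disk    */
    };

    /* Given a logical sector number within the file, find the physical       */
    /* cluster that holds it by scanning the file's FAT entries.              */
    long physical_cluster_of(struct fat_entry fat[], int entries, long logical_sector)
    {
        long cluster = logical_sector / SECTORS_PER_CLUSTER;
        int  i;

        for (i = 0; i < entries; i++)
            if (fat[i].cluster_number == cluster)
                return fat[i].physical_location;
        return -1;                           /* sector is beyond the end of the file */
    }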

Extents  Our final view of sector organization represents a further attempt to emphasize physical contiguity of sectors in a file, hence minimizing seeking even more. (If you are getting the idea that the avoidance of seeking is an important part of file design, you are right.) If there is a lot of free room on a disk, it may be possible to make a file consist entirely of contiguous clusters. When this is the case, we say that the file consists of one extent: All of its sectors, tracks, and (if it is large enough) cylinders form one contiguous whole (Fig. 3.6a). This is a good situation, especially if the file is to be processed sequentially, because it means that the whole file can be accessed with a minimum amount of seeking.

If there is not enough contiguous space available to contain an entire file, the file is divided into two or more noncontiguous parts. Each part is an extent. When new clusters are added to a file, the file manager tries to make them physically contiguous to the previous end of the file, but if space is unavailable for this, it must add one or more extents (Fig. 3.6b). The most important thing to understand about extents is that as the number of extents in a file increases, the file becomes more spread out on the disk, and the amount of seeking required to process the file increases.

FIGURE 3.6 File extents (shaded area represents space on disk used by a single file).

Fragmentation  Generally, all sectors on a given drive must contain the same number of bytes. If, for example, the size of a sector is 512 bytes and the size of all records in a file is 300 bytes, there is no convenient fit between records and sectors. There are two ways to deal with this situation: Store only one record per sector, or allow records to span sectors, so the beginning of a record might be found in one sector and the end of it in another (Fig. 3.7).

FIGURE 3.7 Alternate record organization within sectors (shaded areas represent data records, and unshaded areas represent unused space).

The first option has the advantage that any record can be retrieved by retrieving just one sector, but it has the disadvantage that it might leave an enormous amount of unused space within each sector. This loss of space within a sector is called internal fragmentation. The second option has the advantage that it loses no space from internal fragmentation, but it has the disadvantage that some records may be retrieved only by accessing two sectors.

Another potential source of internal fragmentation results from the use of clusters. Recall that a cluster is the smallest unit of space that can be allocated for a file. When the number of bytes in a file is not an exact multiple of the cluster size, there will be internal fragmentation in the last extent of the file. For instance, if a cluster consists of three 512-byte sectors, a file containing one byte would use up 1,536 bytes on the disk; 1,535 bytes would be wasted due to internal fragmentation.

Clearly, there are important trade-offs in the use of large cluster sizes. A disk that is expected to have mainly large files that will often be processed sequentially would usually be given a large cluster size, since internal fragmentation would not be a big problem and the performance gains might be great. A disk holding smaller files or files that are usually accessed only randomly would normally be set up with small clusters.

3.1.4 Organizing Tracks by Block

Sometimes disk tracks are not divided into sectors, but into integral numbers of user-defined blocks whose size can vary. (Note: The word block has a different meaning in the context of the UNIX I/O system. See section 3.7 for details.) When the data on a track is organized by block, this usually means that the amount of data transferred in a single I/O operation can vary depending on the needs of the software designer, not the hardware. Blocks can normally be either fixed or variable in length, depending on the requirements of the file designer. As with sectors, blocks are often referred to as physical records. (Sometimes the word block is used as a synonym for a sector or group of sectors. To avoid confusion, we do not use it in that way here.) Figure 3.8 illustrates the difference between one view of data on a sectored track and that of a blocked track.

FIGURE 3.8 Sector organization versus block organization.

A block organization does not present the sector-spanning and fragmentation problems of sectors because blocks can vary in size to fit the logical organization of the data. A block is usually organized to hold an integral number of logical records. The term blocking factor is used to indicate the number of records that are to be stored in each block in a file. Hence, if we had a file with 300-byte records, a block-addressing scheme would let us define a block to be some convenient multiple of 300 bytes, depending on the needs of the program. No space would be lost to internal fragmentation, and there would be no need to load two blocks to retrieve one record.

Generally speaking, blocks are superior to sectors when it is desirable to have the physical allocation of space for records correspond to their logical organization. (There are disk drives that allow both sector-addressing and block-addressing, but we do not describe them here. See Bohl, 1981.)

In block-addressing schemes, each block of data is usually accompanied by one or more subblocks containing extra information about the data block. Typically there is a count subblock that contains (among other things) the number of bytes in the accompanying data block (Fig. 3.9a). There may also be a key subblock containing the key for the last record in the data block (Fig. 3.9b). When key subblocks are used, a track can be searched by the disk controller for a block or record identified by a given key. This means that a program can ask its disk drive to search among all the blocks on a track for a block with a desired key. This approach can result in much more efficient searches than are normally possible with sector-addressable schemes, in which keys cannot generally be interpreted without first loading them into primary memory.

FIGURE 3.9 Block addressing requires that each physical data block be accompanied by one or more subblocks containing information about its contents.

3.1.5 Nondata Overhead


Both blocks and sectors require that a certain amount of space be taken up
on the disk in the form of nondata overhead. Some of the overhead consists
of information that is stored on the disk during preformatting, which is done
before the disk can be used.

On sector-addressable disks, preformatting involves storing, at the beginning of each sector, such information as sector address, track address, and condition (whether the sector is usable or defective). Preformatting also involves placing gaps and synchronization marks between fields of information to help the read/write mechanism distinguish between them. This nondata overhead usually is of no concern to the programmer. When the sector size is given for a certain drive, the programmer can assume that this is the amount of actual data that can be stored in a sector.

On a block-organized disk, some of the nondata overhead is invisible to the programmer, but some of it must be accounted for by the programmer.

FIGURE 3.9 Block addressing requires that each physical data block be accompanied by one or more subblocks containing information about its contents.


Since subblocks and interblock gaps have to be provided with every block, there is generally more nondata information provided with blocks than with sectors. Also, since the number and sizes of blocks can vary from one application to another, the relative amount of space taken up by overhead can vary when block addressing is used. This is illustrated in the following example.

Suppose we have a block-addressable disk drive with 20,000 bytes per track, and the amount of space taken up by subblocks and interblock gaps is equivalent to 300 bytes per block. We want to store a file containing 100-byte records on the disk. How many records can be stored per track if the blocking factor is 10, or if it is 60?

1. If there are 10 100-byte records per block, each block holds 1,000 bytes of data and uses 300 + 1,000, or 1,300, bytes of track space when overhead is taken into account. The number of blocks that can fit on a 20,000-byte track can be expressed as

   ⌊20,000 / 1,300⌋ = ⌊15.38⌋ = 15.

   So 15 blocks, or 150 records, can be stored per track. (Note that we have to take the floor of the result because a block cannot span two tracks.)

2. If there are 60 100-byte records per block, each block holds 6,000 bytes of data and uses 6,300 bytes of track space. The number of blocks per track can be expressed as

   ⌊20,000 / 6,300⌋ = 3.

   So 3 blocks, or 180 records, can be stored per track.

Clearly, the larger blocking factor can lead to more efficient use of storage. When blocks are larger, fewer blocks are required to hold a file, so there is less space consumed by the 300 bytes of overhead that accompany each block.

Can we conclude from this example that larger blocking factors always lead to more efficient storage utilization? Not necessarily. Since we can put only an integral number of blocks on a track, and since tracks are fixed in length, we almost always lose some space at the end of a track. Here we have the internal fragmentation problem again, but this time it applies to fragmentation within a track. The greater the block size, the greater the potential amount of internal track fragmentation. What would have happened if we had chosen a blocking factor of 98 in the preceding example? What about 97?
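To make the arithmetic easy to experiment with (for example, to answer the questions about blocking factors of 98 and 97), here is a small sketch in C. The track capacity, per-block overhead, and record size are simply the figures assumed in the example above, not properties of any real drive.

    #include <stdio.h>

    /* Records that fit on one track for a given blocking factor,
       using the figures from the example above. */
    #define TRACK_BYTES     20000   /* usable bytes per track            */
    #define BLOCK_OVERHEAD    300   /* subblocks + interblock gap, bytes */
    #define RECORD_BYTES      100   /* bytes per logical record          */

    static long records_per_track(int blocking_factor)
    {
        long block_bytes = (long)blocking_factor * RECORD_BYTES + BLOCK_OVERHEAD;
        long blocks = TRACK_BYTES / block_bytes;   /* integer division = floor */
        return blocks * blocking_factor;
    }

    int main(void)
    {
        int factors[] = { 1, 10, 60, 97, 98 };
        for (int i = 0; i < 5; i++)
            printf("blocking factor %3d -> %ld records per track\n",
                   factors[i], records_per_track(factors[i]));
        return 0;
    }

Running it reproduces the 150 and 180 records per track computed above and shows what happens at the larger blocking factors.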


The flexibility introduced by the use of blocks, rather than sectors, can result in savings in time and efficiency, since it lets the programmer determine to a large extent how data are to be organized physically on a disk. On the negative side, blocking schemes require the programmer and/or operating system to do the extra work of determining the data organization. Also, the very flexibility introduced by the use of blocking schemes precludes the synchronization of I/O operations with the physical movement of the disk, which sectoring permits. This means that strategies such as sector interleaving cannot be used to improve performance.

3.1.6 The Cost of a Disk Access

To give you a feel for the factors contributing to the total amount of time needed to access a file on a fixed disk, we calculate some access times. A disk access can be divided into three distinct physical operations, each with its own cost: seek time, rotational delay, and transfer time.
Seek Time    Seek time is the time required to move the access arm to the correct cylinder. The amount of time spent seeking during a disk access depends, of course, on how far the arm has to move. If we are accessing a file sequentially and the file is packed into several consecutive cylinders, seeking needs to be done only after all of the tracks on a cylinder have been processed, and even then the read/write head needs to move the width of only one track. At the other extreme, if we are alternately accessing sectors from two files that are stored at opposite extremes on a disk (one at the innermost cylinder, one at the outermost cylinder), seeking is very expensive.

Seeking is likely to be more costly in a multiuser environment, where several processes are contending for use of the disk at one time, than in a single-user environment, where disk usage is dedicated to one process.

Since seeking can be very costly, system designers often go to great extremes to minimize seeking. In an application that merges three files, for example, it is not unusual to see the three input files stored on three different drives and the output file stored on a fourth drive, so no seeking need be done as I/O operations jump from file to file.

Since it is usually impossible to know exactly how many tracks will be traversed in every seek, we usually try to determine the average seek time required for a particular file operation. If the starting and ending positions for each access are random, it turns out that the average seek traverses one third of the total number of cylinders that the read/write head ranges over.† Manufacturers' specifications for disk drives often list this figure as the average seek time for the drives. Most hard disks available today (1991) have average seek times of less than 40 milliseconds (msec), and high-performance disks have average seek times as low as 10 msec.

†Derivations of this result, as well as more detailed and refined models, can be found in Wiederhold (1983), Knuth (1973b), Teory and Fry (1982), and Salzberg (1988).

FIGURE 3.10 When a single file can span several tracks on a cylinder, we can stagger the beginnings of the tracks to avoid rotational delay when moving from track to track during sequential access.

Rotational Delay    Rotational delay refers to the time it takes for the disk to rotate so the sector we want is under the read/write head. Hard disks usually rotate at about 3,600 rpm, which is one revolution per 16.7 msec. On average, the rotational delay is half a revolution, or about 8.3 msec. On floppy disks, which often rotate at only 360 rpm, average rotational delay is a sluggish 83.3 msec.

As in the case of seeking, these averages apply only when the read/write head moves from some random place on the disk surface to the target track. In many circumstances, rotational delay can be much less than the average. For example, suppose that you have a file that requires two or more tracks, that there are plenty of available tracks on one cylinder, and that you write the file to disk sequentially, with one write call. When the first track is filled, the disk can immediately begin writing to the second track, without any rotational delay. The "beginning" of the second track is effectively staggered by just the amount of time it takes to switch from the read/write head on the first track to the read/write head on the second. Rotational delay, as it were, is virtually nonexistent. Furthermore, when you read the file back, the position of data on the second track ensures that there is no rotational delay in switching from one track to another. Figure 3.10 illustrates this staggered arrangement.

Transfer Time    Once the data we want is under the read/write head, it can be transferred. The transfer time is given by the formula

    transfer time = (number of bytes transferred / number of bytes on a track) × rotation time.

If a drive is sectored, the transfer time for one sector depends on the number of sectors on a track. For example, if there are 32 sectors per track, the time required to transfer one sector would be 1/32nd of a revolution, or 0.5 msec.

Some Timing Computations    Let's look at two different file processing situations that show how different types of file access can affect access times. We will compare the time it takes to access a file in sequence with the time it takes to access all of the records in the file randomly. In the former case, we use as much of the file as we can whenever we access it. In the random-access case, we are able to use only one record on each access.

The basis for our calculations is the "typical" 300-megabyte fixed disk described in Table 3.1. This particular disk is typical of one that might be used with a workstation in 1991. Although it is typical only of a certain class of fixed disk, the observations we draw as we perform these calculations are quite general. The disks used with larger, more expensive computers are bigger and faster than this disk, but the nature and relative costs of the factors contributing to total access times are essentially the same.

TABLE 3.1 Specifications of disk drive used in examples in text

    Minimum (track-to-track) seek time    6 msec
    Average seek time                     18 msec
    Rotational delay                      8.3 msec
    Maximum transfer rate                 16.7 msec/track, or 1,229 bytes/msec
    Bytes per sector                      512
    Sectors per track                     40
    Tracks per cylinder                   11
    Tracks per surface                    1,331
    Interleave factor                     1
    Cluster size                          8 sectors
    Smallest extent size                  5 clusters

Since our drive uses a cluster size of 8 sectors (4,096 bytes) and the smallest extent is 5 clusters, space is allocated for storing files in one-track units. Sectors are interleaved with an interleave factor of 1, so data on a given track can be transferred at the stated transfer rate.

Let's suppose that we wish to know how long it will take, using this drive, to read a 2,048-K-byte file that is divided into 8,000 256-byte records. First we need to know how the file is distributed on the disk. Since the 4,096-byte cluster holds 16 records, the file will be stored as a sequence of 500 4,096-byte clusters. Since the smallest extent size is 5 clusters, the 500 clusters are stored as 100 extents, occupying 100 tracks.

This means that the disk needs 100 tracks to hold the entire 2,048 K bytes that we want to read. We assume a situation in which the 100 tracks are randomly dispersed over the surface of the disk. (This is an extreme situation chosen to dramatize the point we want to make. Still, it is not so extreme that it could not easily occur on a typical overloaded disk that has a large number of small files.)

Now we are ready to calculate the time it would take to read the 2,048-K-byte file from the disk. We first estimate the time it takes to read the file sector by sector in sequence. This process involves the following operations for each track:

    Average seek          18 msec
    Rotational delay       8.3 msec
    Read one track        16.7 msec
    Total                 43 msec

We want to find and read 100 tracks, so the total time is

    100 × 43 msec = 4,300 msec = 4.3 seconds.

Now let's calculate the time it would take to read in the same 8,000 records using random access rather than sequential access. In other words, rather than being able to read one sector right after another, we assume that we have to access the records in some order that requires jumping from track to track every time we read a new sector. This process involves the following operations for each record:

    Average seek                       18 msec
    Rotational delay                    8.3 msec
    Read one cluster (1/5 × 16.7)       3.3 msec
    Total                              29.6 msec

    Total time = 8,000 × 29.6 msec = 236,800 msec = 236.8 seconds.

This difference in performance between sequential access and random access is very important. If we can get to the right location on the disk and read a lot of information sequentially, we are clearly much better off than we are if we have to jump around, seeking every time we need a new record. Remember that seek time is very expensive; when we are performing disk operations we should try to minimize seeking.
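These estimates are easy to reproduce programmatically. The following C sketch uses the drive characteristics from Table 3.1 and the file layout assumed above; the constant names are ours, not part of any standard interface.

    #include <stdio.h>

    /* Access-time estimates for the drive of Table 3.1 and the
       2,048-K-byte file of 8,000 256-byte records discussed above. */
    #define AVG_SEEK_MS      18.0   /* average seek time         */
    #define ROT_DELAY_MS      8.3   /* average rotational delay  */
    #define TRACK_READ_MS    16.7   /* one full revolution       */
    #define SECTORS_PER_TRK  40
    #define CLUSTER_SECTORS   8

    int main(void)
    {
        int tracks  = 100;    /* extents holding the file   */
        int records = 8000;   /* records read one at a time */

        /* Sequential: one seek, one rotational delay, and one
           full-track read per track. */
        double seq_ms = tracks * (AVG_SEEK_MS + ROT_DELAY_MS + TRACK_READ_MS);

        /* Random: one seek, one rotational delay, and one cluster
           read per record. */
        double cluster_ms = TRACK_READ_MS * CLUSTER_SECTORS / SECTORS_PER_TRK;
        double rnd_ms = records * (AVG_SEEK_MS + ROT_DELAY_MS + cluster_ms);

        printf("sequential: %.1f seconds\n", seq_ms / 1000.0);  /* about 4.3 */
        printf("random:     %.1f seconds\n", rnd_ms / 1000.0);  /* about 237 */
        return 0;
    }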
read a lot of information sequentially,

3.1.7 Effect of Block Size on Performance: A UNIX Example

In deciding how best to organize disk storage allocation for several versions of BSD UNIX, the Computer Systems Research Group (CSRG) in Berkeley investigated the trade-offs between block size and performance in a UNIX environment (Leffler et al., 1989). The results of their research provide an interesting case study involving trade-offs between block size, fragmentation, and access time.

The CSRG research indicated that the minimum block size of 512 bytes, standard at the time on UNIX systems, was not very efficient in a typical UNIX environment. Files that were several blocks long often were scattered over many cylinders, resulting in frequent seeks and thereby significantly decreasing throughput. The researchers found that doubling the block size to 1,024 bytes improved performance by more than a factor of 2. But even with 1,024-byte blocks, they found that throughput was only about 4% of the theoretical maximum. Eventually, they found that 4,096-byte blocks provided the fastest throughput, but this led to large amounts of wasted space due to internal fragmentation. These results are summarized in Table 3.2.

TABLE 3.2 The amount of wasted space as a function of block size

    Space Used (Mbyte)   Percent Waste   Organization
    775.2                 0.0            Data only, no separation between files
    807.8                 4.2            Data only, each file starts on 512-byte boundary
    828.7                 6.9            Data + inodes, 512-byte block UNIX file system
    866.5                11.8            Data + inodes, 1,024-byte block UNIX file system
    948.5                22.4            Data + inodes, 2,048-byte block UNIX file system
    1,128.3              45.6            Data + inodes, 4,096-byte block UNIX file system

Source: The Design and Implementation of the 4.3BSD UNIX Operating System, Leffler et al., p. 198.

To gain the advantages of both the 4,096-byte and the 512-byte systems, the Berkeley group implemented a variation of the cluster concept (see section 3.1.3). In the new implementation, they allocate 4,096-byte blocks for files that are big enough to need them; but for smaller files, they allow the large blocks to be divided into one or more fragments. With a fragment size of 512 bytes, as many as eight small files can be stored in one block, greatly reducing internal fragmentation. With the 4,096/512 system, wasted space was found to decline to about 12%.

3.1.8 Disk as Bottleneck

Disk performance is increasing steadily, even dramatically, but disk speeds still lag far behind local network speeds. A high-performance disk drive with 50 K bytes per track can transmit at a peak rate of about 3 megabytes per second, and only a fraction of that under normal conditions. High-performance networks, in contrast, can transmit at rates of as much as 100 megabytes per second. The result can often mean that a process is disk bound: the network and the CPU have to wait inordinate lengths of time for the disk to transmit data.

A number of techniques are used to solve this problem. One is multiprogramming, in which the CPU works on other jobs while waiting for the data to arrive. But if multiprogramming is not available, or if the process simply cannot afford to lose so much time waiting for the disk, ways must be found to speed up disk I/O.

One technique that is now offered on many high-performance systems is called striping. Disk striping involves splitting the parts of a file on several different drives, then letting the separate drives deliver parts of the file to the network simultaneously.

For example, suppose we have a 10-megabyte file spread across 20 high-performance (3 megabytes per second) drives that hold 50 K per track. The first drive has the first 50 K of the file, the second drive has the second 50 K, and so on, through the twentieth drive. The first drive also holds the twenty-first 50 K, and so forth until 10 megabytes are stored. Collectively, the 20 drives can deliver to the network 1,000 K per revolution, a combined rate of 60 megabytes per second.

Disk striping exemplifies an important concept that we see more and more in system configurations: parallelism. Whenever there is a bottleneck at some point in the system, consider duplicating the thing that is the source of the bottleneck, and configure the system so several of them operate in parallel.
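A sketch of the round-robin layout the example describes: each fixed-size chunk of the file is assigned to a drive in rotation, so consecutive chunks can be delivered by different drives at the same time. The chunk size and drive count are simply the example's figures, not properties of any particular striping product.

    #include <stdio.h>

    #define NUM_DRIVES   20
    #define CHUNK_BYTES  (50 * 1024)   /* 50 K per track, as in the example */

    /* Which drive holds chunk number i of the striped file, and where
       on that drive the chunk begins. */
    static void locate_chunk(long chunk, int *drive, long *offset)
    {
        *drive  = (int)(chunk % NUM_DRIVES);           /* round-robin       */
        *offset = (chunk / NUM_DRIVES) * CHUNK_BYTES;  /* position on drive */
    }

    int main(void)
    {
        for (long chunk = 0; chunk < 25; chunk++) {
            int drive;
            long offset;
            locate_chunk(chunk, &drive, &offset);
            printf("chunk %2ld -> drive %2d, offset %ld\n", chunk, drive, offset);
        }
        return 0;
    }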

Another approach to solving the disk bottleneck is to avoid accessing the disk at all. As the cost of RAM steadily decreases, more and more users are using RAM to hold data that a few years ago had to be kept on a disk. Two effective ways in which RAM can be used to replace secondary storage are RAM disks and disk caches.

A RAM disk is a large part of RAM configured to simulate the behavior of a mechanical disk in every respect except speed and volatility. Since data can be located in RAM without a seek or rotational delay, RAM disks can provide much faster access than mechanical disks. Since RAM is normally volatile, the contents of a RAM disk are lost when the computer is turned off. RAM disks are often used in place of floppy disks because they are much faster than floppies and because relatively little RAM is needed to simulate a typical floppy disk.

A disk cache† is a large block of RAM configured to contain pages of data from a disk. A typical disk-caching scheme might use a 256-K cache with a disk. When data is requested from secondary memory, the file manager first looks into the disk cache to see if it contains the page with the requested data. If it does, the data can be processed immediately. Otherwise, the file manager reads the page containing the data from disk, replacing some page already in the disk cache.

Cache memory can provide substantial improvements in performance, especially when a program's data access patterns exhibit a high degree of locality. Locality exists in a file when blocks that are accessed in close temporal sequence are stored close to one another on the disk. When a disk cache is used, blocks that are close to one another on the disk are much more likely to belong to the page or pages that are read in with a single read, diminishing the likelihood that extra reads are needed for extra accesses.

RAM disks and cache memory are examples of buffering, a very important and frequently used family of I/O techniques. We take a closer look at buffering in section 3.6.

In these three techniques we see once again examples of the need to make trade-offs in file processing. With RAM disks and disk caches, there is tension between the cost/capacity advantages of disk over RAM, on the one hand, and the speed of RAM on the other. Striping provides opportunities to increase throughput enormously, but at the cost of a more complex and sophisticated disk management system. Good file design balances these tensions and costs creatively.

†The term cache (as opposed to disk cache) generally refers to a very high-speed block of RAM that performs the same types of performance-enhancing operations with respect to primary memory that a disk cache does with respect to secondary memory.


3.2 Magnetic Tape

Magnetic tape units belong to a class of devices that provide no direct accessing facility but that can provide very rapid sequential access to data. Tapes are compact, stand up well under different environmental conditions, are easy to store and transport, and are less expensive than disks.

3.2.1 Organization of Data on Tapes

Since tapes are accessed sequentially, there is no need for addresses to identify the locations of data on a tape. On a tape, the logical position of a byte within a file corresponds directly to its physical position relative to the start of the file. We may envision the surface of a typical tape as a set of parallel tracks, each of which is a sequence of bits. If there are nine tracks (see Fig. 3.11), the nine bits that are at corresponding positions in the nine respective tracks are taken to constitute one byte, plus a parity bit. So a byte can be thought of as a one-bit-wide slice of tape. Such a slice is called a frame.

The parity bit is not part of the data but is used to check the validity of the data. If odd parity is in effect, this bit is set to make the number of 1 bits in the frame odd. Even parity works similarly but is rarely used with tapes.

Frames (bytes) are grouped into data blocks whose size can vary from a few bytes to many kilobytes, depending on the needs of the user. Since tapes are often read one block at a time, and since tapes cannot stop or start instantaneously, blocks are separated by interblock gaps, which contain no information and are long enough to permit stopping and starting.

FIGURE 3.11 Nine-track tape. [The figure shows the nine parallel tracks, a one-bit-wide frame cutting across them, and data blocks separated by gaps.]

When tapes use odd parity, no valid frame can contain all 0 bits, so a large number of consecutive 0 frames is used to fill the interrecord gap.

Tape drives come in many shapes, sizes, and speeds. Performance differences among drives can usually be measured in terms of three quantities:

    Tape density: commonly 800, 1,600, or 6,250 bits per inch (bpi) per track, but recently as much as 30,000 bpi;
    Tape speed: commonly 30 to 200 inches per second (ips); and
    Size of interblock gap: commonly between 0.3 inch and 0.75 inch.

Note that a 6,250-bpi nine-track tape contains 6,250 bits per inch per track, and 6,250 bytes per inch when the full nine tracks are taken together. Thus, in the computations that follow, 6,250 bpi is usually taken to mean 6,250 bytes of data per inch.

3.2.2 Estimating Tape Length Requirements

Suppose we want to store a backup copy of a large mailing list file with one million 100-byte records. If we want to store the file on a 6,250-bpi tape that has an interblock gap of 0.3 inches, how much tape is needed?

To answer this question we first need to determine what takes up space on the tape. There are two primary contributors: interblock gaps and data blocks. For every data block there is an interblock gap. If we let

    b = the physical length of a data block,
    g = the length of an interblock gap, and
    n = the number of data blocks,

then the space requirement s for storing the file is

    s = n × (b + g).

We know that g is 0.3 inch, but we do not know what b and n are. In fact, b is whatever we want it to be, and n depends on our choice of b. Suppose we choose each data block to contain one 100-byte record. Then b, the length of each block, is given by

    b = block size (bytes per block) / tape density (bytes per inch) = 100 / 6,250 = 0.016 inch,

and n, the number of blocks, is one million (one per record).

The number of records stored in a physical block is called the blocking factor. It has the same meaning that it had when it was applied to the use of blocks for disk storage. The blocking factor we have chosen here is 1 because each block has only one record. Hence, the space requirement for the file is

    s = 1,000,000 × (0.016 + 0.3) inch
      = 1,000,000 × 0.316 inch
      = 316,000 inches
      = 26,333 feet.

Magnetic tapes range in length from 300 feet to 3,600 feet, with 2,400 feet being the most common length. Clearly, we need quite a few 2,400-foot tapes to store the file. Or do we? You may have noticed that our choice of block size was not a very smart one from the standpoint of space utilization. The interblock gaps in the physical representation of the file take up about 19 times as much space on the tape as the data blocks do. If we were to take a snapshot of our tape, it would look something like this:

    Data | Gap | Data | Gap | Data | Gap | Data

Most of the space on the tape is not used!

Clearly, we should consider increasing the relative amount of space used for actual data if we want to try to squeeze the file onto one 2,400-foot tape. If we increase the blocking factor, we can decrease the number of blocks, which decreases the number of interblock gaps, which in turn decreases the amount of space consumed by interblock gaps. For example, if we increase the blocking factor from 1 to 50, the number of blocks becomes

    1,000,000 / 50 = 20,000,

and the space requirement for interblock gaps decreases from 300,000 inches to 6,000 inches. The space requirement for the data is of course the same as it was previously. What has changed is the relative amount of space occupied by the gaps, as compared to the data. Now a snapshot of the tape would look much different:

    Data | Gap | Data | Gap | Data | Gap | Data | Gap | Data

We leave it to you to show that the file can easily fit on one 2,400-foot tape when a blocking factor of 50 is used.

When we compute the space requirements for our file, we produce numbers that are quite specific to our file. A more general measure of the effect of choosing different block sizes is effective recording density. The effective recording density is supposed to reflect the amount of actual data that can be stored per inch of tape. Since this depends exclusively on the relative sizes of the interblock gap and the data block, it can be defined as

    effective recording density = number of bytes per block / number of inches required to store a block.

When a blocking factor of 1 is used in our example, the number of bytes per block is 100, and the number of inches required to store a block is 0.316. Hence, the effective recording density is

    100 bytes / 0.316 inches = 316.4 bpi,

which is a far cry from the nominal recording density of 6,250 bpi.

Either way you look at it, space utilization is sensitive to the relative sizes of data blocks and interblock gaps. Let us now see how they affect the amount of time it takes to transmit tape data.
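The space calculation is easy to parameterize by blocking factor. The following C sketch reproduces the figures above (26,333 feet for a blocking factor of 1) and lets you check the claim about a blocking factor of 50; the constants are the example's, not properties of any real tape drive.

    #include <stdio.h>

    #define RECORDS        1000000L
    #define RECORD_BYTES   100.0
    #define DENSITY_BPI    6250.0   /* bytes per inch (nine tracks together) */
    #define GAP_INCHES     0.3

    int main(void)
    {
        int factors[] = { 1, 50 };
        for (int i = 0; i < 2; i++) {
            int bf = factors[i];
            double block_len = bf * RECORD_BYTES / DENSITY_BPI;   /* inches */
            long   nblocks   = RECORDS / bf;
            double length_in = nblocks * (block_len + GAP_INCHES);
            double eff_bpi   = bf * RECORD_BYTES / (block_len + GAP_INCHES);
            printf("blocking factor %2d: %8.0f feet of tape, %7.1f bpi effective\n",
                   bf, length_in / 12.0, eff_bpi);
        }
        return 0;
    }

With a blocking factor of 50 the file needs well under 2,400 feet of tape, which is the result you are asked to verify above.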

3.2.3 Estimating Data Transmission Times


If you understand the role of interblock gaps and data block sizes in determining effective recording density, you can probably see immediately that these two factors also affect the rate of data transmission. Two other factors that affect the rate of data transmission to or from tape are the nominal recording density and the speed with which the tape passes the read/write head. If we know these two values, we can compute the nominal data transmission rate:

    nominal rate = tape density (bpi) × tape speed (ips).

Hence, our 6,250-bpi, 200-ips tape has a nominal transmission rate of

    6,250 × 200 = 1,250,000 bytes/sec = 1,250 kilobytes/sec.

This rate is competitive with most disk drives.

But what about those interblock gaps? Once our data gets dispersed by interblock gaps, the effective transmission rate certainly suffers. Suppose, for example, that we use our blocking factor of 1 with the same file and tape discussed in the preceding section (1,000,000 100-byte records, 0.3-inch gap). We saw that the effective recording density for this tape organization is 316.4 bpi. If the tape is moving at a rate of 200 ips, then its effective transmission rate is

    316.4 × 200 = 63,280 bytes/sec = 63.3 kilobytes/sec,

a rate that is about one twentieth of the nominal rate!

It should be clear that a blocking factor larger than 1 improves on this result, and that a substantially larger blocking factor improves on it substantially.

Although there are other factors that can influence performance, block size is generally considered to be the one variable with the greatest influence on space utilization and data transmission rate. The other factors we have included (gap size, tape speed, and recording density) are often beyond the control of the user. Another factor that can sometimes be important is the time it takes to start and stop the tape. We consider start/stop time in the exercises at the end of this chapter.
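The relationship between blocking factor and effective transmission rate can be expressed in a few lines of C. This sketch uses the figures from the tape described above; the names are ours.

    #include <stdio.h>

    #define DENSITY_BPI  6250.0   /* nominal recording density, bytes/inch */
    #define SPEED_IPS     200.0   /* tape speed, inches/second             */
    #define GAP_INCHES      0.3
    #define RECORD_BYTES  100.0

    /* Effective transmission rate (bytes/sec) for a given blocking factor. */
    static double effective_rate(int blocking_factor)
    {
        double block_inches = blocking_factor * RECORD_BYTES / DENSITY_BPI;
        double eff_density  = blocking_factor * RECORD_BYTES
                              / (block_inches + GAP_INCHES);
        return eff_density * SPEED_IPS;
    }

    int main(void)
    {
        printf("nominal:            %.0f bytes/sec\n", DENSITY_BPI * SPEED_IPS);
        printf("blocking factor 1:  %.0f bytes/sec\n", effective_rate(1));
        printf("blocking factor 50: %.0f bytes/sec\n", effective_rate(50));
        return 0;
    }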

3.2.4 Tape Applications


Magnetic tape is an appropriate medium for sequential processing applications if the files being processed are not likely also to be used in applications that require direct access. For example, consider the problem of updating a mailing list for a monthly periodical. Is it essential that the list be kept absolutely current, or is a monthly update of the list sufficient?

If information must be up-to-the-minute, then the medium must permit direct access so individual updates can be made immediately. But if the mailing list needs to be current only when mailing labels are printed, all of the changes that occur during the course of a month can be collected in one batch and put into a transaction file that is sorted in the same way that the mailing list is sorted. Then a program that reads through the two files simultaneously can be executed, making all the required changes in one pass through the data.

Since tape is relatively inexpensive, it is an excellent medium for storing data offline. At current prices, a removable disk pack that holds 150 megabytes costs about 30 times as much as a reel of tape that, properly blocked, can hold the same amount. Tape is a good medium for archival storage and for transporting data, as long as the data does not have to be available on short notice for direct processing.

A special kind of tape drive, a streaming tape drive, is used widely for nonstop, high-speed dumping of data to and from disks. Generally less expensive than general-purpose tape drives, it is also less suited for processing that involves much starting and stopping.

3.3 Disk versus Tape


In the past, magnetic tape and magnetic disk accounted for the lion's share of all secondary storage applications. Disk was excellent for random access and storage of files for which immediate access was desired; tape was ideal for processing data sequentially and for long-term storage of files. Over time, these roles have changed somewhat in favor of disk.

The major reason that tape was preferable to disk for sequential processing is that tapes are dedicated to one process, while disk generally serves several processes. This means that between accesses a disk read/write head tends to move away from the location where the next sequential access will occur, resulting in an expensive seek; while the tape drive, being dedicated to one process, pays no such price in seek time.

This problem of excessive seeking has gradually diminished, and disk has taken over much of the secondary storage niche previously occupied by tape. This change is largely due to the continued dramatic decreases in the cost of disk and RAM storage. To fully understand this change, we need to understand the role of RAM buffer space in performing I/O.† Briefly, it is that performance depends largely on how big a chunk of a file we can transmit at any time; as more RAM space becomes available for I/O buffers, the number of accesses decreases correspondingly, which means that the number of seeks required goes down as well. Most systems now available, even small systems, have enough RAM to decrease the number of accesses required to process most files to a level that makes disk quite competitive with tape for sequential processing. This change, added to the superior versatility and decreasing costs of disks, has resulted in use of disk for most sequential processing, which in the past was primarily the domain of tape.

This is not to say that tapes should not be used for sequential processing. If a file is kept on tape, and there are enough drives available to use them for sequential processing, it may be more efficient to process the file directly from tape than to stream it to disk and then process it sequentially.

Although it has lost ground to disk in sequential processing applications, tape remains important as a medium for long-term archival storage. Tape is still far less expensive than magnetic disk, and it is very easy and fast to stream large files or sets of files between tape and disk. In this context, tape has emerged as one of our most important media (along with CD-ROM) for tertiary storage.

†Techniques for RAM buffering are covered in section 3.6.

3.4 Storage as a Hierarchy

Although the best mixture of devices for a computing system depends on the needs of the system's users, we can imagine any computing system as a hierarchy of storage devices of different speed, capacity, and cost. Figure 3.12 summarizes the different types of storage found at different levels in such hierarchies and shows approximately how they compare in terms of access time, capacity, and cost.

FIGURE 3.12 Approximate comparisons of types of storage, circa 1991. [The figure compares devices and media at each level of the hierarchy in terms of access time (sec), capacity (bytes), and cost (cents/bit): primary memory (registers, core and RAM, RAM disk and disk cache); secondary storage, both direct-access (magnetic disks) and serial (tape and mass storage); and offline archival and backup storage (removable magnetic disks, optical discs, and tapes).]

3.5 A Journey of a Byte

FIGURE 3.13 The WRITE( ) statement tells the operating system to send one character to disk and gives the operating system the location of the character. The operating system takes over the job of doing the actual writing and then returns control to the calling program. [The figure shows the user's program issuing WRITE("text", c, 1) and the operating system's file I/O system getting one byte from the variable c in the user program's data area and writing it to the current location in the text file.]

What happens when a program writes a byte to a file on a disk? We know what the program does (it says WRITE(...)), and we now know something about how the byte is stored on a disk, but we haven't looked at what happens between the program and the disk. The whole story of what happens to data between program and disk is not one we can tell here, but we can give you an idea of the many different pieces of hardware and software involved and the many jobs that have to be done by looking at one example of a journey of one byte.

Suppose we want to append a byte representing the character 'P' stored in a character variable c to a file named TEXT stored somewhere on a disk. From the program's point of view, the entire journey that the byte will take might be represented by the statement

    WRITE(TEXT, c, 1)

but the journey is much longer than this simple statement suggests.

The WRITE( ) statement results in a call to the computer's operating system, which has the task of seeing that the rest of the journey is completed successfully (Fig. 3.13). Often our program can provide the operating system with information that helps it carry out this task more effectively, but once the operating system has taken over, the job of overseeing the rest of the journey is largely beyond our program's control.

3.5.1 The File Manager

An operating system is not a single program, but a collection of programs, each one designed to manage a different part of the computer's resources. Among these programs are ones that deal with file-related matters and I/O devices. We call this subset of programs the operating system's file manager. The file manager may be thought of as several layers of procedures (Fig. 3.14), with the upper layers dealing mostly with symbolic, or logical, aspects of file management, and the lower layers dealing more with the physical aspects. Each layer calls the one below it, until, at the lowest level, the byte is actually written to the disk.

The file manager begins by finding out whether the logical characteristics of the file are consistent with what we are asking it to do with the file. It may look up the requested file in a table, where it finds out such things as whether the file has been opened, what type of file the byte is being sent to (a binary file, a text file, or some other organization), who the file's owner is, and whether WRITE( ) access is allowed for this particular user of the file.

The file manager must also determine where in the file TEXT the 'P' is to be deposited. Since the 'P' is to be appended to the file, the file manager needs to know where the end of the file is, that is, the physical location of the last sector in the file. This information is obtained from the file allocation table (FAT) described earlier. From the FAT, the file manager locates the drive, cylinder, track, and sector where the byte is to be stored.

3.5.2 The I/O Buffer

Next, the file manager determines whether the sector that is to contain the 'P' is already in RAM or needs to be loaded into RAM. If the sector needs to be loaded, the file manager must find an available system I/O buffer space for it, then read the sector from the disk. Once it has the sector in a buffer in RAM, the file manager can deposit the 'P' into its proper position in the buffer (Fig. 3.15). The system I/O buffer allows the file manager to read and write data in sector-sized or block-sized units. In other words, it enables the file manager to ensure that the organization of data in RAM conforms to the organization it will have on the disk.

FIGURE 3.14 Layers of procedures involved in transmitting a byte from a program's data area to a file called TEXT on disk.

Logical

1. The program asks the operating system to write the contents of the variable c to the next available position in TEXT.
2. The operating system passes the job on to the file manager.
3. The file manager looks up TEXT in a table containing information about it, such as whether the file is open and available for use, what types of access are allowed, if any, and what physical file the logical name TEXT corresponds to.
4. The file manager searches a file allocation table for the physical location of the sector that is to contain the byte.
5. The file manager makes sure that the last sector in the file has been stored in a system I/O buffer in RAM, then deposits the 'P' into its proper position in the buffer.
6. The file manager gives instructions to the I/O processor about where the byte is stored in RAM and where it needs to be sent on the disk.
7. The I/O processor finds a time when the drive is available to receive the data and puts the data in proper format for the disk. It may also buffer the data to send it out in chunks of the proper size for the disk.
8. The I/O processor sends the data to the disk controller.
9. The controller instructs the drive to move the read/write head to the proper track, waits for the desired sector to come under the read/write head, then sends the byte to the drive to be deposited, bit-by-bit, on the surface of the disk.

Physical

Instead of sending the sector immediately to the disk, the file manager usually waits to see if it can accumulate more bytes going to the same sector before actually transmitting anything. Even though the statement WRITE(TEXT,c,1) seems to imply that our character is being sent immediately to the disk, it may in fact be kept in RAM for some time before it is sent. (There are many situations in which the file manager cannot wait until a buffer is filled before transmitting it. For instance, if TEXT were closed, it would have to flush all output buffers holding data waiting to be written to TEXT so the data would not be lost.)

FIGURE 3.15 The file manager moves 'P' from the program's data area to a system output buffer, where it may join other bytes headed for the same place on the disk. If necessary, the file manager may have to load the corresponding sector from the disk into the system output buffer. [The figure shows the user's program issuing WRITE("text", c, 1) and the file I/O system (1) loading the last sector from TEXT into a system output buffer if necessary, and (2) moving 'P' from the user's data area into the system output buffer.]

3.5.3 The Byte Leaves RAM: The I/O Processor and Disk Controller

So far, all of our byte's activities have occurred within the computer's
primary memory and have probably been carried out by the computer's
central processing unit (CPU). The byte has travelled along data paths that
are designed to be very fast and that are relatively expensive. Now it is time
for the byte to travel along a data path that is likely to be slower and
narrower than the one in primary memory. (A typical computer might have
an internal data-path width of four bytes, whereas the width of the path
leading to the disk might be only two bytes.)
Because of bottlenecks created by these differences in speed and data-path widths, our byte and its companions might have to wait for an external data path to become available. This also means that the CPU has extra time on its hands as it deals out information in small enough chunks and at slow enough speeds that the world outside can handle them. In fact, the differences between the internal and external speeds for transmitting data are often so great that the CPU can transmit to several external devices simultaneously.

The processes of disassembling and assembling groups of bytes for transmission to and from external devices are so specialized that it is unreasonable to ask an expensive, general-purpose CPU to spend its valuable time doing I/O when a simpler device could do the job as well, freeing the CPU to do the work that it is most suited for. Such a special-purpose device is called an I/O processor.

An I/O processor may be anything from a simple chip capable of taking a byte and, on cue, just passing it on, to a powerful, small computer capable of executing very sophisticated programs and communicating with many devices simultaneously.

The I/O processor takes its instructions from the operating system, but once it begins processing I/O, it runs independently, relieving the operating system (and the CPU) of the task of communicating with secondary storage devices. This allows I/O processes and internal computing to overlap.†

In a typical computer, the file manager might now tell the I/O processor that there is data in the buffer that is to be transmitted to the disk, how much data there is, and where it is to go on the disk. This information might come in the form of a little program that the operating system constructs and the I/O processor executes (Fig. 3.16).

The job of actually controlling the operation of the disk is done by a device called a disk controller. The I/O processor asks the disk controller if the disk drive is available for writing. If there is much I/O processing, there is a good chance that the drive will not be available and that our byte will have to wait in its buffer until the drive becomes available.

What happens next often makes the time spent so far seem insignificant in comparison: The disk drive is instructed to move its read/write head to the track and sector on the drive where our byte and its companions are to be stored. For the first time, a device is being asked to do something mechanical! The read/write head must seek to the proper track (unless it is already there), and then wait until the disk has spun around so the desired sector is under the head. Once the track and sector are located, the I/O processor (or perhaps the controller) can send out bytes, one at a time, to the drive. Our byte waits until its turn comes, then travels, alone, to the drive, where it probably is stored in a little one-byte buffer while it waits to be deposited on the disk.

†On many systems the I/O processor can take data directly from RAM, without further involvement from the CPU. This process is called direct memory access (DMA). On other systems, the CPU must place the data in special I/O registers before the I/O processor can have access to it.

FIGURE 3.16 The file manager sends the I/O processor instructions in the form of an I/O processor program. The I/O processor gets the data from the system buffer, prepares it for storing on the disk, and then sends it to the disk controller, which deposits it on the surface of the disk.

Finally, as the disk spins under the read/write head, the eight bits of our byte are deposited, one at a time, on the surface of the disk (Fig. 3.16). There the 'P' remains, at the end of its journey, spinning about at a leisurely 50 to 100 miles per hour.

3.6 Buffer Management

Any user of files can benefit from some knowledge of what happens to data travelling between a program's data area and secondary storage. One aspect of this process that is particularly important is the use of buffers. Buffering involves working with large chunks of data in RAM so the number of accesses to secondary storage can be reduced. We concentrate on the operation of system I/O buffers, but be aware that the use of buffers within programs can also substantially affect performance.

3.6.1 Buffer Bottlenecks

We know that a file manager allocates I/O buffers that are big enough to hold incoming data, but we have said nothing so far about how many buffers are used. In fact, it is common for file managers to allocate several buffers for performing I/O.
To understand the need for several system buffers, consider what
happens if a program is performing both input and output on one character
at a time, and only one I/O buffer is available. When the program asks for
its first character, the I/O buffer is loaded with the sector containing the
character, and the character is transmitted to the program. If the program
then decides to output a character, the I/O buffer is filled with the sector
into which the output character needs to go, destroying its original
for

contents.

Then when

the next input character

is

needed, the buffer contents

have to be written to disk to make room for the (original) sector containing
the second input character, and so on.
Fortunately, there is a simple and generally effective solution to this
ridiculous state of affairs, and that is to use more than one system buffer.
For this reason, I/O systems almost always use at least two buffers: one for input and one for output.
Even if a program transmits data in only one direction, the use of a
single system I/O buffer can slow it down considerably. We know, for
instance, that the operation of reading a sector from a disk is extremely slow
compared to the amount of time it takes to move data in RAM, so we can
guess that a program that reads many sectors from a file might have to

spend much of its time waiting for the I/O system to fill its buffer every
time a read operation is performed before it can begin processing. When this
happens, the program that is running is said to be I/O bound: the CPU
spends much of its time just waiting for I/O to be performed. The solution
to this problem is to use more than one buffer and to have the I/O system

filling the

next sector or block of data while the

CPU

is

processing the

current one.

3.6.2 Buffering Strategies


Multiple Buffering    Suppose that a program is only writing to a disk and that it is I/O bound. The CPU wants to be filling a buffer at the same time that I/O is being performed. If two buffers are used and I/O-CPU overlapping is permitted, the CPU can be filling one buffer while the contents of the other are being transmitted to disk. When both tasks are finished, the roles of the buffers can be exchanged. This technique of swapping the roles of two buffers after each output (or input) operation is called double buffering. Double buffering allows the operating system to be operating on one buffer while the other buffer is being loaded or emptied (Fig. 3.17).
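As a concrete, if simplified, illustration, here is a sketch of the double-buffering pattern in C, using the POSIX asynchronous I/O call aio_write( ) so that the write of one buffer can proceed while the program fills the other. The block size, file name, and data are invented for the example; a real file manager would do this inside the operating system rather than in a user program.

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    #define BUF_SIZE 4096

    static char buffers[2][BUF_SIZE];

    int main(void)
    {
        int fd = open("outfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        struct aiocb cb;
        int filling = 0;        /* buffer the program is filling          */
        int pending = 0;        /* is an asynchronous write outstanding?  */
        off_t offset = 0;

        for (int block = 0; block < 8; block++) {
            /* Fill one buffer; the other may still be being written. */
            memset(buffers[filling], 'A' + block, BUF_SIZE);

            if (pending) {                      /* wait for the earlier write */
                while (aio_error(&cb) == EINPROGRESS)
                    ;                           /* real code would do other work */
                aio_return(&cb);
            }

            memset(&cb, 0, sizeof cb);          /* start writing this buffer  */
            cb.aio_fildes = fd;
            cb.aio_buf    = buffers[filling];
            cb.aio_nbytes = BUF_SIZE;
            cb.aio_offset = offset;
            aio_write(&cb);
            pending = 1;

            offset += BUF_SIZE;
            filling = 1 - filling;              /* swap the roles of the buffers */
        }

        if (pending) {
            while (aio_error(&cb) == EINPROGRESS)
                ;
            aio_return(&cb);
        }
        close(fd);
        return 0;
    }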

The idea of swapping system buffers to allow processing and I/O to overlap need not be restricted to two buffers. In theory, any number of buffers can be used, and they can be organized in a variety of ways. The actual management of system buffers is usually done by the operating system and can rarely be controlled by programmers who do not work at the systems level. It is common, however, for users to be able to control the number of system buffers assigned to jobs.

Some file systems use a buffering scheme called buffer pooling: When a system buffer is needed, it is taken from a pool of available buffers and used. When the system receives a request to read a certain sector or block, it looks to see if one of its buffers already contains that sector or block. If no buffer contains it, then the system finds from its pool of buffers one that is not currently in use and loads the sector or block into it.

FIGURE 3.17 Double buffering: (a) The contents of system I/O buffer 1 are sent to disk while I/O buffer 2 is being filled; and (b) the contents of buffer 2 are sent to disk while I/O buffer 1 is being filled.

Several different schemes are used to decide which buffer to take from a buffer pool. One generally effective strategy is to take the buffer that is least recently used. When a buffer is accessed, it is put on a least-recently-used queue, so it is allowed to retain its data until all other less-recently-used buffers have been accessed. The least-recently-used (LRU) strategy for replacing old data with new data has many applications in computing. It is based on the assumption that a block of data that has been used recently is more likely to be needed in the near future than one that has been used less recently. (We encounter LRU again in later chapters.)

It is difficult to predict the point at which the addition of extra buffers ceases to contribute to improved performance. As the cost of RAM continues to decrease, so does the cost of using more and bigger buffers. On the other hand, the more buffers there are, the more time it takes for the file system to manage them. When in doubt, consider experimenting with different numbers of buffers.
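A minimal sketch of the buffer-pool idea with least-recently-used replacement, in C. The pool size, block size, linear search, and the assumed read_block_from_disk( ) routine are simplifications chosen for clarity; write-back of modified buffers is omitted.

    #define POOL_SIZE   8
    #define BLOCK_SIZE  4096

    struct buffer {
        long block_no;               /* which disk block is cached; -1 = empty */
        long last_used;              /* logical clock value of last access     */
        char data[BLOCK_SIZE];
    };

    static struct buffer pool[POOL_SIZE];
    static long clock_tick = 0;

    /* Assumed to exist elsewhere: reads a block from the disk into memory. */
    extern void read_block_from_disk(long block_no, char *dest);

    void init_pool(void)
    {
        for (int i = 0; i < POOL_SIZE; i++) {
            pool[i].block_no = -1;
            pool[i].last_used = 0;
        }
    }

    /* Return a buffer holding the requested block, reading it from disk
       into the least-recently-used buffer if it is not already pooled. */
    char *get_block(long block_no)
    {
        int victim = 0;

        for (int i = 0; i < POOL_SIZE; i++)
            if (pool[i].block_no == block_no) {     /* already buffered */
                pool[i].last_used = ++clock_tick;
                return pool[i].data;
            }

        for (int i = 1; i < POOL_SIZE; i++)         /* pick the LRU buffer */
            if (pool[i].last_used < pool[victim].last_used)
                victim = i;

        read_block_from_disk(block_no, pool[victim].data);
        pool[victim].block_no  = block_no;
        pool[victim].last_used = ++clock_tick;
        return pool[victim].data;
    }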

Move Mode and Locate Mode    Sometimes it is not necessary to distinguish between a program's data area and system buffers. When data must always be copied from a system buffer to a program buffer (or vice versa), the amount of time taken to perform the move can be substantial. This way of handling buffered data is called move mode, since it involves moving chunks of data from one place in RAM to another before they can be accessed.

There are two ways that move mode can be avoided. If the file manager can perform I/O directly between secondary storage and the program's data area, no extra move is necessary. Alternatively, the file manager could use system buffers to handle all I/O, but provide the program with the locations, through the use of pointer variables, of the system buffers. Both techniques are examples of a general approach to buffering called locate mode. When locate mode is used, a program is able to operate directly on data in the I/O buffer, eliminating the need to transfer data between an I/O buffer and a program buffer.
Scatter/Gather I/O    Suppose you are reading in a file with many blocks, where each block consists of a header followed by data. You would like to put the headers in one buffer and the data in a different buffer so the data can be processed as a single entity. The obvious way to do this is to read the whole block into a single big buffer, and then move the different parts to their own buffers. Sometimes we can avoid this two-step process using a technique called scatter input. With scatter input, a single READ call identifies not one, but a collection of buffers into which data from a single block is to be scattered.

The converse of scatter input is gather output. With gather output, several buffers can be gathered and written with a single WRITE call, avoiding the need to copy them to a single output buffer. When the cost of copying several buffers into a single output buffer is high, scatter/gather can have a significant effect on the running time of a program.

It is not always obvious when features like scatter/gather, locate mode, and buffer pooling are available in an operating system. You often have to go looking for them. Sometimes you can invoke them by communicating with your operating system, and sometimes you can cause them to be invoked by organizing your program in ways that are compatible with the way the operating system does I/O. Throughout this text we return many times to the issue of how to enhance performance by thinking about how buffers work and adapting programs and file structures accordingly.
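On UNIX systems, scatter input is typically available through the readv( ) system call, and gather output through its counterpart writev( ); both take an array of buffer descriptions rather than a single buffer. The sketch below assumes a block made up of a 16-byte header followed by 512 bytes of data; the sizes and the file name are invented for the example.

    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/uio.h>
    #include <unistd.h>

    #define HEADER_SIZE  16
    #define DATA_SIZE   512

    int main(void)
    {
        char header[HEADER_SIZE];
        char data[DATA_SIZE];
        struct iovec iov[2];
        int fd = open("blockfile", O_RDONLY);
        if (fd < 0)
            return 1;

        /* Describe the two target buffers: the header is scattered into
           one and the data into the other, with a single read operation. */
        iov[0].iov_base = header;
        iov[0].iov_len  = HEADER_SIZE;
        iov[1].iov_base = data;
        iov[1].iov_len  = DATA_SIZE;

        ssize_t n = readv(fd, iov, 2);
        printf("%zd bytes read into two separate buffers\n", n);

        close(fd);
        return 0;
    }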

3.7 I/O in UNIX

We see in the journey of a byte that we can view I/O as proceeding through several layers. UNIX provides a good example of how these layers occur in a real operating system, so we conclude this chapter with a look at UNIX. It is of course beyond the scope of this text to describe the UNIX I/O layers in detail. Rather, our objective here is just to pick a few features of UNIX that illustrate points made in the text. A secondary objective is to familiarize you with some of the important terminology used in describing UNIX systems. For a comprehensive, detailed look at how UNIX works, plus a thorough discussion of the design decisions involved in creating and improving UNIX, see Leffler et al. (1989).

3.7.1 The Kernel


In Fig. 3.14 we see how the process of transmitting data from a program to an external device can be described as proceeding through a series of layers. The topmost layer deals with data in logical, structural terms. We store in a file a name, a body of text, an image, an array of numbers, or some other logical entity. This reflects the view that an application has of what goes into a file. The layers that follow collectively carry out the task of turning the logical object into a collection of bits on a physical device.

Likewise, the topmost I/O layer in UNIX deals with data primarily in logical terms. This layer in UNIX consists of processes that impose certain logical views on files.

FIGURE 3.18 Kernel I/O structure. [At the top are processes: user programs, shell commands, and libraries. They communicate with the kernel through the system call interface. Within the kernel are the block I/O system (normal files), the character I/O system (terminals, printers, etc.), and the network I/O system (sockets), which rest on block device drivers, character device drivers, and network interface drivers. Below the kernel is the hardware: disks, consoles, printers, networks, and so on.]

Processes are associated with solving some problem, such as counting the words in a file or searching for somebody's address. Processes include shell routines like cat and tail, user programs that operate on files, and library routines like scanf( ) and fread( ) that are called from programs to read strings, numbers, and so on.

Below this layer is the UNIX kernel, which incorporates all the rest of the layers.† The components of the kernel that do I/O are illustrated in Fig. 3.18.

The UNIX kernel views all I/O as operating on a sequence of bytes, so once we pass control to the kernel, all assumptions about the logical view of a file are gone. The decision to design UNIX in this way, making all operations below the top layer independent of an application's logical view of a file, is unusual. It is also one of the main attractions in choosing UNIX as a focus for this text, for UNIX lets us make all of the decisions about the logical structure of a file, imposing no restrictions on how we think about the file beyond the fact that it must be built from a sequence of bytes.

†It is beyond the scope of this text to describe the UNIX kernel in detail. For a full description of the UNIX kernel, including the I/O system, see Leffler et al. (1989).
Let's illustrate the journey
in this chapter

by tracing the

of a byte through the kernel, as we did earlier


of an I/O statement. We assume in this
character to disk. This corresponds to the left

results

example that we are writing a


branch of the I/O system in Fig. 3.18.
When your program executes a system call such as

    write (fd, &c, 1);

the kernel is invoked immediately.^ The routines that let processes communicate directly with the kernel make up the system call interface. In this case, the system call instructs the kernel to write a character to a file.

The kernel I/O system begins by connecting the file descriptor (fd) in your program to some file or device in the filesystem. It does this by proceeding through a series of four tables that enable the kernel to find its way from the process to the places on the disk that will hold the file that they refer to. The four tables are

    a file descriptor table;
    an open file table, with information about open files;
    a file allocation table, which is part of a structure called an index node; and
    a table of index nodes, with one entry for each file in use.

Although these tables are managed by the kernel's I/O system, they are, in a sense, "owned" by different parts of the system:

    The file descriptor table is owned by the process (your program).
    The open file table and index node tables are owned by the kernel.
    The index node itself is part of the filesystem.

The four tables are invoked in turn by the kernel to get the information it needs to write to your file on disk. Let's see how this works by looking at the functions of the tables.

The file descriptor table (Fig. 3.19a) is a simple table that associates each of the file descriptors used by a process with an entry in another table, the open file table. Every process has its own descriptor table, which includes entries for all files it has opened, including the "files" STDIN, STDOUT, and STDERR.

^This should not be confused with a library call, such as fprintf( ), which invokes the standard library to perform some additional operations on the data, such as converting it to an ASCII format, and then makes a corresponding system call.
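To make the distinction drawn in the footnote concrete, here is a minimal C sketch, not taken from the text, that writes the same character both ways. The file name and the lack of error handling are illustrative assumptions.

/* syscall_vs_library.c -- a hedged sketch, not part of the original text.
 * It contrasts a direct write() system call with a buffered fprintf()
 * library call; the file name "demo.txt" is an arbitrary choice.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char c = 'P';

    /* System call path: the kernel is invoked immediately. */
    int fd = open("demo.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return 1;
    write(fd, &c, 1);        /* one byte goes straight to the kernel */
    close(fd);

    /* Library call path: fprintf() buffers in the standard library and
     * only later makes the corresponding write() system call.         */
    FILE *fp = fopen("demo.txt", "a");
    if (fp == NULL)
        return 1;
    fprintf(fp, "%c", c);    /* may sit in a user-level buffer for a while  */
    fclose(fp);              /* flushing the buffer triggers the system call */
    return 0;
}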

[FIGURE 3.19 Descriptor table and open file table. Part (a) shows a file descriptor table, in which each file descriptor entry points to an entry in the open file table. Part (b) shows an open file table entry, which records the read/write mode, the number of processes using it, the offset of the next access, and a pointer to the write() routine for this type of file.]
The open file table (Fig. 3.19b) contains entries for every open file. Every time a file is opened or created, a new entry is added to the open file table. These entries are called file structures, and they contain important information about how the corresponding file is to be used, such as the read/write mode used when it was opened, the number of processes currently using it, and the offset within the file to be used for the next read or write. The open file table also contains an array of pointers to generic functions that can be used to operate on the file. These functions will differ depending on the type of file.

It is possible for several different processes to refer to the same open file table entry, so one process could read part of a file, another process could read the next part, and so forth, with each process taking over where the previous one stopped. On the other hand, if the same file is opened by two separate open( ) statements, two separate entries are made in the table, and the two processes operate on the file quite independently.^

The information in the open file table is transitory. It tells the kernel what it can do with a file that has been opened in a certain way and provides information on how it can operate on the file. The kernel still needs more information about the file itself, such as where the file is stored on disk, how big the file is, and who owns it. This information is found in an index node, more commonly referred to as an inode (Fig. 3.20).

An inode is a more permanent structure than an open file table's file structure. A file structure exists only while a file is open for access, but an inode exists as long as its corresponding file exists. For this reason, a file's inode is kept on disk with the file (though not physically adjacent to the file). When a file is opened, a copy of its inode is usually loaded into RAM, where it is added to the aforementioned inode table for rapid access.

For the purposes of our discussion, the most important component of the inode is a list (index) of the disk blocks that make up the file. This list is the UNIX counterpart to the file allocation table that we described earlier in this chapter.*

Once the kernel's I/O system has the inode information, it knows all that it needs to know about the file. It then invokes an I/O processor program that is appropriate for the type of data, the type of operation, and the type of device that is to be written. In UNIX, this program is called a device driver.

The device driver sees that your data is moved from its buffer to its proper place on disk. Before we look at the role of device drivers in UNIX, it is instructive to look at how the kernel distinguishes among the different kinds of files that it must deal with.

^Of course, there are risks in letting this happen. If you are writing to a file with one process at the same time that you are independently reading from the file with another, the meaning of these operations may be difficult to determine.

*This might not be a simple linear array. To accommodate both large and small files, this table often has a dynamic, tree-like structure.
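The point about separate open( ) calls producing separate open file table entries can be observed directly from a program. The following is a minimal C sketch, not from the text; the file name and the assumption that it already contains a few bytes are illustrative.

/* two_opens.c -- a hedged sketch illustrating that two separate open()
 * calls on the same file get independent open file table entries, and
 * therefore independent offsets. The file "demo.txt" is assumed to
 * exist and to contain at least one byte.
 */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char a, b;
    int fd1 = open("demo.txt", O_RDONLY);
    int fd2 = open("demo.txt", O_RDONLY);   /* second, independent entry */

    if (fd1 < 0 || fd2 < 0)
        return 1;

    read(fd1, &a, 1);   /* advances only fd1's offset */
    read(fd2, &b, 1);   /* fd2 still starts at byte 0 */

    /* Both reads return the first byte of the file, because each
     * descriptor refers to its own file structure with its own offset. */
    printf("fd1 read '%c', fd2 read '%c'\n", a, b);

    close(fd1);
    close(fd2);
    return 0;
}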

[FIGURE 3.20 An inode. The inode is the data structure used by UNIX to describe the file. It includes the device containing the file, permissions, owner and group IDs, file size, block count, and the file allocation table, among other things.]

3.7.2 Linking File Names to Files

It is instructive to look a little more closely at how a file name is actually linked to the corresponding file. All references to files begin with a directory, for it is in directories that file names are kept. In fact, a directory is just a small file that contains, for each file, a file name together with a pointer to the file's inode on disk.^ This pointer from a directory to the inode of a file is called a hard link. It provides a direct reference from the file name to all other information about the file. When a file is opened, this hard link is used to bring the inode into RAM and to set up the corresponding entry in the open file table.

It is possible for several file names to point to the same inode, so one file can have several different names. A field in the inode tells how many hard links there are to the inode. This means that if a file name is deleted and there are other file names for the same file, the file itself is not deleted; its inode's hard-link count is just decremented by one.

There is another kind of link, called a soft link, or symbolic link. A symbolic link links a file name to another file name, rather than to an actual file. Instead of being a pointer to an inode, a soft link is a pathname of some file. Since the symbolic link does not point to an actual file, it can refer to a directory or even to a file in a different file system. Symbolic links are not supported on all UNIX systems. UNIX System 4.3BSD supports symbolic links, but System V does not.

^The actual structure of a directory is a little more complex than this, but these are the essential parts. See Leffler, et al. (1989) for details.
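As a concrete illustration of the two kinds of links, here is a small C sketch, not from the text. It assumes a file named demo.txt already exists; as noted above, symlink() is available on 4.3BSD-style systems but not on every UNIX.

/* links.c -- a hedged sketch of hard links and symbolic links.
 * Assumes "demo.txt" already exists; symlink() may be missing on
 * UNIX systems that do not support symbolic links.
 */
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Hard link: a second directory entry pointing at the same inode.
     * The inode's hard-link count goes up by one.                     */
    if (link("demo.txt", "demo_hard.txt") != 0)
        perror("link");

    /* Symbolic link: a new file whose contents are just the pathname
     * "demo.txt"; it points to a name, not to an inode.               */
    if (symlink("demo.txt", "demo_soft.txt") != 0)
        perror("symlink");

    /* Removing the original name leaves demo_hard.txt usable (the inode
     * survives), while demo_soft.txt now points to a name that is gone. */
    unlink("demo.txt");
    return 0;
}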

3.7.3 Normal Files, Special Files, and Sockets

The "everything is a file" concept in UNIX works only when we recognize that some files are quite a bit different from others. We see in Fig. 3.18 that the kernel distinguishes among three different types of files. Normal files are the files that this text is about. Special files almost always represent a stream of characters and control signals that drive some device, such as a line printer or a graphics device. The first three file descriptors in the descriptor table (Fig. 3.19a) are special files. Sockets are abstractions that serve as endpoints for interprocess communication.

At a certain conceptual level, these three different types of UNIX files are very similar, and many of the same routines can be used to access any of them. For instance, you can establish access to all three types by opening them, and you can write to them with the write( ) system call.
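A short sketch, offered here only as an illustration under assumed names, shows the same write() call used on a special file (the standard error stream, descriptor 2) and on a normal file.

/* write_types.c -- a hedged sketch: the same write() system call works
 * on a special file (standard error, descriptor 2) and on a normal file.
 * The file name "demo.txt" is an arbitrary choice.
 */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    char msg[] = "hello\n";

    write(2, msg, sizeof(msg) - 1);                 /* special file */

    int fd = open("demo.txt", O_WRONLY | O_CREAT, 0644);
    if (fd >= 0) {
        write(fd, msg, sizeof(msg) - 1);            /* normal file  */
        close(fd);
    }
    return 0;
}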

3.7.4 Block I/O

In Fig. 3.18, we see that the three different types of files access their respective devices via three different I/O systems: the block I/O system, the character I/O system, and the network I/O system. Henceforth we ignore the second and third categories, since it is normal file I/O that we are most concerned with in this text.^

The block I/O system is the UNIX counterpart of the file manager in the journey of a byte. It concerns itself with how to transmit normal file data, viewed by the user as a sequence of bytes, onto a block-oriented device like a disk or tape. Given a byte to store on a disk, for example, it arranges to read in the sector containing the byte to be replaced, to replace the byte, and to write the sector back to the disk.

The UNIX view of a block device most closely resembles that of a disk. It is a randomly addressable array of fixed blocks. Originally all blocks were 512 bytes, which was the common sector size on most disks. No other organization (such as clusters) was imposed on the placement of files on disk. (In section 3.1.7 we saw how the design of later UNIX systems dealt with this convention.)

^This is not entirely true. Sockets, for example, can be used to move normal files from place to place. In fact, high-performance network systems bypass the normal file system in favor of sockets to squeeze every bit of performance out of the network.
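The read-replace-write cycle just described can be sketched at the system call level. This is a simplified illustration under assumed values (a 512-byte sector, an arbitrary byte offset, a file name demo.dat); a real block I/O system performs this inside the kernel with its own buffers.

/* rmw.c -- a hedged sketch of the read-modify-write cycle the block I/O
 * system performs when a single byte is replaced. Sector size, offset,
 * and file name are illustrative assumptions.
 */
#include <fcntl.h>
#include <unistd.h>

#define SECTOR 512

int main(void)
{
    char buf[SECTOR];
    long byte_offset = 1000;                    /* byte we want to change  */
    long sector_start = (byte_offset / SECTOR) * SECTOR;

    int fd = open("demo.dat", O_RDWR);
    if (fd < 0)
        return 1;

    lseek(fd, sector_start, SEEK_SET);          /* seek to sector boundary  */
    read(fd, buf, SECTOR);                      /* read in the whole sector */

    buf[byte_offset - sector_start] = 'X';      /* replace the one byte     */

    lseek(fd, sector_start, SEEK_SET);
    write(fd, buf, SECTOR);                     /* write the sector back    */

    close(fd);
    return 0;
}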

3.7.5 Device Drivers

For each peripheral device there is a separate set of routines, called a device driver, that performs the actual I/O between the I/O buffer and the device. A device driver is roughly equivalent to the I/O processor program described in the journey of a byte.

Since the block I/O system views a peripheral device as an array of physical blocks, addressed as block 0, block 1, etc., a block I/O device driver's job is to take a block from a buffer, destined for one of these physical blocks, and see that it gets deposited in the proper physical place on the device. This saves the block I/O part of the kernel from having to know anything about the specific device it is writing to, other than its identity and that it is a block device. A thorough discussion of device drivers for block, character, and network I/O can be found in Leffler et al. (1989).

3.7.6 The Kernel and Filesystems

In Chapter 2 we described the UNIX concept of a filesystem. A UNIX filesystem is a collection of files, together with secondary information about the files in the system. A filesystem includes the directory structure, the directories, ordinary files, and the inodes that describe the files.

In our discussions we talk about the filesystem as if it is part of the kernel's I/O system, which it is, but it is also in a sense separate from it. All parts of a filesystem reside on disk, rather than in RAM where the kernel does its work. These parts are brought into RAM by the kernel as needed. This separation of the filesystem from the kernel has many advantages. One important advantage is that we can tune a filesystem to a particular device or usage pattern independently of how the kernel views files. The discussions in section 3.1.7 of 4.3BSD block organization are file-system concerns, for example, and need not have any effect on how the kernel works.

Another advantage of keeping the filesystem and I/O system distinct is that we can have separate filesystems that are organized differently, perhaps on different devices, but are accessible by the same kernel. In Appendix A, for instance, we describe the design of a filesystem on CDROM that is organized quite differently from a typical disk-based filesystem yet looks just like any other filesystem to the user and to the I/O system.

3.7.7 Magnetic Tape and UNIX

Important as it is to computing, magnetic tape is somewhat of an orphan in the UNIX view of I/O. A magnetic tape unit has characteristics similar to both block I/O devices (being block oriented) and character devices (being primarily used for sequential access), but does not fit nicely into either category. Character devices read and write streams of data, not blocks, and block devices in general access blocks randomly, not sequentially.

Since block I/O is generally the least inappropriate of the two inappropriate paradigms for tape, a tape device is normally considered in UNIX to be a block I/O device and hence is accessed through the block I/O interface. But because the block I/O interface is most often used to write to random-access devices, disks, it does not require blocks to be written in sequence, as they must be written to a tape. This problem is solved by allowing only one write request at a time per tape drive. When high-performance I/O is required, the character device interface can be used in a raw mode to stream data to tapes, bypassing the stage that requires the data to be collected into relatively small blocks before or after transmission.

SUMMARY

In this chapter we look at the software environment in which file processing programs must operate and at some of the hardware devices on which files are commonly stored, hoping to understand how they influence the ways we design and process files. We begin by looking at the two most common storage media: magnetic disks and tapes.

A disk drive consists of a set of read/write heads that are interspersed among one or more platters. Each platter contributes one or two surfaces, each surface contains a set of concentric tracks, and each track is divided into sectors or blocks. The set of tracks that can be read without moving the read/write heads is called a cylinder.

There are two basic ways to address data on disks: by sector and by block. Used in this context, the term block refers to a group of records that are stored together on a disk and treated as a unit for I/O purposes. When blocks are used, the user is better able to make the physical organization of data correspond to its logical organization, and hence can sometimes improve performance. Block-organized drives also sometimes make it possible for the disk drive to search among blocks on a track for a record with a certain key without first having to transmit the unwanted blocks into RAM.
Three possible disadvantages of block-organized devices are the danger of internal track fragmentation, the burden of dealing with the extra complexity that the user has to bear, and the loss of opportunities to do some of the kinds of synchronization (such as sector interleaving) that sector-addressing devices provide.

The cost of a disk access can be measured in terms of the time it takes for seeking, rotational delay, and transfer time. If sector interleaving is used, it is possible to access logically adjacent sectors by separating them physically by one or more sectors. Although it takes much less time to access a single record directly than sequentially, the extra seek time required for doing direct accesses makes it much slower than sequential access when a series of records is to be accessed.

Despite increasing disk performance, network speeds have improved to the point that disk access is often a significant bottleneck in an overall I/O system.

A number of techniques are available for addressing this problem, including striping, the use of RAM disks, and disk caching.

Research done in connection with BSD UNIX shows that block size can have a major effect on performance. By increasing the default block size from 512 bytes to 4,096 bytes, throughput was improved enormously, especially for large files, because eight times as much data could be transferred in a single access. A negative consequence of this reorganization was that wasted storage increased from 6.9% for 512-byte blocks to 45.6% for 4,096-byte blocks. It turned out that this problem of wasted space could be dealt with by treating the 4,096-byte blocks as clusters of 512-byte blocks, which could be allocated to different files.

Though not as important as disks, magnetic tape has an important niche in file processing. Tapes are inexpensive, reasonably fast for sequential processing, compact, robust, and easy to store and transport. Data are usually organized on tapes in one-bit-wide parallel tracks, with a bit-wide cross-section of tracks interpreted as one or more bytes. When estimating processing speed and space utilization, it is important to recognize the role played by the interblock gap. Effective recording density and effective transmission rate are useful measurements of the performance one can expect to achieve for a given physical file organization.

In comparing disk and tape as secondary storage media, we see that disks are replacing tape in more and more cases. This is largely because RAM is becoming less expensive, relative to secondary storage, which means that one of the earlier advantages of tape over disk, the ability to do sequential access without seeking, has diminished significantly.

This chapter follows the journey of a byte as it is sent from RAM to disk. The journey involves the participation of many different programs and devices, including

    a user's program, which makes the initial call to the operating system;
    the operating system's file manager, which maintains tables of information that it uses to translate between the program's logical view of the file and the physical file where the byte is to be stored;
    an I/O processor and its software, which transmit the byte, synchronizing the transmission of the byte between an I/O buffer in RAM and the disk;
    the disk controller and its software, which instruct the drive about how to find the proper track and sector, then send the byte; and
    the disk drive, which accepts the byte and deposits it on the disk surface.

Next, we take a closer look at buffering, focusing mainly on techniques for managing buffers to improve performance. Some techniques include double buffering, buffer pooling, locate-mode buffering, and scatter/gather buffering.

We conclude with a second look at I/O layers, this time concentrating on UNIX. We see that every I/O system call begins with a call to the UNIX kernel, which knows nothing about the logical structure of a file, treating all data essentially the same: as a sequence of bytes to be transmitted to some external device. In doing its work the I/O system in the kernel invokes four tables: a file descriptor table, an open file table, an inode table, and a file access table in the file's inode. Once the kernel has determined which device to use and how to access it, it calls on a device driver to carry out the actual accessing.

Although it treats every file as a sequence of bytes, the kernel I/O system deals differently with three different types of I/O: block I/O, character I/O, and network I/O. In this text we concentrate on block I/O. We look briefly at the special role of the filesystem within the kernel, describing how it uses links to connect file names in directories to their corresponding inodes. Finally, we remark on the reasons that magnetic tape does not fit well into the UNIX paradigm for I/O.

describing

KEY TERMS

bpi. Bits per inch per track. On a disk, data is recorded serially on tracks. On a tape, data are recorded in parallel on several tracks, so a 6,250-bpi nine-track tape contains 6,250 bytes per inch, when all nine tracks are taken into account (one track being used for parity).

Block. Unit of data organization corresponding to the amount of data transferred in a single access. Block often refers to a collection of records, but it may be a collection of sectors (see cluster) whose size has no correspondence to the organization of the data. A block is sometimes called a physical record; a sector is sometimes called a block.

Block device. In UNIX, a device such as a disk drive that is organized in blocks and accessed accordingly.

Block I/O. I/O between a computer and a block device.

Block organization. Disk drive organization that allows the user to define the size and organization of blocks, and then access a block by giving its block address or the key of one of its records. (See sector organization.)

Blocking factor. The number of records stored in one block.

Character device. In UNIX, a device such as a keyboard or printer (or tape drive when stream I/O is used) that sends or receives data in the form of a stream of characters.


Character I/O. I/O between a computer and a character device.
Cluster. Minimum unit of space allocation on a sectored disk, consisting
of one or more contiguous sectors. The use of large clusters can improve sequential access times by guaranteeing the ability to read
longer spans of data without seeking. Small clusters tend to decrease
internal fragmentation.

Controller. Device that directly controls the operation of one or more


secondary storage devices, such as disk drives and magnetic tape
units.

Count subblock. On block-organized

drives, a small block that pre-

cedes each data block and contains information about the data block,

such as its byte count and its address.


Cylinder. The set of tracks on a disk that are directly above and below
each other. All of the tracks in a given cylinder can be accessed without having to move the access arm; that is, they can be accessed
without the expense of seek time.
Descriptor table. In UNIX, a table associated with a single process that
links all of the file descriptors generated by that process to corresponding entries in an open file table.
Device driver. In UNIX, an I/O processor program invoked by the
kernel that performs I/O for a particular device.
Direct access storage device (DASD). Disk or other secondary storage device that permits access to a specific sector or block of data

without first requiring the reading of the blocks that precede it.
Direct memory access (DMA). Transfer of data directly between RAM and peripheral devices, without significant involvement by the CPU.

Disk cache. A segment of RAM configured to contain pages of data from a disk. Disk caches can lead to substantial improvements in access time when access requests exhibit a high degree of locality.

Disk pack. An assemblage of magnetic disks mounted on the same vertical shaft. A pack of disks is treated as a single unit consisting of a number of cylinders equivalent to the number of tracks per surface. If disk packs are removable, different packs can be mounted on the same drive at different times, providing a convenient form of offline storage for data that can be accessed directly.

Effective recording density. Recording density after taking into account the space used by interblock gaps, nondata subblocks, and other space-consuming items that accompany data.

Effective transmission rate. Transmission rate after taking into account the time used to locate and transmit the block of data in which a desired record occurs.

Extent. One or more adjacent clusters allocated as part (or all) of a file. The number of extents in a file reflects how dispersed the file is over the disk. The more dispersed a file, the more seeking must be done in moving from one part of the file to another.

File allocation table (FAT). A table that contains mappings to the physical locations of all the clusters in all files on disk storage.

manager. The part of an operating system that is responsible for


managing files, including a collection of programs whose responsibilities range from keeping track of files to invoking I/O processes that
transmit information between primary and secondary storage.

File structure. In connection with the open file table in a UNIX kernel, the term file structure refers to a structure that holds information the kernel needs about an open file. File structure information includes such things as the file's read/write mode, the number of processes currently using it, and the offset within the file to be used for the next read or write.

Filesystem. In UNIX, a hierarchical collection of files, usually kept on a single secondary device, such as a hard disk or CD-ROM.

Fixed disk. A disk drive with platters that may not be removed.

Formatting. The process of preparing a disk for data storage, involving such things as laying out sectors, setting up the disk's file allocation table, and checking for damage to the recording medium.

Fragmentation. Space that goes unused within a cluster, block, track, or other unit of physical storage. For instance, track fragmentation occurs when space on a track goes unused because there is not enough space left to accommodate a complete block.

Frame. A one-bit-wide slice of tape, usually representing a single byte.

Hard link. In UNIX, an entry in a directory that connects a file name to the inode of the corresponding file. There can be several hard links to a single file; hence a file can have several names. A file is not deleted until all hard links to the file are deleted.

Index node. In UNIX, a data structure associated with a file that describes the file. An index node includes such information as a file's type, its owner and group IDs, and a list of the disk blocks that comprise the file. A more common name for index node is inode.

Inode. See index node.

Interblock gap. An interval of blank space that separates sectors, blocks, or subblocks on tape or disk. In the case of tape, the gap provides sufficient space for the tape to accelerate or decelerate when starting or stopping. On both tapes and disks the gaps enable the read/write heads to tell accurately when one sector (or block or subblock) ends and another begins.

Interleaving factor. Since it is often not possible to read physically adjacent sectors of a disk, logically adjacent sectors are sometimes arranged so they are not physically adjacent. This is called interleaving. The interleaving factor refers to the number of physical sectors the next logically adjacent sector is located from the current sector being read or written.

I/O processor. A device that carries out I/O tasks, allowing the CPU to work on non-I/O tasks.

Kernel. The central part of the UNIX operating system.

Key subblock. On block-addressable drives, a block that contains the key of the last record in the data block that follows it, allowing the drive to search among the blocks on a track for a block containing a certain key, without having to load the blocks into primary memory.

Mass storage system. General term applied to storage units with large capacity. Also applied to very high-capacity secondary storage systems that are capable of transmitting data between a disk and any of several thousand tape cartridges within a few seconds.

Nominal recording density. Recording density on a disk track or magnetic tape without taking into account the effects of gaps or nondata subblocks.

Nominal transmission rate. Transmission rate of a disk or tape unit without taking into account the effects of such extra operations as seek time for disks and interblock gap traversal time for tapes.

Open file table. In UNIX, a table owned by the kernel with an entry, called a file structure, for each open file. See file structure.

Parity. An error-checking technique in which an extra parity bit accompanies each byte and is set in such a way that the total number of bits is even (even parity) or odd (odd parity).

Platter. One disk in the stack of disks on a disk drive.

Process. An executing program. In UNIX, several instances of the same program can be executing at the same time, as separate processes. The kernel keeps a separate file descriptor table for each process.

RAM disk. Block of RAM configured to simulate a disk.

Rotational delay. The time it takes for the disk to rotate so the desired sector is under the read/write head.


Scatter/gather I/O. Buffering techniques that involve, on input, scattering incoming data into more than one buffer, and, on output, gathering data from several buffers to be output as a single chunk of data.

Sector. The fixed-sized data blocks that together make up the tracks on certain disk drives. Sectors are the smallest addressable unit on a disk whose tracks are made up of sectors.


Sector organization. Disk drive organization that uses sectors.
Seek time. The time required to move the access arm to the correct cylinder on a disk drive.

Sequential access device. A device, such as a magnetic tape unit or card reader, in which the medium (e.g., tape) must be accessed from the beginning. Sometimes called a serial device.

Socket. In UNIX, a socket is an abstraction that serves as an endpoint of communication within some domain. For example, a socket can be used to provide direct communication between two computers. Although in some ways the kernel treats sockets like files, we do not deal with sockets in this text.

Soft link. See symbolic link.

Special file. In UNIX, the term special file refers to a stream of characters and control signals that drive some device, such as a line printer or a graphics device.

Streaming tape drive. A tape drive whose primary purpose is dumping large amounts of data from disk to tape or from tape to disk.

Subblock. When blocking is used, there are often separate groupings of information concerned with each individual block. For example, a count subblock, a key subblock, and a data subblock might all be present.

Symbolic link. In UNIX, an entry in a directory that gives the pathname of a file. Since a symbolic link is an indirect pointer to a file, it is not as closely associated with the file as a hard link. Symbolic links can point to directories, or even to files in other filesystems.

Track. The set of bytes on a single surface of a disk that can be accessed
without seeking (without moving the access arm). The surface of a
disk can be thought of as a series of concentric circles, with each circle corresponding to a particular position of the access arm and read/
write heads. Each of these circles is a track.
Transfer time. Once the data we want is under the read/write head, we
have to wait for it to pass under the head as we read it. The amount
of time required for this motion and reading is the transfer time.

EXERCISES

1. Determine as well as you can what the journey of a byte would be like on your system. You may have to consult technical reference manuals that describe your computer's file management system, operating system, and peripheral devices. You may also want to talk to local gurus who have experience using your system.

2. Suppose you are writing a list of names to a text file, one name per write statement. Why is it not a good idea to close the file after every write, and then reopen it before the next write?

3. Find out what utility routines for monitoring I/O performance and disk utilization are available on your computer system. If you have a large computing system, there are different routines available for different kinds of users, depending on what privileges and responsibilities they have.

4. When you create or open a file in C or Pascal, you must provide certain information to your computer's file manager so it can handle your file properly. Compared to certain languages, such as PL/I or COBOL, the amount of information you must provide in C or Pascal is very small. Find a text or manual on PL/I or COBOL and look up the ENVIRONMENT file description attribute, which can be used to tell the file manager a great deal about how you expect a file to be organized and used. Compare PL/I or COBOL with C or Pascal in terms of the types of file specifications available to the programmer.

5. Much is said in section 3.1 about how disk space is organized physically to store files. Assume that no such complex organization is used and that every file must occupy a single contiguous piece of a disk, somewhat the way a file is stored on tape. How does this simplify disk storage? What problems does it create?

6. A disk drive uses 512-byte sectors. If a program requests that a 128-byte record be written to disk, the file manager may have to read a sector from the disk before it can write the record. Why? What could you do to decrease the number of times such an extra read is likely to occur?

7. We have seen that some disk operating systems allocate storage space on disks in clusters and/or extents, rather than sectors, so the size of any file must be a multiple of a cluster or extent.
a. What are some advantages and potential disadvantages of this method of allocating disk space?
b. How appropriate would the use of large extents be for an application that mostly involves sequential access of very large files?
c. How appropriate would large extents be for a computing system that serves a large number of C programmers? (C programs tend to be small, so there are likely to be many small files that contain C programs.)
d. The VAX record management system uses a default cluster size of three 512-byte sectors but lets a user reformat a drive with any cluster size from 1 to 65,535 sectors. When might a cluster size larger than three sectors be desirable? When might a smaller cluster size be desirable?
8. In early UNIX systems, inodes were kept together on one part of a disk, while the corresponding data was scattered elsewhere on the disk. Later editions divided disk drives into groups of adjacent cylinders called cylinder groups, in which each cylinder group contains inodes and their corresponding data. How does this new organization improve performance?

9. In early UNIX systems, the minimum block size was 512 bytes, with a cluster size of one. The block size was increased to 1,024 bytes in 4.0BSD, more than doubling its throughput. Explain how this could occur.
10. Draw pictures that illustrate the role of fragmentation in determining the numbers in Table 3.2, section 3.1.7.

11. The IBM 3350 disk drive uses block addressing. The two subblock organizations described in the text are available:

    Count-data, where the extra space used by count subblock and interblock gaps is equivalent to 185 bytes; and
    Count-key-data, where the extra space used by the count and key subblocks and accompanying gaps is equivalent to 267 bytes, plus the key size.

An IBM 3350 has 19,069 usable bytes available per track, 30 tracks per cylinder, and 555 cylinders per drive. Suppose you have a file with 350,000 80-byte records that you want to store on a 3350 drive. Answer the following questions. Unless otherwise directed, assume that the blocking factor is 10 and that the count-data subblock organization is used.
a. How many blocks can be stored on one track? How many records?
b. How many blocks can be stored on one track if the count-key-data subblock organization is used and key size is 13 bytes?
c. Make a graph that shows the effect of block size on storage utilization, assuming count-data subblocks. Use the graph to help predict the best and worst possible blocking factor in terms of storage utilization.
d. Assuming that access to the file is always sequential, use the graph from the preceding question to predict the best and worst blocking factor. Justify your answer in terms of efficiency of storage utilization and processing time.
e. How many cylinders are required to hold the file (blocking factor 10 and count-data format)? How much space will go unused due to internal track fragmentation?
f. If the file were stored on contiguous cylinders and if there were no interference from other processes using the disk drive, the average seek time for a random access of the file would be about 12 msec. Use this rate to compute the average time needed to access one record randomly.
g. Explain how retrieval time for random accesses of records is affected by increasing block size. Discuss trade-offs between storage efficiency and retrieval when different block sizes are used. Make a table with different block sizes to illustrate your explanations.
h. Suppose the file is to be sorted and a shell sort is to be used to sort the file. Since the file is much too large to read into memory, it will be sorted in place, on the disk. It is estimated (Knuth, 1973b, p. 380) that this requires about 15N^1.25 moves of records, where N represents the total number of records in the file. Each move requires a random access. If all of the preceding is true, how long does it take to sort the file? (As you will see, this is not a very good solution. We provide better ones in Chapter 7, which deals with cosequential processing.)

12. A sectored disk drive differs from one with a block organization in that there is less of a correspondence between the logical and physical organization of data records or blocks.

For example, consider the Digital RM05 disk drive, which uses sector addressing. It has 32 512-byte sectors per track, 19 tracks per cylinder, and 823 cylinders per drive. From the drive's (and drive controller's) point of view, a file is just a vector of bytes divided into 512-byte sectors. Since the drive knows nothing about where one record ends and another begins, a record can span two or more sectors, tracks, or cylinders.

One common way that records are formatted on the RM05 is to place a two-byte field at the beginning of each block, giving the number of bytes of data, followed by the data itself. There is no extra gap and no other overhead. Assuming that this organization is used, and that you want to store a file with 350,000 80-byte records, answer the following questions:
a. How many records can be stored on one track if one record is stored per block?
b. How many cylinders are required to hold the file?
c. How might you block records so each physical record access results in 10 actual records being accessed? What are the benefits of doing this?
13. Suppose you have a collection of 500 large images stored in files, one image per file, and you wish to "animate" these images by displaying them in sequence on a workstation at a rate of at least 15 images per second over a high-speed network. Your secondary storage consists of a disk farm with 30 disk drives, and your disk manager permits striping over as many as 30 drives, if you request it. Your drives are guaranteed to perform I/O at a steady rate of 2 megabytes per second. Each image is 3 megabytes in size. Network transmission speeds are not a problem.
a. Describe in broad terms the steps involved in doing such an animation in real time from disk.
b. Describe the performance issues that you have to consider in implementing the animation. Use numbers.
c. How might you configure your I/O system to achieve the desired performance?

14. Consider the 1,000,000-record mailing list file discussed in the text. The file is to be backed up on 2,400-foot reels of 6,250-bpi tape with 0.3-inch interblock gaps. Tape speed is 200 inches per second.
a. Show that only one tape would be required to back up the file if a blocking factor of 50 is used.
b. If a blocking factor of 50 is used, how many extra records could be accommodated on a 2,400-foot tape?
c. What is the effective recording density when a blocking factor of 50 is used?
d. How large does the blocking factor have to be to achieve the maximum effective recording density? What negative results can result from increasing the blocking factor? (Note: An I/O buffer large enough to hold a block must be allocated.)
e. What would be the minimum blocking factor required to fit the file onto the tape?
f. If a blocking factor of 50 is used, how long would it take to read one block, including the gap? What would the effective transmission rate be? How long would it take to read the entire file?
g. How long would it take to perform a binary search for one record in the file, assuming that it is not possible to read backwards on the tape? (Assume that it takes 60 seconds to rewind the tape.) Compare this with the expected average time it would take for a sequential search for one record.
h. We implicitly assume in our discussions of tape performance that the tape drive is always reading or writing at full speed, so no time is lost by starting and stopping. This is not necessarily the case. For example, some drives automatically stop after writing each block. Suppose that the extra time it takes to start before reading a block and to stop after reading the block totals 1 msec, and that the drive must start before and stop after reading each block. How much will the effective transmission rate be decreased due to starting and stopping if the blocking factor is 1? What if it is 50?

15. Why are there interblock gaps on tapes? In other words, why do we not just jam all records into one block?

16. The use of large blocks can lead to severe internal fragmentation of tracks on disks. Does this occur when tapes are used? Explain.

FURTHER READINGS

Many textbooks contain more detailed information on the material covered in this chapter. In the area of operating systems and file management systems, we have found the operating system texts by Deitel (1984), Peterson and Silberschatz (1985), and Madnick and Donovan (1974) useful. Hanson (1982) has a great deal of material on blocking and buffering, secondary storage devices, and performance. Flores's book (1973) on peripheral devices may be a bit dated, but it contains a comprehensive treatment of the subject.

Bohl (1981) provides a thorough treatment of mainframe-oriented IBM DASDs. Chaney and Johnson (1984) wrote a good article on maximizing hard disk performance on small computers. Ritchie and Thompson (1974), Kernighan and Ritchie (1978), Deitel (1984), and McKusick et al. (1984) provide information on how file I/O is handled in the UNIX operating system. The latter provides a good case study of ways in which a filesystem can be altered to provide substantially faster throughput for certain applications. A comprehensive coverage of UNIX I/O from the design perspective can be found in Leffler et al. (1989).

Information on specific systems and devices can often be found in manuals and documentation published by manufacturers. (Unfortunately, information about how software actually works is often proprietary and therefore not available.) If you use a VAX, we recommend the manuals Introduction to the VAX Record Management Services (Digital, 1978), VAX Software Handbook (Digital, 1982), and Peripherals Handbook (Digital, 1981). UNIX users will find it useful to look at the Bell Laboratories' monograph The UNIX I/O System by Dennis Ritchie (1979). Users of IBM PCs will find the Disk Operating System manual (Microsoft, 1983 or later) useful.

Fundamental File Structure Concepts

CHAPTER OBJECTIVES

Introduce file structure concepts dealing with

    Stream files;
    Field and record boundaries;
    Fixed-length and variable-length fields and records;
    Search keys and canonical forms;
    Sequential search;
    Direct access; and
    File access and file organization.

Examine other kinds of file structures in terms of

    Abstract data models;
    Metadata;
    Object-oriented file access; and
    Extensibility.

Examine issues of portability and standardization.

CHAPTER OUTLINE

4.1 Field and Record Organization
    4.1.1 A Stream File
    4.1.2 Field Structures
    4.1.3 Reading a Stream of Fields
    4.1.4 Record Structures
    4.1.5 A Record Structure That Uses a Length Indicator
    4.1.6 Mixing Numbers and Characters: Use of a File Dump

4.2 Record Access
    4.2.1 Record Keys
    4.2.2 A Sequential Search
    4.2.3 UNIX Tools for Sequential Processing
    4.2.4 Direct Access

4.3 More about Record Structures
    4.3.1 Choosing a Record Structure and Record Length
    4.3.2 Header Records

4.4 File Access and File Organization

4.5 Beyond Record Structures
    4.5.1 Abstract Data Models
    4.5.2 More Complex Headers
    4.5.3 Metadata
    4.5.4 Color Raster Images
    4.5.5 Mixing Object Types in One File
    4.5.6 Object-oriented File Access
    4.5.7 Extensibility

4.6 Portability and Standardization
    4.6.1 Factors Affecting Portability
    4.6.2 Achieving Portability

4.1 Field and Record Organization

When we build file structures we are imposing order on data. In this chapter we investigate the many forms that this ordering can take. We begin by looking at the base case: a file organized as a stream of bytes.

4.1.1 A Stream File

Suppose the file we are building contains name and address information. We write a program to accept names and addresses from the keyboard, writing them out as a stream of consecutive bytes to a file with the logical name OUTPUT. The program is described in the pseudocode shown in Fig. 4.1.

Implementations of this program in both C and Pascal, called writstrm.c and writstrm.pas, are provided in the C and Pascal Programs sections at the end of this chapter. You should type in this program, working in either C or Pascal, compile it, and run it. We use it as the basis for a number of experiments, and you can get a better feel for the differences between the file structures we are discussing if you perform the experiments yourself.

PROGRAM: writstrm
    get output file name and open it with the logical name OUTPUT
    get LAST name as input
    while (LAST name has a length > 0)
        get FIRST name, ADDRESS, CITY, STATE and ZIP as input
        write LAST    to the file OUTPUT
        write FIRST   to the file OUTPUT
        write ADDRESS to the file OUTPUT
        write CITY    to the file OUTPUT
        write STATE   to the file OUTPUT
        write ZIP     to the file OUTPUT
        get LAST name as input
    endwhile
    close OUTPUT
end PROGRAM

FIGURE 4.1 Program to write out a name and address file as a stream of bytes.

The following names and addresses are used as input to the program:

    John Ames                  Alan Mason
    123 Maple                  90 Eastgate
    Stillwater, OK 74075       Ada, OK 74820

When we list the output file on our terminal screen, here is what we see:

AmesJohn123MapleStillwaterOK74075MasonAlan90EastgateAdaOK74820

The program writes the information out to the file precisely as specified: as a stream of bytes containing no added information. But in meeting our specifications, the program creates a kind of "reverse Humpty-Dumpty" problem. Once we get all that information together as a single byte stream, there is no way to get it apart again. We have lost the integrity of the fundamental organizational units of our input data; these fundamental units are not the individual characters, but meaningful aggregates of characters, such as "John Ames" or "123 Maple."

When we are working with files, we call these fundamental aggregates fields. A field is the smallest logically meaningful unit of information in a file.^ A field is a logical notion; it is a conceptual tool. A field does not necessarily exist in any physical sense, yet it is important to the file's structure. When we write out our name and address information as a stream of undifferentiated bytes, we lose track of the fields that make the information meaningful. We need to organize the file in some way that lets us keep the information divided into fields.

4.1.2 Field Structures

There are many ways of adding structure to files to maintain the identity of fields. Four of the most common methods are

    Force the fields into a predictable length.
    Begin each field with a length indicator.
    Place a delimiter at the end of each field to separate it from the next field.
    Use a "keyword = value" expression to identify each field and its contents.

Method 1: Fix the Length of Fields    We can force the fields into predictable lengths. Then we can pull them back out of the file simply by counting our way to the end of the field. We can define a struct in C or a record in Pascal to hold these fixed-length fields, as shown in Fig. 4.2. Using this kind of fixed-field length structure changes our output so our sample file looks like that shown in Fig. 4.3(a). Simple arithmetic is sufficient to let us recover the data in terms of the original fields.

One obvious disadvantage of this approach is that adding all the padding required to bring the fields up to a fixed length makes the file much larger. Rather than using 4 bytes to store the last name Ames, we use 10. We can also encounter problems with data that is too long to fit into the allocated amount of space. We could solve this second problem by fixing all the fields at lengths that are large enough to cover all cases, but this would just make the first problem of wasted space in the file even worse.

^Readers should not confuse the terms field and record with the meanings given to them by some programming languages, including Pascal. In Pascal, a record is an aggregate data structure that can contain members of different types, where each member is referred to as a field. As we shall see, there is often a direct correspondence between these definitions of the terms and the fields and records that are used in files. However, the terms field and record as we use them have much more general meanings than they do in Pascal.

In C:

    struct {
        char last[10];
        char first[10];
        char address[15];
        char city[15];
        char state[2];
        char zip[9];
    } set_of_fields;

In Pascal:

    TYPE
        set_of_fields = RECORD
            last    : packed array [1..10] of char;
            first   : packed array [1..10] of char;
            address : packed array [1..15] of char;
            city    : packed array [1..15] of char;
            state   : packed array [1..2]  of char;
            zip     : packed array [1..9]  of char
        END;

FIGURE 4.2 Fixed-length fields.

Because of these difficulties, the fixed-field approach to structuring data is often inappropriate for data that inherently contain a large amount of variability in the length of fields, such as names and addresses. But there are kinds of data for which fixed-length fields are highly appropriate. If every field is already fixed in length, or if there is very little variation in field lengths, using a file structure consisting of a continuous stream of bytes organized into fixed-length fields is often a very good solution.

Method 2: Begin Each Field with a Length Indicator    Another way to make it possible to count to the end of a field involves storing the field length just ahead of the field, as illustrated in Fig. 4.3(b). If the fields are not too long (length less than 256 bytes), it is possible to store the length in a single byte at the start of each field.
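As a small illustration of this method, here is a C sketch (not one of the chapter's programs) that writes fields preceded by a one-byte length; the output file name and field values are arbitrary assumptions.

/* lenfield.c -- a hedged sketch of Method 2: each field is preceded by a
 * one-byte length count. The file name and field values are illustrative.
 */
#include <stdio.h>
#include <string.h>

/* write one length-prefixed field to the open file */
void write_field(FILE *fp, const char *field)
{
    unsigned char len = (unsigned char) strlen(field);  /* assumes < 256 */
    fwrite(&len, 1, 1, fp);          /* the one-byte length indicator */
    fwrite(field, 1, len, fp);       /* the field contents themselves */
}

int main(void)
{
    FILE *fp = fopen("namefile.dat", "wb");
    if (fp == NULL)
        return 1;
    write_field(fp, "Ames");
    write_field(fp, "John");
    write_field(fp, "123 Maple");
    fclose(fp);
    return 0;
}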

Method 3: Separate the Fields with Delimiters    We can also preserve the identity of fields by separating them with delimiters. All we need to do is choose some special character or sequence of characters that will not appear within a field and then insert that delimiter into the file after writing each field.

The choice of a delimiter character can be very important since it must be a character that does not get in the way of processing. In many instances white-space characters (blank, new line, tab) make excellent delimiters because they provide a clean separation between fields when we list them on the console. Also, most programming languages include I/O statements that, by default, assume that fields are separated by white space.

Unfortunately, white space would be a poor choice for our file since blanks often occur as legitimate characters within an address field.
Ames      John      123 Maple      Stillwater     OK74075377-1808
Mason     Alan      90 Eastgate    Ada            OK74820

(a) Field lengths fixed. Place blanks in the spaces where the phone number would go.

Ames|John|123 Maple|Stillwater|OK|74075|377-1808|
Mason|Alan|90 Eastgate|Ada|OK|74820||

(b) Delimiters are used to indicate the end of a field. Place the delimiter for the "empty" field immediately after the delimiter for the previous field.

...Stillwater|OK|74075|377-1808|#Mason|Alan|90 Eastgate|Ada|OK|74820|#...

(c) Place the field for business phone at the end of the record. If the end-of-record mark is encountered, assume that the field is missing.

SURNAME=Ames|FIRSTNAME=John|STREET=123 Maple| ... |ZIP=74075|PHONE=377-1808|

(d) Use a keyword to identify each field. If the keyword is missing, the corresponding field is assumed to be missing.

FIGURE 4.3 Four methods for organizing fields within records to account for possible missing fields. In the examples, the second record is missing the phone number.

Therefore, instead of white space we use the vertical bar character as our delimiter, so our file appears as in Fig. 4.3(c). Readers should modify the original stream-of-bytes programs, writstrm.c and writstrm.pas (found in the C and Pascal Programs sections at the end of this chapter), changing them so they place a delimiter after each field. We use this delimited field format in the next few sample programs.
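The modification just asked for can be sketched as follows in C. This is only an illustrative fragment under assumed names; the complete writstrm.c program appears at the end of the chapter.

/* delimfield.c -- a hedged sketch of Method 3: each field is followed by
 * a delimiter character. Field values and the file name are illustrative.
 */
#include <stdio.h>
#include <string.h>

#define DELIMITER '|'

/* write the field contents, then the delimiter that marks its end */
void write_field(FILE *fp, const char *field)
{
    fwrite(field, 1, strlen(field), fp);
    fputc(DELIMITER, fp);
}

int main(void)
{
    FILE *fp = fopen("namefile.dat", "w");
    if (fp == NULL)
        return 1;
    write_field(fp, "Ames");
    write_field(fp, "John");
    write_field(fp, "123 Maple");
    write_field(fp, "Stillwater");
    write_field(fp, "OK");
    write_field(fp, "74075");
    fclose(fp);
    return 0;
}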

Method 4: Use a "Keyword = Value" Expression to Identify Fields    This option, illustrated in Fig. 4.3(d), has an advantage that the others do not: It is the first structure in which a field provides information about itself. Such self-describing structures can be very useful tools for organizing files in many applications. It is easy to tell what fields are contained in a file, even if we don't know ahead of time what fields the file is supposed to contain. It is also a good format for dealing with missing fields. If a field is missing, this format makes it obvious, because the keyword is simply not there.

You may have noticed in Fig. 4.3(d) that this format is used in combination with another format, a delimiter to separate fields. While this may not always be necessary, in this case it is helpful because it shows the division between each value and the keyword for the following field.

Unfortunately, for the address file this format also wastes a lot of space. Fifty percent or more of the file's space could be taken up by the keywords. But there are applications in which this format does not demand so much overhead. We discuss some of these applications in section 4.5.
4.1.3 Reading a Stream of Fields

Given modified versions of writstrm.c and writstrm.pas that use delimiters to


we can write a program called readstrm that reads the stream
of bytes back in, breaking the stream into fields. It is convenient to conceive
of the program on two levels, as shown in the pseudocode description
provided in Fig. 4.4. The outer level of the program opens the file and then
calls the function readfield( ) until readfield( ) returns a field length of zero,
indicating that there are no more fields to read. The readfield( ) function, in
turn, works through the file, character by character, collecting characters
into a field until the function encounters a delimiter or the end of the file.
The function returns a count of the characters that are found in the field.
Implementations of readstrm in both C and Pascal are included with the
programs at the end of this chapter.
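Before turning to those listings, it may help to see what the readfield( ) logic could look like in C. The following is only a sketch under our own assumptions (a '|' delimiter and a caller-supplied character array); it is not the chapter's actual listing.

#include <stdio.h>

#define DELIM '|'                /* field delimiter used in these examples */

/* Read one delimited field from fp into field[].
   Return the number of characters in the field; 0 means end of file. */
int readfield(FILE *fp, char field[])
{
    int ch;
    int i = 0;

    while ((ch = getc(fp)) != EOF && ch != DELIM)
        field[i++] = (char) ch;
    field[i] = '\0';             /* make the field a C string */
    return i;
}

A caller can invoke readfield( ) in a loop, exactly as the pseudocode in Fig. 4.4 does, stopping when the returned length is zero.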
When this program is run using our delimited-field version of the file
containing data for John Ames and Alan Mason, the output looks like this:

Field #  1: Ames
Field #  2: John
Field #  3: 123 Maple
Field #  4: Stillwater
Field #  5: OK
Field #  6: 74075
Field #  7: Mason
Field #  8: Alan
Field #  9: 90 Eastgate
Field # 10: Ada
Field # 11: OK
Field # 12: 74820


Define Constant:  DELIMITER = '|'

PROGRAM:  readstrm

    get input file name and open as INPUT
    initialize FIELD_COUNT
    FIELD_LENGTH := readfield (INPUT, FIELD_CONTENT)
    while (FIELD_LENGTH > 0)
        increment the FIELD_COUNT
        write FIELD_COUNT and FIELD_CONTENT to the screen
        FIELD_LENGTH := readfield (INPUT, FIELD_CONTENT)
    endwhile
    close INPUT
end PROGRAM

FUNCTION:  readfield (INPUT, FIELD_CONTENT)

    initialize I
    initialize CH
    while (not EOF (INPUT) and CH does not equal DELIMITER)
        read a character from INPUT into CH
        increment I
        FIELD_CONTENT [I] := CH
    endwhile
    return (length of field that was read)
end FUNCTION
FIGURE 4.4 Program to read fields from a file and display them on the screen.

Clearly, we now preserve the notion of a field as we store and retrieve these data. But something is still missing. We do not really think of this file as a stream of fields. In fact, the fields need to be grouped into sets. The first six fields are a set associated with someone named John Ames. The next six are a set of fields associated with Alan Mason. We call these sets of fields records.


4.1.4 Record Structures

A record can be defined as a set of fields that belong together when the file is viewed in terms of a higher level of organization. Like the notion of a field, a record is another conceptual tool. It is another level of organization that we impose on the data in order to preserve meaning. Records do not necessarily exist in the file in any physical sense, yet they are an important logical notion included in the file's structure.

Here are some of the most often used methods for organizing a file into records:

- Require that the records be a predictable number of bytes in length.
- Require that the records be a predictable number of fields in length.
- Begin each record with a length indicator consisting of a count of the number of bytes that the record contains.
- Use a second file to keep track of the beginning byte address for each record.
- Place a delimiter at the end of each record to separate it from the next record.
Method 1: Make Records a Predictable Number of Bytes (Fixed-length Records)    A fixed-length record file is one in which each record contains the same number of bytes. This method of recognizing records is analogous to the first method we discussed for making fields recognizable. As we will see in the chapters that follow, fixed-length record structures are among the most commonly used methods for organizing files.

The C structure set_of_fields (or the Pascal RECORD of the same name) that we define in our discussion of fixed-length fields is actually an example of a fixed-length record as well as an example of fixed-length fields. We have a fixed number of fields, each with a predetermined length, which combine to make a fixed-length record. This kind of field and record structure is illustrated in Fig. 4.5(a).

It is important to realize, however, that fixing the number of bytes in a record does not imply that the sizes or number of fields in the record must be fixed. Fixed-length records are frequently used as containers to hold variable numbers of variable-length fields. It is also possible to mix fixed- and variable-length fields within a record. Figure 4.5(b) illustrates how variable-length fields might be placed in a fixed-length record.
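As a rough illustration, a C declaration along these lines could serve as a fixed-length record built from fixed-length fields; the member names and widths here are our own assumptions rather than the declaration used in the programs at the end of the chapter.

/* A fixed-length record made entirely of fixed-length character fields.
   The field widths are illustrative choices. */
struct set_of_fields {
    char last[11];        /* surname          */
    char first[11];       /* first name       */
    char address[16];     /* street address   */
    char city[16];
    char state[3];
    char zip[10];
};

Because every member has a fixed size, a call such as fwrite(&rec, sizeof rec, 1, fp) writes exactly the same number of bytes for every record.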

Method 2: Make Records a Predictable Number of Fields    Rather than specifying that each record in a file will contain some fixed number of bytes, we can specify that it will contain a fixed number of fields. This is a good way to organize the records in the name and address file we have been looking at.

FIGURE 4.5 Three ways of making the lengths of records constant and predictable. (a) Counting bytes: fixed-length records with fixed-length fields. (b) Counting bytes: fixed-length records with variable-length fields. (c) Counting fields: six fields per record.

The writstrm program asks for six pieces of information for every person, so there are six contiguous fields in the file for each record (Fig. 4.5c). We could modify readstrm to recognize fields simply by counting the fields modulo six, outputting record boundary information to the screen every time the count starts over.

Method 3: Begin Each Record with a Length Indicator    We can communicate the length of records by beginning each record with a field containing an integer that indicates how many bytes there are in the rest of the record (Fig. 4.6a). This is a commonly used method for handling variable-length records. We look at it more closely in the next section.

Method 4: Use an Index to Keep Track of Addresses    We can use an index to keep a byte offset for each record in the original file. The byte offsets allow us to find the beginning of each successive record and also let us compute the length of each record. We look up the position of a record in the index and then seek to the record in the data file. Figure 4.6(b) illustrates this two-file mechanism.

Method 5: Place a Delimiter at the End of Each Record    This option, at a record level, is exactly analogous to the solution we used to keep the fields distinct in the sample program we developed. As with fields, the delimiter character must not get in the way of processing. Because we often want to read files directly at our console, a common choice of a record delimiter for files that contain readable text is the end-of-line character (carriage return/new-line pair or, on UNIX systems, just a new-line character '\n'). In Fig. 4.6(c) we use a '#' character as the record delimiter.

4.1.5 A Record Structure That Uses a Length Indicator

Not one of these approaches to preserving the idea of a record in a file is appropriate for all situations. Selection of a method for record organization depends on the nature of the data and on what you need to do with it. We begin by looking at a record structure that uses a record-length field at the beginning of the record. This approach lets us preserve the variability in the length of records that is inherent in our initial stream file.

Writing the Variable-length Records to the File    We call the program that builds this new, variable-length record structure writrec. The set of programs at the end of this chapter contains versions of this program in C and Pascal. Implementing this program is partially a matter of building on the writstrm program that we created earlier in this chapter, but it also involves addressing some new problems:

FIGURE 4.6 Record structures for variable-length records. (a) Beginning each record with a length indicator. (b) Using an index file to keep track of record addresses. (c) Placing the delimiter '#' at the end of each record.

(a) 40 Ames|John|123 Maple|Stillwater|OK|74075|36 Mason|Alan|90 Eastgate| ...

(b) Data file:  Ames|John|123 Maple|Stillwater|OK|74075|Mason|Alan| ...
    Index file: 00   40

(c) Ames|John|123 Maple|Stillwater|OK|74075|#Mason|Alan|90 Eastgate|Ada|OK| ...

- If we want to put a length indicator at the beginning of every record (before any other fields), we must know the sum of the lengths of the fields in each record before we can begin writing the record to the file. We need to accumulate the entire contents of a record in a buffer before writing it out.
- In what form should we write the record-length field to the file? As a binary integer? As a series of ASCII characters?

The concept of buffering is one we run into again and again as we work with files. In the case of writrec, the buffer can simply be a character array into which we place the fields and field delimiters as we collect them. Resetting the buffer length to zero and adding information to the buffer can be handled using the loop logic provided in Fig. 4.7.

Representing the Record Length    The question of how to represent the record length is a little more difficult. One option would be to write the length in the form of a two-byte binary integer before each record. This is a natural solution in C, since it does not require us to go to the trouble of converting the record length into character form. Furthermore, we can represent much bigger numbers with an integer than we can with the same number of ASCII bytes (e.g., 32,767 versus 99).

FIGURE 4.7 Main program logic for writrec.

get LAST name as input
while (LAST name has a length > 0)
    set length of string in BUFFER to zero
    concatenate:  BUFFER + LAST name + DELIMITER
    while (input fields exist for record)
        get the FIELD
        concatenate:  BUFFER + FIELD + DELIMITER
    endwhile
    write length of string in BUFFER to the file
    write the string in BUFFER to the file
    get LAST name as input
endwhile

It is also conceptually interesting, since it illustrates the use of a fixed-length, binary field in combination with variable-length character fields.

Although we could use this same solution for a Pascal implementation, we might choose, instead, to account for some important differences between C and Pascal:

- Unlike C, Pascal automatically converts binary integers into character representations of those integers if we are writing to a text file. Consequently, it is no trouble at all to convert the record length into a character form: It happens automatically.
- In Pascal, a file is defined as a sequence of elements of a single type. Since we have a file of variable-length strings of characters, the natural type for the file is that of a character.

In short, the easiest thing to do in C is to store the integers in the file as fixed-length, two-byte fields containing integers. In Pascal it is easier to make use of the automatic conversion of integers into characters for text files. File structure design is always an exercise in flexibility. Neither of these approaches is correct; good design consists of choosing the approach that is most appropriate for a given language and computing environment. In the programs included at the end of this chapter, we have implemented our record structure both ways, using integer-length fields in C and character representations in Pascal. The output from the Pascal implementation is shown in Fig. 4.8. Each record now has a record-length field preceding the data fields. This field is delimited by a blank. For example, the first record (for John Ames) contains 40 characters, counting from the first 'A' in "Ames" to the final delimiter after "74075," so the characters '4' and '0' are placed before the record, followed by a blank.

Since the C version of writrec uses binary integers for the record length, we cannot simply print it to a console screen. We need a way to interpret the noncharacter portion of the file. For this, we introduce in the next section the file dump, a valuable tool for viewing the contents of files. But first, let's look at a program that will read in any file that is written by writrec.

Reading the Variable-length Records from the File    Given our file structure of variable-length records preceded by record-length fields, it is easy to write a program that reads through the file, record by record, displaying the fields from each of the records on the screen. The program logic is shown in Fig. 4.9.

FIGURE 4.8 Records preceded by record-length fields in character form.

40 Ames|John|123 Maple|Stillwater|OK|74075|36 Mason|Alan|90 Eastgate|Ada|OK|74820|


PROGRAM:  readrec

    open input file as INP_FILE
    initialize SCAN_POS to 0
    RECORD_LENGTH := get_rec (INP_FILE, BUFFER)
    while (RECORD_LENGTH > 0)
        SCAN_POS := get_fld (FIELD, BUFFER, SCAN_POS, RECORD_LENGTH)
        while (SCAN_POS > 0)
            print FIELD on the SCREEN
            SCAN_POS := get_fld (FIELD, BUFFER, SCAN_POS, RECORD_LENGTH)
        endwhile
        RECORD_LENGTH := get_rec (INP_FILE, BUFFER)
    endwhile
end PROGRAM

FUNCTION:  get_rec (INP_FILE, BUFFER)

    if EOF (INP_FILE) then return 0
    read the RECORD_LENGTH
    read the record contents into the BUFFER
    return the RECORD_LENGTH
end FUNCTION

FUNCTION:  get_fld (FIELD, BUFFER, SCAN_POS, RECORD_LENGTH)

    if SCAN_POS == RECORD_LENGTH then return 0
    get a character CH at the SCAN_POS in the BUFFER
    while (SCAN_POS < RECORD_LENGTH and CH is not a DELIMITER)
        place CH into the FIELD
        increment the SCAN_POS
        get a character CH at the SCAN_POS in the BUFFER
    endwhile
    return the SCAN_POS
end FUNCTION

FIGURE 4.9 Main program logic for readrec, along with functions get_rec( ) and get_fld( ).

The main program calls the function get_rec( ) that reads records into a buffer; this call continues until get_rec( ) returns a value of 0. Once get_rec( ) places a record's contents into a buffer, the buffer is passed to the function get_fld( ). The call to get_fld( ) includes the scanning position (SCAN_POS) in the argument list. Starting at the SCAN_POS, get_fld( ) reads characters from the buffer into a field until either a delimiter or the end of the record is reached. Function get_fld( ) returns the SCAN_POS for use on the next call. Implementations of writrec and readrec in both C and Pascal are included along with the other programs at the end of this chapter.
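As a concrete illustration of the length-indicator idea, here is a minimal C sketch of a get_rec( )-style routine; the use of fread, an unsigned short length field, and the buffer handling are our own assumptions, not the chapter's actual listing.

#include <stdio.h>

/* Read one variable-length record preceded by a two-byte binary length.
   Return the record length, or 0 at end of file. */
int get_rec(FILE *fp, char buffer[])
{
    unsigned short rec_len;

    if (fread(&rec_len, sizeof rec_len, 1, fp) != 1)
        return 0;                          /* no more records        */
    fread(buffer, 1, rec_len, fp);         /* read the record itself */
    buffer[rec_len] = '\0';
    return rec_len;
}

The matching write operation simply reverses the steps: write the two-byte length with fwrite( ) and then write the contents of the buffer.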

4.1.6 Mixing Numbers and Characters: Use of a File Dump

File dumps give us the ability to look inside a file at the actual bytes that are stored there. Consider, for instance, the record-length information in the Pascal program output that we were examining a moment ago. The length of the Ames record, which is the first one in the file, is 40 characters, including delimiters. In the Pascal version of writrec, where we store the ASCII character representation of this decimal number, the actual bytes stored in the file look like the representation in Fig. 4.10(a). In the C implementation, where we choose to represent the length field as a two-byte integer, the bytes look like the representation in Fig. 4.10(b).

As you can see, the number 40 is not the same as the set of characters '4' and '0'. The hex value of the binary integer 40 is 0x28; the hex values of the characters '4' and '0' are 0x34 and 0x30. (We are using the C language convention of identifying hexadecimal numbers through the use of the prefix 0x.) So, when we are storing a number in ASCII form, it is the hex values of the ASCII characters that go into the file, not the hex value of the number itself.

FIGURE 4.10 The number 40, stored as ASCII characters and as a short integer.

                                     Decimal value   Hex value stored   ASCII
                                     of number       in bytes           character form
(a) 40 stored as ASCII chars:             40              34 30           '4'  '0'
(b) 40 stored as a 2-byte integer:        40              00 28           '\0' '('

Figure 4.10(b) shows the byte representation of the number 40 stored as an integer (this is called storing the number in binary form, even though we usually view the output as a hexadecimal number). Now the hexadecimal value stored in the file is that of the number itself. The ASCII characters that happen to be associated with the number's actual hexadecimal value have no obvious relationship to the number. Here is what the version of the file that uses binary integers for record lengths looks like if we simply print it on a terminal screen:

 (Ames|John|123 Maple|Stillwater|OK|74075| $Mason|Alan| ...

(Each record's length field prints oddly: the '\0' byte is unprintable and appears as a blank, and the other byte prints as whatever character shares its value, here '(' because 0x28 is the ASCII code for '('.)

The ASCII representations of characters and numbers in the actual record come out nicely enough, but the binary representations of the length fields are displayed cryptically. Let's take a different look at the file, this time using the UNIX dump utility od. Entering the UNIX command

    od -xc < filename

produces the following:

Offset                              Values

0000000  \0   (   A   m   e   s   |   J   o   h   n   |   1   2   3
         0028    416d    6573    7c4a    6f68    6e7c    3132    3320
0000020   M   a   p   l   e   |   S   t   i   l   l   w   a   t   e   r
         4d61    706c    657c    5374    696c    6c77    6174    6572
0000040   |   O   K   |   7   4   0   7   5   |  \0   $   M   a   s   o
         7c4f    4b7c    3734    3037    357c    0024    4d61    736f
0000060   n   |   A   l   a   n   |   9   0       E   a   s   t   g   a
         6e7c    416c    616e    7c39    3020    4561    7374    6761
0000100   t   e   |   A   d   a   |   O   K   |   7   4   8   2   0   |
         7465    7c41    6461    7c4f    4b7c    3734    3832    307c

As you can see, the display is divided into three different kinds of data. The
column on the left labeled Offset gives the offset of the first byte of the row
that is being displayed. The byte offsets are given in octal form; since each
line contains 16 (decimal) bytes, moving from one line to the next adds 020
to the range. Every pair of lines in the printout contains interpretations of
the bytes in the file in hexadecimal and ASCII. These representations were
requested on the command line with the -xc flag (x = "hex;" c =
"character").
Let's look at the first row of ASCII values. As you would expect, the data placed in the file in ASCII form appears in this row in a readable way. But there are hexadecimal values for which there is no printable ASCII representation. The only such value appearing in this file is 0x00. But there could be many others. For example, the hexadecimal value of the number 500,000,000 is 0x1DCD6500. If you write this value out to a file, an od of the file with the option -xc looks like this:

0000000  \035 \315   e  \0
          1dcd    6500

The only printable byte in this file is the one with the value 0x65 ('e'). Od handles all of the others by listing their equivalent octal values in the ASCII representation.

The hex dump of this output from the C version of writrec shows how this file structure represents an interesting mix of the organizational tools we have encountered. In a single record we have both binary and ASCII data. Each record consists of a fixed-length field (the byte count) and several delimited, variable-length fields. This kind of mixing of different data types and organizational methods is common in real-world file structures.
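If you want to produce a small file like this for your own experiments with od, a few lines of C are enough; the file name and the choice of a short integer are assumptions made only for this example.

#include <stdio.h>

int main(void)
{
    FILE *fp = fopen("mixed", "wb");
    short len = 40;

    fwrite(&len, sizeof len, 1, fp);   /* two bytes holding the binary value 40      */
    fputs("40", fp);                   /* two bytes holding the characters '4', '0'  */
    fclose(fp);
    return 0;
}

Dumping the resulting four-byte file with od -xc shows the contrast between the binary and ASCII representations of the same number.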

A Note about Byte Order    If the computer you are using is an IBM PC or a computer from DEC, such as a VAX, your octal dump for this file will probably be different from the one we see here. These machines store the values of numbers in reverse order from the way we think of them. For example, if this dump were executed on an IBM PC, the hex representation of the first two-byte value in the file would be 0x2800, rather than 0x0028. This reverse order also applies to long, four-byte integers on these machines. This is an aspect of files that you need to be aware of if you expect to make sense out of dumps like this one. A more serious consequence of the byte-order differences among machines occurs when we move files from a machine with one type of byte ordering to one with a different byte ordering. We discuss this problem and ways to deal with it in section 4.6, "Portability and Standardization."

4.2  Record Access

4.2.1 Record Keys

Since our new file structure so clearly focuses on the notion of a record as the quantity of information that is being read or written, it makes sense to think in terms of retrieving just one specific record rather than having to read all the way through the file, displaying everything. When looking for an individual record, it is convenient to identify the record with a key based on the record's contents. For example, in our name and address file we might want to access the "Ames record" or the "Mason record" rather than thinking in terms of the "first record" or "second record." (Can you remember which record comes first?) This notion of a key is another fundamental conceptual tool. We need to develop a more exact idea of what a key is.

When we are looking for a record containing the last name Ames, we want to recognize it even if the user enters the key in the form "AMES", "ames", or "Ames". To do this, we must define a standard form for keys, along with associated rules and procedures for converting keys into this standard form. A standard form of this kind is often called a canonical form for the key. One meaning of the word canon is rule, and the word canonical means conforming to the rule. A canonical form for a search key is the single representation for that key that conforms to the rule.

As a simple example, we could state that the canonical form for a key requires that the key consist solely of uppercase letters and have no extra blanks at the end. So, if a user enters "Ames", we would convert the key to the canonical form "AMES" before searching for the record.

It is often desirable to have distinct keys, or keys that uniquely identify a single record. If there is not a one-to-one relationship between the key and a single record, then the program has to provide additional mechanisms to allow the user to resolve the confusion that can result when more than one record fits a particular key. Suppose, for example, that we are looking for John Ames's address. If there are several records in the file for several different people named John Ames, how should the program respond? Certainly it should not just give the address of the first John Ames that it finds. Should it give all the addresses at once? Should it provide a way of scrolling through the records?

The simplest solution is to prevent such confusion. The prevention takes place as new records are added to the file. When the user enters a new record, we form a unique canonical key for that record and then search the file for that key. This concern about uniqueness applies only to primary keys. A primary key is, by definition, the key that is used to identify a record uniquely.
It is also possible, as we see later, to search on secondary keys. An example of a secondary key might be the city field in our name and address file. If we wanted to find all the records in the file for people who live in towns named Stillwater, we would use some canonical form of "Stillwater" as a secondary key. Typically, secondary keys do not uniquely identify a record.

Although a person's name might at first seem to be a good choice for a primary key, a person's name runs a high risk of failing the test for uniqueness. A name is a perfectly fine secondary key, and it is in fact often an important secondary key in a retrieval system, but there is too great a likelihood that two names in the same file will be identical.

The reason a name is a risky choice for a primary key is that it contains a real data value. In general, primary keys should be dataless. Even when we think we are choosing a unique key, if it contains data there is danger that unforeseen identical values could occur. Sweet (1985) cites an example of a file system that used a person's Social Security number as a primary key for personnel records. It turned out that, in the particular population that was represented in the file, a large number of people who were not U.S. citizens were included, and in a different part of the organization all of these people had been assigned the Social Security number 999-99-9999!

Another reason, other than uniqueness, that a primary key should be dataless is that a primary key should be unchanging. If information that corresponds to a certain record changes, and that information is contained in a primary key, what do you do about the primary key? You probably cannot change the primary key itself, in most cases, because there are likely to be reports, memos, indexes, or other sources of information that refer to the record by its primary key. As soon as you change the key, those references become useless.

A good rule of thumb is to avoid trying to put data into primary keys. If we want to access records according to data content, we should assign this content to secondary keys. We give a more detailed look at record access by primary and secondary keys in Chapter 6. For the rest of this chapter, we suspend our concern about whether a key is primary or secondary and concentrate simply on finding things by key.
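As a small illustration of putting a key into canonical form, the following C function applies the example rule given earlier (uppercase letters only, no trailing blanks); the function name is our own.

#include <ctype.h>
#include <string.h>

/* Copy key into canon in canonical form:
   uppercase letters, trailing blanks removed. */
void make_canonical(const char *key, char *canon)
{
    int i;
    int len = (int) strlen(key);

    while (len > 0 && key[len - 1] == ' ')    /* drop trailing blanks */
        len--;
    for (i = 0; i < len; i++)
        canon[i] = (char) toupper((unsigned char) key[i]);
    canon[i] = '\0';
}

A search routine would apply the same conversion to the key it derives from each record before making comparisons.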

4.2.2 A Sequential Search

Now that you know about keys, you should be able to write a program that reads through the file, record by record, looking for a record with a particular key. Such sequential searching is just a simple extension of our readrec program, adding a comparison operation to the main loop to see if the key for the record matches the key we are seeking. We leave the actual program as an exercise.

Evaluating Performance of Sequential Search In the chapters that


follow, we find ways to search for records that are faster than the sequential
search mechanism. We can use sequential searching as a kind of baseline
against which to measure the improvements that we make. It is important,
therefore, to find some way of expressing the amount of time and work
expended in a sequential search.


Developing a performance measure requires that we decide on a unit of work that usefully represents the constraints on the performance of the whole process. When we describe the performance of searches that take place in electronic RAM, where comparison operations are more expensive than fetch operations to bring data in from memory, we usually use the number of comparisons required for the search as the measure of work. But, given that the cost of a comparison in RAM is so small compared to the cost of a disk access, comparisons do not fairly represent the performance constraints for a search through a file on secondary storage. Instead, we count low-level READ( ) calls. We assume that each READ( ) call requires a seek and that any one READ( ) call is as costly as any other. We know from the discussions of matters such as system buffering in Chapter 3 that these assumptions are not strictly accurate. But, in a multiuser environment where many processes are using the disk at once, they are close enough to correct to be useful.

Suppose we have a file with 1,000 records and we want to use a sequential search to find Al Smith's record. How many READ( ) calls are required? If Al Smith's record is the first one in the file, the program has to read in only a single record. If it is the last record in the file, the program makes 1,000 READ( ) calls before concluding the search. For an average search, 500 calls are needed.

If we double the number of records in a file, we also double both the average and the maximum number of READ( ) calls required for a sequential search. Using a sequential search to find Al Smith's record in a file of 2,000 records requires, on the average, 1,000 calls. In other words, the amount of work required for a sequential search is directly proportional to the number of records in the file.

In general, the work required to search sequentially for a record in a file with n records is proportional to n: it takes at most n comparisons; on average it takes approximately n/2 comparisons. A sequential search is said to be of the order O(n) because the time it takes is proportional to n.*

*If you are not familiar with this "big-oh" notation, you should look it up. Knuth (1973a) is a good source.

Improving Sequential Search Performance with Record Blocking    It is interesting and useful to apply some of the information from Chapter 3 about disk performance to the problem of improving sequential search performance. We learned in Chapter 3 that the major cost associated with a disk access is the time required to perform a seek to the right location on the disk. Once data transfer begins, it is relatively fast, although still much slower than a data transfer within RAM. Consequently, the cost of seeking and reading a record and then seeking and reading another record is greater than the cost of seeking just once and then reading two successive records. (Once again, we are assuming a multiuser environment in which a seek is required for each separate READ( ) call.) It follows that we should be able to improve the performance of sequential searching by reading in a block of several records all at once and then processing that block of records in RAM.
We began this chapter with a stream of bytes. We grouped the bytes into fields, and then grouped the fields into records. Now we are considering a yet higher level of organization, grouping records into blocks. This new level of grouping, however, differs from the others. Whereas fields and records are ways of maintaining the logical organization within the file, blocking is done strictly as a performance measure. As such, the block size is usually related more to the physical properties of the disk drive than to the content of the data. For instance, on sector-oriented disks the block size is almost always some multiple of the sector size.

Suppose we have a file of 4,000 records and that the average length of a record is 512 bytes. If our operating system uses sector-sized buffers of 512 bytes, then an unblocked sequential search requires, on the average, 2,000 READ( ) calls before it can retrieve a particular record. By blocking the records in groups of 16 per block, so each READ( ) call brings in 8 kilobytes worth of records, the number of reads required for an average search comes down to 125. Each READ( ) requires slightly more time, since more data is transferred from the disk, but this is a cost that is usually well worth paying for such a large reduction in the number of reads.

There are several things to note from this analysis and discussion of record blocking:
- Although blocking can result in substantial performance improvements, it does not change the order of the sequential search operation. The cost of searching is still O(n), increasing in direct proportion to increases in the size of the file.
- Blocking clearly reflects the differences between RAM access speed and the cost of accessing secondary storage.
- Blocking does not change the number of comparisons that must be done in RAM, and it probably increases the amount of data transferred between disk and RAM. (We always read a whole block, even if the record we are seeking is the first one in the block.)
- Blocking saves time because it decreases the amount of seeking. We find, again and again, that this differential between the cost of seeking and the cost of other operations, such as data transfer or RAM access, is the force that drives file structure design.
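The arithmetic behind the 2,000-versus-125 comparison above is easy to restate in a few lines of C; the numbers are simply those of the example.

#include <stdio.h>

int main(void)
{
    long records   = 4000;    /* records in the file                    */
    long per_block = 16;      /* records brought in by one READ( ) call */

    /* On average, half the file is read before the record is found. */
    printf("unblocked search: %ld reads\n", records / 2);               /* 2000 */
    printf("blocked search:   %ld reads\n", (records / per_block) / 2); /*  125 */
    return 0;
}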


When Sequential Searching Is Good    Much of the remainder of this text is devoted to identifying better ways to access individual records; sequential searching is just too expensive for most serious retrieval situations. This is unfortunate, because sequential access has two major practical advantages over other types of access: It is extremely easy to program, and it requires the simplest of file structures.

Whether sequential search is advisable depends largely on how the file is to be used, how fast the computer system is that is performing the search, and structural aspects of the file. There are many situations in which a sequential search is often reasonable. Here are some examples:

- ASCII files in which you are searching for some pattern (see grep in the next section);
- Files with few records (e.g., 10 records);
- Files that hardly ever need to be searched (e.g., tape files usually used for other kinds of processing); and
- Files in which you want all records with a certain secondary key value, where a large number of matches is expected.

Fortunately, these sorts of applications do occur often in day-to-day computing, so often, in fact, that operating systems provide many utilities for performing sequential processing. UNIX is one of the best examples of this, as we see in the next section.

4.2.3 UNIX Tools for Sequential Processing

Recognizing the importance of having a standard file structure that is simple and easy to program, the most common file structure that occurs in UNIX is an ASCII file with the new-line character as the record delimiter and, when possible, white space as the field delimiter. Practically all files that we create using UNIX editors use this structure. And since most of the built-in C and Pascal functions that perform I/O write to this kind of file, it is common to see data files that consist of fields of numbers or words separated by blanks or tabs, and records separated by new-line characters. Such files are simple and easy to process. We can, for instance, generate an ASCII file with a simple program, and then use an editor to browse through it or alter it. UNIX provides a rich array of tools for working with files in this form. Since this kind of file structure is inherently sequential (records are variable in length, so we have to pass from record to record to find any particular field or record), many of these tools process files sequentially.

Suppose, for instance, that we choose the white-space/new-line structure for our address file, ending every field with a tab and ending every record with a new line. While this causes some problems in distinguishing fields (a blank is white space, but it doesn't separate a field), and in that sense is not an ideal structure, it buys us something very valuable: the full use of those UNIX tools that are built around the white-space/new-line structure. For example, we can print the file on our console using any of a number of utilities, such as cat:

>cat myfile
Ames    John    123 Maple       Stillwater      OK      74075
Mason   Alan    90 Eastgate     Ada             OK      74820

Or we can use tools like wc and grep for processing the files.

("word count") reads through an ASCII file


number of lines (delimited by new lines), words
(delimited by white space), and characters in a file:

sequentially and counts the

>wc myf

i 1

grep

It is

common

character string in
sequentially,

(and

it.

want to find out


For ASCII files

recognize. In
a pattern,

its

if a text file

has a certain

word

or

that can reasonably be searched

provides an excellent

variants egrep and fgrep).

its

file

to

UNIX

regular expression,"

the

76

14

filter

The word

for

doing

this called grep

grep stands for "generalized

which describes the type of pattern

that grep

is

simplest form, grep searches sequentially through a

and then returns to standard output

(the console)

all

able to
file

for

the lines in

that contain the pattern.

>grep Ada myf


Mason
Alan

i 1

90 Eastgate Ada

We can also combine tools to create,

on the

OK

>grep Ada

some very powerful file


number of words in all

fly,

processing software. For example, to find the


records containing the

74820

word Ada:

wc

36

As we move through the text we will encounter


powerful UNIX commands that sequentially process
white-space/new-line structure.

number of

files

other

with the basic

4.2.4 Direct Access

The most radical alternative to searching sequentially through a file for a record is a retrieval mechanism known as direct access. We have direct access to a record when we can seek directly to the beginning of the record and read it in. Whereas sequential searching is an O(n) operation, direct access is O(1); no matter how large the file is, we can still get to the record we want with a single seek.

Direct access is predicated on knowing where the beginning of the required record is. Sometimes this information about record location is carried in a separate index file. But, for the moment, we assume that we do not have an index. We assume, instead, that we know the relative record number (RRN) of the record that we want. The idea of an RRN is an important concept that emerges from viewing a file as a collection of records rather than a collection of bytes. If a file is a sequence of records, then the RRN of a record gives its position relative to the beginning of the file. The first record in a file has RRN 0, the next has RRN 1, and so forth.†

In our name and address file, we might tie a record to its RRN by assigning membership numbers that are related to the order in which we enter the records in the file. The person with the first record might have a membership number of 1001, the second a number of 1002, and so on. Given a membership number, we can subtract 1001 to get the RRN of the record.

What can we do with this RRN? Not much, given the file structures we have been using so far, which consist of variable-length records. The RRN tells us the relative position of the record we want in the sequence of records, but we still have to read sequentially through the file, counting records as we go, to get to the record we want. An exercise at the end of this chapter explores a method of moving through the file called skip sequential processing, which can improve performance somewhat, but looking for a particular RRN is still an O(n) process.

To support direct access by RRN, we need to work with records of fixed, known length. If the records are all the same length, then we can use a record's RRN to calculate the byte offset of the start of the record relative to the start of the file. For instance, if we are interested in the record with an RRN of 546 and our file has a fixed-length record size of 128 bytes per record, we can calculate the byte offset as follows:

    Byte offset = 546 x 128 = 69,888.

In general, given a fixed-length record file where the record size is r, the byte offset of a record with an RRN of n is

    Byte offset = n x r.

†In keeping with the conventions of C and Turbo Pascal, we assume that the RRN is a zero-based count. In some file systems, the count starts at 1 rather than 0.

Programming languages and operating systems differ with regard to where this byte offset calculation is done and even with regard to whether byte offsets are used for addressing within files. In UNIX (and the MS-DOS operating systems), where a file is treated as just a sequence of bytes, the application program does the calculation and uses the lseek( ) command to jump to the byte that begins the record. All movement within a file is in terms of bytes. This is a very low-level view of files; the responsibility for translating an RRN into a byte offset belongs wholly to the application program.
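In C on a UNIX system, then, direct access by RRN reduces to a few lines; the record length and function name below are illustrative assumptions.

#include <unistd.h>     /* lseek( ), read( ) */

#define REC_LEN 128L    /* fixed record length, as in the example above */

/* Read the record with relative record number rrn into buf,
   where fd is an already-open file descriptor. */
long read_by_rrn(int fd, long rrn, char buf[])
{
    long offset = rrn * REC_LEN;     /* byte offset = RRN x record length */

    lseek(fd, offset, SEEK_SET);     /* jump straight to the record       */
    return read(fd, buf, REC_LEN);   /* then read the whole record        */
}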

The PL/I language and the operating environments in which PL/I is often used (OS/MVS, VMS) are examples of a much different, higher-level view of files. The notion of a sequence of bytes is simply not present when you are working with record-oriented files in this environment. Instead, files are viewed as collections of records that are accessed by keys. The operating system takes care of the translation between a key and a record's location. In the simplest case, the key is, in fact, just the record's RRN, but the determination of actual location within the file is still not the programmer's concern.
If we limit ourselves to the use of standard Pascal, the question of seeking by bytes or seeking by records is not an issue: There is no seeking at all in standard Pascal. But, as we said earlier, many implementations of Pascal extend the standard definition of the language to allow direct access to different locations in a file.

The nature of these extensions varies according to the differences in the host operating systems around which the extensions were developed. All the same, one feature that is consistent across implementations is that a file in Pascal always consists of elements of a single type. A file is a sequence of integers, characters, arrays, or records, and so on. Addressing is always in terms of this fundamental element size. For example, we might have a file of datarec, where datarec is defined as

TYPE datarec = packed array [0..64] of char;

Seeking within this file is in terms of multiples of the elementary unit datarec, which is to say in multiples of a 65-byte entity. If I ask to jump to datarec number 3 (zero-based count), I am jumping 195 bytes (3 x 65 = 195) into the file.

4.3  More about Record Structures


4.3.1 Choosing a Record Structure and Record Length

Once we decide to fix the length of our records so we can use the RRN to give us direct access to a record, we have to decide on a record length. Clearly, this decision is related to the size of the fields we want to store in the record. Sometimes the decision is easy. Suppose we are building a file of sales transactions that contain the following information about each transaction:

- A six-digit account number of the purchaser;
- Six digits for the date field;
- A five-character stock number for item purchased;
- A three-digit field for quantity; and
- A 10-position field for total cost.

These are all fixed-length fields; the sum of the field lengths is 30 bytes. Normally, we would simply stick with this record size, but if performance is so important that we need to squeeze every bit of speed out of our retrieval system, we might try to fit the record size to the block organization of our disk. For instance, if we intend to store the records on a typical sectored disk (see Chapter 3) with a sector size of 512 bytes or some other power of 2, we might decide to pad the record out to 32 bytes so we can place an integral number of records in a sector. That way, records will never span sectors.
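A C declaration for such a padded record might look like the following; the member names and the explicit two-byte pad are our own way of illustrating the idea, not a listing from the text.

/* Sales transaction record: 30 bytes of data padded to 32 bytes,
   so that exactly 16 records fit in a 512-byte sector. */
struct transaction {
    char account[6];     /* six-digit account number      */
    char date[6];        /* six digits for the date       */
    char stock_no[5];    /* five-character stock number   */
    char quantity[3];    /* three-digit quantity          */
    char cost[10];       /* ten positions for total cost  */
    char pad[2];         /* padding to reach 32 bytes     */
};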
The choice of a record length is more complicated when the lengths of
the fields can vary, as in our name and address file. If we choose a record
length that is the sum of our estimates of the largest possible values for all
the fields, we can be reasonably sure that we have enough space for
everything, but we also waste a lot of space. If, on the other hand, we are
conservative in our use of space and fix the lengths of fields at smaller
values, we may have to leave information out of a field. Fortunately, we can
avoid this problem to some degree through appropriate design of the field
structure within a record.
In our earlier discussion of record structures, we saw that there are two general approaches we can take toward organizing fields within a fixed-length record. The first, illustrated in Fig. 4.11(a), uses fixed-length fields inside the fixed-length record. This is the approach we took for the sales transaction file previously described. The second approach, illustrated in Fig. 4.11(b), uses the fixed-length record as a kind of standard-sized container for holding something that looks like a variable-length record.

The first approach has the virtue of simplicity: It is very easy to "break out" the fixed-length fields from within a fixed-length record. The second approach lets us take advantage of an averaging-out effect that usually occurs: The longest names are not likely to appear in the same record as the longest address field. By letting the field boundaries vary, we can make more efficient use of a fixed amount of space. Also, note that the two approaches are not mutually exclusive. Given a record that contains a number of truly fixed-length fields and some fields that have variable-length information, we might design a record structure that combines these two approaches.


FIGURE 4.11 Two fundamental approaches to field structure within a fixed-length record. (a) Fixed-length records with fixed-length fields. (b) Fixed-length records with variable-length fields.

The programs update.c and update.pas, which are included in the set of programs at the end of this chapter, use direct access to allow a user to retrieve a record, change it, and then write it back. These programs create a file structure that uses variable-length fields within fixed-length records. Given the variability in the length of the fields in our name and address file, this is an appropriate choice.

One of the interesting questions that must be resolved in the design of this kind of structure is that of distinguishing the real-data portion of the record from the unused-space portion. The range of possible solutions parallels that of the solutions for recognizing variable-length records in any other context: We can place a record-length count at the beginning of the record, we can use a special delimiter at the end of the record, we can count fields, and so on. Because both update.c and update.pas use a character string buffer to collect the fields, and because we are handling character strings differently in C than in Pascal (strings are null-terminated in C; we keep a byte count of the string length at the beginning of the Pascal strings), it is convenient to use a slightly different file structure for the two implementations. In the C version we fill out the unused portion of the record with null characters. In the Pascal version we actually place a fixed-length field (an integer) at the start of the record to tell how many of the bytes in the record are valid. As usual, there is no single right way to implement this file structure; instead we seek the solution that is most appropriate for our needs and situation.


Figure 4.12 shows the hex dump output from each of these programs. The output introduces a number of other ideas, such as the use of header records, which we discuss in the next section. For now, however, just look at the structure of the data records. We have italicized the length fields at the start of the records in the output from the Pascal program. Although we filled out the records created by the Pascal program with blanks to make the output more readable, this blank fill is unnecessary. The length field at the start of the record guarantees that we do not read past the end of the data in the record.
4.3.2 Header Records

It is often necessary or useful to keep track of some general information about a file to assist in future use of the file. A header record is often placed at the beginning of the file to hold this kind of information. For example, in some versions of Pascal there is no easy way to jump to the end of a file, even though the implementation supports direct access. One simple solution to this problem is to keep a count of the number of records in the file and to store that count somewhere. We might also find it useful to include information such as the length of the data records, the date and time of the file's most recent update, and so on. Header records can help make a file a self-describing object, freeing the software that accesses the file from having to know a priori everything about its structure, and hence making the file-access software able to deal with more variation in file structures.

The header record usually has a different structure than the data records in the file. The output from update.c, for instance, uses a 32-byte header record, whereas the data records each contain 64 bytes. Furthermore, the data records created by update.c contain only character data, whereas the header record contains an integer that tells how many data records are in the file.

Implementing a header record presents more of a challenge for the Pascal programmer. Recall that the Standard Pascal view of a file is one of a repeated collection of components, all of which are the same component type. Since a header record is fundamentally a different kind of record than the other records in a file, Pascal does not naturally support header records. In some cases, Pascal lets us get around this problem by using variant records. A variant record in Pascal is one that can have different meanings, depending on context. Unfortunately, its use as a header record is constrained by the fact that a variant record cannot vary in size, so it must be the same size as all other records in the file.

When faced with a language like Standard Pascal that strictly proscribes the types of records we can use in a file, we often find ourselves resorting to tricks.

FIGURE 4.12 Hex dump output from update.c and update.pas.

We use such a trick in update.pas: We just use the initial integer field in the record for a different purpose in the header record. In the data records this field holds a count of the bytes of valid data within the record; in the header record it holds a count of the data records in the file.

Header records are a widely used, important file design tool. For example, when we reach the point where we are discussing the construction of tree-structured indexes for files, we see that header records are often placed at the beginning of the index to keep track of matters such as the RRN of the record that is the root of the index. We investigate some more elaborate uses of header records later in this chapter and also in subsequent chapters.
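In C, where a file need not consist of components of a single type, a header can simply be a small structure written at byte 0 of the file; the particular fields shown here (a record count and a record length) are illustrative assumptions.

#include <stdio.h>

/* A simple header record kept at the front of the file.
   In practice it might be padded out to the size of a data record. */
struct header {
    long record_count;    /* number of data records in the file    */
    long record_length;   /* length of each fixed-size data record */
};

/* Rewrite the header, for example after a record has been added. */
void update_header(FILE *fp, struct header *h)
{
    fseek(fp, 0L, SEEK_SET);
    fwrite(h, sizeof *h, 1, fp);
}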

4.4  File Access and File Organization

In the course of our discussions in this chapter, we have looked at

- Variable-length records;
- Fixed-length records;
- Sequential access; and
- Direct access.

The first two of these relate to aspects of file organization. The second pair has to do with file access. The interaction between file organization and file access is a useful one; we need to look at it more closely before continuing with this chapter.

Most of what we have considered so far falls into the category of file organization:

- Can the file be divided into fields?
- Is there a higher level of organization to the file that combines the fields into records?
- Do all the records have the same number of bytes or fields?
- How do we distinguish one record from another?
- How do we organize the internal structure of a fixed-length record so we can distinguish between data and extra space?

We have seen that there are many possible answers to these questions and that the choice of a particular file organization depends on many things, including the file-handling facilities of the language you are using and the use you want to make of the file.

Using a file implies access. We looked first at sequential access, ultimately developing a sequential search. So long as we did not know where individual records began, sequential access was the only option open to us.


When we wanted

direct access, we fixed the length of our records, and this


allowed us to calculate precisely where each record began and to seek

directly to

it.

In other

words, our desire for direct

fixed-length record

file

organization.

Does

access

this

caused us to choose

mean

that

we

can equate

fixed-length records with direct access? Definitely not. There

is nothing
about our having fixed the length of the records in a file that precludes
sequential access; we certainly could write a program that reads sequentially

through

fixed-length record

Not only

can

sequentially, but

we

we

elect

file.

to

read through the fixed-length records

can also provide direct access to variable-length records

simply by keeping a list of the byte offsets from the start of the file for the
placement of each record. We chose a fixed-length record structure in
update. c and update. pas because it is simple and adequate for the data that we
want to store. Although the lengths of our names and addresses vary, the
variation is not so great that we cannot accommodate it in a fixed-length
record.
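Purely as an illustration (a sketch of ours, not the book's update program), the following C fragment shows the idea: record the byte offset of each record as it is written, then use the saved offset to seek directly to any record later.

    /* Sketch: direct access to variable-length records through a table
       of byte offsets kept as the file is written. */
    #include <stdio.h>

    #define MAX_RECS 1000

    long offsets[MAX_RECS];   /* byte offset of the start of each record */
    int  rec_count = 0;

    void remember_offset(FILE *fp)        /* call just before writing a record */
    {
        offsets[rec_count++] = ftell(fp);
    }

    int seek_to_record(FILE *fp, int rrn) /* position the file at record rrn   */
    {
        if (rrn < 0 || rrn >= rec_count)
            return -1;
        return fseek(fp, offsets[rrn], SEEK_SET);
    }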

Consider, however, the effects of using a fixed-length record organization to provide direct access to records that are documents ranging in length from a few hundred bytes to over a hundred kilobytes. Fixed-length records would be disastrously wasteful of space, so some form of variable-length record structure would have to be found. Developing file structures to handle such situations requires that you clearly distinguish between the matter of access and your options regarding organization.

The restrictions imposed by the language and file system used to develop your applications do impose limits on your ability to take advantage of this distinction between access method and organization. For example, the C language provides the programmer with the ability to implement direct access to variable-length records, since it allows access to any byte in the file. On the other hand, Pascal, even when seeking is supported, imposes limitations related to Pascal's definition of a file as a collection of elements that are all of the same type and, consequently, size. Since the elements must all be of the same size, direct access to variable-length records is difficult, at best, in Pascal.

4.5 Beyond Record Structures

Now that we have a grip on the concepts of organization and access, we look at some interesting new file organizations and more complex ways of accessing files. We want to extend the notion of a file beyond the simple idea of records and fields.

We begin with the idea of abstract data models. Our purpose here is to put some distance between the physical and the logical organization of files, to allow us to focus more on the information content of files and less on physical format.

4.5.1 Abstract Data Models

The history of file structures and file processing parallels the history of computer hardware and software. When file processing first became common on computers, magnetic tape and punched cards were the primary means used to store files, RAM space was dear, and programming languages were primitive. Programmers as well as users were compelled to view file data exactly as it might appear on a tape or cards: as a sequence of fields and records. Even after data was loaded into RAM, the tools for manipulating and viewing the data were unsophisticated and reflected the magnetic tape metaphor. Data processing meant processing fields and records in the traditional sense.

Gradually, computer users began to recognize that computers could process more than just fields and records. Computers could, for instance, process and transmit sound, and they could process and display images and documents (Fig. 4.13). These kinds of applications deal with information that does not fit nicely into the metaphor of data stored as sequences of records that are divided into fields, even if, ultimately, the data might be stored physically in the form of fields and records. It is easier, in the mind's eye, to envision data objects such as documents, images, and sound as objects that we manipulate in ways that are specific to the objects themselves, rather than simply as fields and records on a disk.

FIGURE 4.13 Data such as sound, images, and documents do not fit the traditional metaphor of data stored as sequences of records that are divided into fields.

The notion that we need not view data only as it appears on a particular medium is captured in the phrase abstract data model, a term that encourages an application-oriented view of data, rather than a medium-oriented view. The organization and access methods of abstract data models are described in terms of how an application views the data, rather than how the data might physically be stored.

One way that we can save a user from having to know about objects in a file is to keep information in the file that file-access software can use to "understand" those objects. A good example of how this might be done is to put file structure information in a header.

4.5.2 Headers and Self-Describing Files

We have seen how a header record can be used to keep track of how many records there are in a file. If our programming language permits it, we can put much more elaborate information about a file's structure in the header. When a file's header contains this sort of information, we say the file is self-describing. Suppose, for instance, that we store in a file the following information:

- A name for each field;
- The width of each field; and
- The number of fields per record.

We can now write a program that can read and print a meaningful display of files with any number of fields per record and any variety of fixed-length field widths. In general, the more file structure information we put into a file's header, the less our software needs to know about the specific structure of an individual file. As usual, there is a trade-off: If we do not hard-code the field and record structures of files in the programs that read and write them, the programs themselves must be more sophisticated. They must be flexible enough to interpret the self-descriptions that they find in the file headers.
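As an illustration (our own sketch, not a structure defined in the text), a C program might represent such a self-describing header this way; a general-purpose print program could read this structure from the front of the file before interpreting the records that follow.

    /* Sketch: a header that describes the fields of the records in the file. */
    #define MAX_FIELDS 20
    #define NAME_LEN   12

    struct file_header {
        short num_fields;                        /* fields per record     */
        char  field_name[MAX_FIELDS][NAME_LEN];  /* a name for each field */
        short field_width[MAX_FIELDS];           /* width of each field   */
    };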

4.5.3 Metadata

Suppose you are an astronomer interested in studying images generated by telescopes that scan the sky, and you want to design a file structure for the digital representations of these images (Fig. 4.14). You expect to have many images, perhaps thousands, that you want to study, and you want to store one image per file. While you are primarily interested in studying the images themselves, you will certainly need information about each image: where in the sky the image is from, when it was made, what telescope was used, references to related images, and so forth.

This kind of information is called metadata: data that describes the primary data in a file. Metadata can be incorporated into any file whose primary data requires supporting information. If a file is going to be shared by many users, some of whom might not otherwise have easy access to its metadata, it may be most convenient to store the metadata in the file itself. A common place to store metadata in a file is the header record.

FIGURE 4.14 To make sense of this two-Mbyte image, an astronomer needs such metadata as the kind of image it is, the part of the sky it is from, and the telescope that was used to view it. Astronomical metadata is often stored in the same file as the data itself. (This image shows polarized radio emission from the southern spiral galaxy NGC 5236 [M83] as observed with the Very Large Array radio telescope in New Mexico.)

Typically, a community of users of a particular kind of data agrees on a standard format for holding metadata. For example, a standard format called FITS (Flexible Image Transport System) has been developed by the International Astronomical Union for storing the kind of astronomical data just described in a file's header.* A FITS header is a collection of 2,880-byte blocks of 80-byte ASCII records, in which each record contains a single piece of metadata. Figure 4.15 shows part of a FITS header. In a FITS file, the header is followed by the actual numbers that describe the image, one number per observed point of the image.

Note that the designers of the FITS format chose to use ASCII in the header, but binary values for the image. ASCII headers are easy to read and process and, since they occur only once, take up relatively little extra space. Since the numbers that make a FITS image are rarely read by humans, but rather are first processed into a picture and then displayed, binary format is the preferred choice for them.

*For more details on FITS, see the references listed at the end of this chapter in "Further Readings."

    SIMPLE  =                    T / CONFORMS TO BASIC FORMAT
    BITPIX  =                   16 / BITS PER PIXEL
    NAXIS   =                    2 / NUMBER OF AXES
    NAXIS1  =                  256 / RA AXIS DIMENSION
    NAXIS2  =                  256 / DEC AXIS DIMENSION
    EXTEND  =                    F / T MEANS STANDARD EXTENSIONS EXIST
    BSCALE  =          0.000100000 / TRUE = TAPE*BSCALE +BZERO
    BZERO   =          0.000000000 / OFFSET TO TRUE PIXEL VALUES
    MAP_TYPE= 'REL_EXPOSURE'       / INTENSITY OR RELATIVE EXPOSURE MAP
    BUNIT   =                      / DIMENSIONLESS PEAK EXPOSURE FRACTION
    CRVAL1  =                0.625 / RA REF POINT VALUE (DEGREES)
    CRPIX1  =              128.500 / RA REF POINT PIXEL LOCATION
    CDELT1  =         -0.006666700 / RA INCREMENT ALONG AXIS (DEGREES)
    CTYPE1  = 'RA---TAN'           / RA TYPE
    CROTA1  =                0.000 / RA ROTATION
    CRVAL2  =               71.967 / DEC REF POINT VALUE (DEGREES)
    CRPIX2  =              128.500 / DEC REF POINT PIXEL LOCATION
    CDELT2  =          0.006666700 / DEC INCREMENT ALONG AXIS (DEGREES)
    CTYPE2  = 'DEC--TAN'           / DEC TYPE
    CROTA2  =                0.000 / DEC ROTATION
    EPOCH   =               1950.0 / EPOCH OF COORDINATE SYSTEM
    ARR_TYPE=                    4 / 1=DP, 3=FP, 4=I
    DATAMAX =                1.000 / PEAK INTENSITY (TRUE)
    DATAMIN =                0.000 / MINIMUM INTENSITY (TRUE)
    ROLL_ANG=              -22.450 / ROLL ANGLE (DEGREES)
    BAD_ASP =                      / 0=good, 1=bad(Do not use roll angle)
    TIME_LIV=               5649.6 / LIVE TIME (SECONDS)
    OBJECT  = 'REM6791'            / SEQUENCE NUMBER
    AVGOFFY =                1.899 / AVG Y OFFSET IN PIXELS, 8 ARCSEC/PIXEL
    AVGOFFZ =                2.578 / AVG Z OFFSET IN PIXELS, 8 ARCSEC/PIXEL
    RMSOFFY =                0.083 / ASPECT SOLN RMS Y PIXELS, 8 ARCSC/PIX
    RMSOFFZ =                0.204 / ASPECT SOLN RMS Z PIXELS, 8 ARCSC/PIX
    TELESCOP= 'EINSTEIN'           / TELESCOPE
    INSTRUME= 'IPC'                / FOCAL PLANE DETECTOR
    OBSERVER= 2                    / OBSERVER #: 0=CFA; 1=CAL; 2=MIT; 3=GSFC
    GALL    =              119.370 / GALACTIC LONGITUDE OF FIELD CENTER
    GALB    =                9.690 / GALACTIC LATITUDE OF FIELD CENTER
    DATE_OBS= '80/238'             / YEAR & DAY NUMBER FOR OBSERVATION START
    DATE_STP= '80/238'             / YEAR & DAY NUMBER FOR OBSERVATION STOP
    TITLE   = 'SNR SURVEY: CTA1'
    ORIGIN  = 'HARVARD-SMITHSONIAN CENTER FOR ASTROPHYSICS'
    DATE    = '22/09/1989'         / DATE FILE WRITTEN
    TIME    = '05:26:53'           / TIME FILE WRITTEN
    END

FIGURE 4.15 Sample FITS header. On each line, the data to the left of the '/' is the actual metadata (data about the raw data that follows in the file). For example, the second line ("BITPIX = 16") indicates that the raw data in the file will be stored in 16-bit integer format. Everything to the right of a '/' is a comment, describing for the reader the meaning of the metadata that precedes it. Even a person uninformed about the FITS format can learn a great deal about this file just by reading through the header.
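To make the "keyword = value / comment" layout concrete, here is a rough C sketch (ours, based only on the sample header above, not on the full FITS specification) of pulling one 80-byte header record apart:

    /* Sketch: split an 80-byte header record into keyword, value, comment. */
    #include <stdio.h>
    #include <string.h>

    int parse_header_record(const char rec[80],
                            char *keyword, char *value, char *comment)
    {
        char line[81];
        char *eq, *slash;

        memcpy(line, rec, 80);
        line[80] = '\0';
        keyword[0] = value[0] = comment[0] = '\0';

        eq    = strchr(line, '=');    /* keyword ends at the '='          */
        slash = strchr(line, '/');    /* comment, if any, follows the '/' */
        if (eq == NULL)
            return -1;                /* e.g., the END record             */

        *eq = '\0';
        if (slash != NULL)
            *slash = '\0';

        sscanf(line, "%s", keyword);            /* trim surrounding blanks */
        sscanf(eq + 1, " %[^\n]", value);
        if (slash != NULL)
            sscanf(slash + 1, " %[^\n]", comment);
        return 0;
    }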

A FITS image is a good example of an abstract data model. The data itself is meaningless without the interpretive information contained in the header, and FITS-specific methods must be employed to convert FITS data into an understandable image. Another example is the raster image, which we look at next.

4.5.4 Color Raster Images

From a user's point of view, a modern computer is as much a graphical device as it is a data processor. Whether we are working with documents, spreadsheets, or numbers, we are likely to be viewing and storing pictures in addition to whatever other information we work with. Let's examine one type of image, the color raster image, as a means of filling in our conceptual understanding of data objects.

A color raster image is a rectangular array of colored dots, or pixels,* that are displayed on a screen. A FITS image is a raster image in the sense that the numbers that make up a FITS image can be converted to colors and then displayed on a screen. There are many different kinds of metadata that can go with a raster image, including

- The dimensions of the image: the number of pixels per row and the number of rows.
- The number of bits used to describe each pixel. This determines how many colors can be associated with each pixel. A 1-bit image can display only two colors, usually black and white. A 2-bit image can display four colors (2^2), an 8-bit image can display 256 colors (2^8), and so forth.
- A color lookup table, or palette, indicating which color is to be assigned to each pixel value in the image. A 2-bit image uses a color lookup table with 4 colors, an 8-bit image uses a table with 256 colors, and so forth.

If we think of an image as an abstract data type, what are some methods that we might associate with images? There are the usual ones associated with getting things in and out of a computer: a read_image routine and a store_image routine. Then there are those that deal with images as special objects; for example,

- Display an image in a window on a console screen;
- Associate an image with a particular color lookup table;
- Overlay one image onto another to produce a composite image; and
- Display several images in succession, producing an animation.

*Pixel stands for "picture element."
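If we sketch this in C (our own illustration; only read_image and store_image are named in the text), the image as an abstract data type might look like the following, with the special-purpose methods declared alongside the usual input and output routines.

    /* Sketch: a color raster image as an abstract data type. */
    struct raster_image {
        int            width, height;     /* pixels per row, number of rows */
        int            bits_per_pixel;    /* 1, 2, 8, ...                   */
        unsigned char  palette[256][3];   /* color lookup table (RGB)       */
        unsigned char *pixels;            /* the raster data itself         */
    };

    /* getting images in and out of the computer */
    int read_image (const char *filename, struct raster_image *img);
    int store_image(const char *filename, const struct raster_image *img);

    /* methods that treat images as special objects */
    void display_image   (const struct raster_image *img);
    void set_lookup_table(struct raster_image *img,
                          const unsigned char palette[256][3]);
    void overlay_images  (struct raster_image *dest,
                          const struct raster_image *src);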

The color raster image is an example of a type of data object that requires more than the traditional field/record file structure. This is particularly true when more than one image might be stored in a single file, or when we want to store a document or other complex object together with images in a file. Let's look at some ways to mix object types in one file.

4.5.5 Mixing Object Types in One File

Keywords  The FITS header (Fig. 4.15) described earlier illustrates an important technique for identifying fields and records: the use of keywords. In the case of FITS headers, we do not know what fields are going to be contained in any given header, so we identify each field using a "keyword = value" format.

Why does this format work for FITS files, whereas it was inappropriate for our address file? For the address file, we saw that the use of keywords demanded a high price in terms of space, possibly even doubling the size of the file. In FITS files the amount of overhead introduced by keywords is quite small. When the image is included, the FITS file in the example contains approximately 2 megabytes. The keywords in the header occupy a total of about 400 bytes, or about 0.02% of the total file space.

Tags  With the addition via keywords of file structure information and metadata to the header, we see that a file can be more than just a collection of repeated fields and records. Can we extend this notion beyond the header to other, more elaborate objects? For example, suppose an astronomer would like to store several FITS images of different sizes in a file, together with the usual metadata, plus perhaps lab notes describing what the scientist learned from the image (Fig. 4.16). Now we can think of our file as a mixture of objects that may be very different in content, a mixture that our previous file structures do not handle well. Maybe we need a new kind of file structure.

FIGURE 4.16 Information that an astronomer wants to include in a file.

There are many ways to address this new file design problem. One would be simply to put each type of object into a variable-length record and write our file processing programs so they know what each record looks like: The first record is a header for the first image; the second record is the first image; the third record is a document; the fourth is the header for the second image; and so forth.

This solution is workable and simple, but it has some familiar drawbacks:

- Objects must be accessed sequentially, making access to individual images in large files time consuming.
- The file must contain exactly the objects that are described, in exactly the order indicated. We could not, for instance, leave out the notebook for some of the images (or in some cases leave out the notebook altogether) without rewriting all programs that access the file to reflect the changes in the file's structure.

A solution to these problems is hinted at in the FITS header: Each line begins with a keyword that identifies the metadata field that follows in the line. Why not use keywords to identify all objects in the file, not just the fields in the headers, but the headers themselves, as well as the images and any other objects we might need to store? Unfortunately, the "keyword = data" format makes sense in a FITS header, where it is short and fits easily in an 80-byte line, but it doesn't work at all for objects that vary enormously in size and content. Fortunately, we can generalize the keyword idea to address these problems by making two changes:

- Lift the restriction that each record be 80 bytes, and let it be big enough to hold the object that is referenced by the keyword.
- Place the keywords in an index table, together with the byte offset of the actual metadata (or data) and a length indicator that indicates how many bytes the metadata (or data) occupies in the file.

The term tag is commonly used in place of keyword in connection with this type of file structure. The resulting structure is illustrated in Fig. 4.17. In it, we encounter two important conceptual tools for file design: (1) the use of an index table to hold descriptive information about the primary data, and (2) the use of tags to distinguish different types of objects. These tools allow us to store in one file a mixture of objects that can vary from one another in structure and content.

FIGURE 4.17 Same as Fig. 4.16, except with tags identifying the objects.

Tag structures are common among standard file formats in use today. For example, a structure called TIFF (Tagged Image File Format) is a very popular tagged file format used for storing images. HDF (Hierarchical Data Format) is a standard tagged structure used for storing many different kinds of scientific data, including images. In the world of document storage and retrieval, SGML (Standard Generalized Markup Language) is a language for describing document structures and for defining tags used to mark up that structure. Like FITS, each of these provides an interesting study in file design and standardization. References to further information on each are provided at the end of this chapter, in "Further Readings."
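One way to represent such an index table in C is sketched below; the layout is our own illustration of the idea in Fig. 4.17, not a defined standard.

    /* Sketch: an index table whose entries tag and locate each object. */
    #define TAG_LEN     8
    #define MAX_OBJECTS 100

    struct index_entry {
        char tag[TAG_LEN];   /* kind of object: "header", "image", "notes", ... */
        long byte_offset;    /* where the object begins in the file             */
        long length;         /* how many bytes the object occupies              */
    };

    struct index_table {
        int                entry_count;
        struct index_entry entries[MAX_OBJECTS];
    };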

Accessing Files with Mixtures of Data Objects  The idea of allowing files to contain widely varying objects is compelling, especially for applications that require large amounts of metadata or unpredictable mixes of different kinds of data, for it frees us of the requirement that all records be fundamentally the same. As usual, we must ask what this freedom costs us. To gain some insight into the costs, imagine that you want to write a program to access objects in such a file. You now have to read and write tags as well as data, and the structure and format for different data types are likely to be different. Here are some questions you will have to answer almost immediately:

- When we want to read an object of a particular type, how do we search for the object?
- When we want to store an object in the file, how and where do we store its tag, and where exactly do we put the object?
- Given that different objects will have very different appearances within a file, how do we determine the correct method for storing or retrieving the object?

The first two questions have to do with accessing the table that contains the tags and pointers to the objects. Solutions to this problem are dealt with in detail in Chapter 6, so we defer their discussion until then. The third question, how to determine the correct methods for accessing objects, has implications that we briefly touch on here.

4.5.6 Object-oriented File Access

We have used the term abstract data model to describe the view that an application has of a data object. This is essentially an in-RAM, application-oriented view of an object, one that ignores the physical format of objects as they are stored in files. Taking this view of objects buys our software two things:

- It delegates to separate modules the responsibility of translating to and from the physical format of the object, letting the application modules concentrate on the task at hand. (For example, an image processing program that can operate in RAM on 8-bit images should not have to worry about the fact that a particular image comes from a file that uses the 32-bit FITS format.)
- It opens up the possibility of working with objects that at some level fit the same abstract data model, even though they are stored in different formats. (The in-RAM representations of the images could be identical, even though they come from files with quite different formats.)

File access that exploits these possibilities could be called object-oriented access, emphasizing the parallels between it and the well-known object-oriented programming paradigm.

As an example that illustrates both points, suppose you have an image processing application program (we'll call it find_star) that operates in RAM on 8-bit images, and you need to process a collection of images. Some are stored in FITS files in a FITS format and some in TIFF files in a different format. An object-oriented approach (Fig. 4.18) would provide the application program with a routine (let's call it read_image( )) for reading images into RAM in the expected 8-bit form, letting the application concentrate on the image processing task. For its part, the routine read_image( ), given a file to get an image from, determines the format of the image within the file, invokes the proper procedure to read the image in that format, and converts it from that format into the 8-bit RAM format that the application needs.
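A sketch of how such a routine might be organized in C follows; the helper routines (detect_format, read_fits, and so on) are hypothetical, and raster_image is the image structure sketched earlier. The point is only that format detection and conversion are hidden from find_star.

    /* Sketch: read_image( ) dispatches on the file format and converts
       whatever it finds into the 8-bit in-RAM form the caller expects. */
    struct raster_image;                      /* the in-RAM image structure */

    enum image_format { FORMAT_UNKNOWN, FORMAT_FITS, FORMAT_TIFF };

    enum image_format detect_format(const char *filename);  /* hypothetical */
    int  read_fits (const char *filename, struct raster_image *img);
    int  read_tiff (const char *filename, struct raster_image *img);
    void fits_to_8bit(struct raster_image *img);
    void tiff_to_8bit(struct raster_image *img);

    int read_image(const char *filename, struct raster_image *img)
    {
        switch (detect_format(filename)) {   /* e.g., look at a magic number */
        case FORMAT_FITS:
            if (read_fits(filename, img) < 0) return -1;
            fits_to_8bit(img);               /* reduce FITS values to 8 bits */
            return 0;
        case FORMAT_TIFF:
            if (read_tiff(filename, img) < 0) return -1;
            tiff_to_8bit(img);
            return 0;
        default:
            return -1;                       /* a format we cannot handle    */
        }
    }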

Tagged file formats are one way to implement this conceptual view of file organization and file access. The specification of a tag can be accompanied by a specification of methods for reading, writing, and otherwise manipulating the corresponding data object according to the needs of an application. Indeed, any specification that separates the definition of the abstract data model from that of the corresponding file format lends itself to the object-oriented approach.

    program find_star
    . . .
    read_image ("star1", image)
    process image
    . . .
    end find_star

FIGURE 4.18 Example of object-oriented access. The program find_star knows nothing about the file format of the image that it wants to read. The routine read_image has methods to convert the image from whatever format it is stored in on disk into the 8-bit in-RAM format required by find_star.

4.5.7 Extensibility

One of the advantages of using tags to identify objects within files is that we do not have to know a priori what all of the objects that our software may eventually have to deal with will look like. We have just seen that if our program is to be able to access a mixture of objects in a file, it must have methods for reading and writing each object. Once we build into our software a mechanism for choosing the appropriate methods for a given type of object, it is easy to imagine extending, at some future time, the types of objects that our software can support. Every time we encounter a new type of object that we would like to accommodate in our files, we can implement methods for reading and writing that object and add those methods to the repertoire of methods available to our file processing software.
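One way to build such a repertoire, sketched here in C purely for illustration (the routine names are hypothetical), is a table that maps each known tag to the methods that read and write that kind of object; supporting a new object type then amounts to adding one row to the table.

    /* Sketch: a table of methods, one row per kind of tagged object. */
    #include <stdio.h>

    int read_fits_header (FILE *fp, void *obj);   /* hypothetical routines */
    int write_fits_header(FILE *fp, void *obj);
    int read_raster      (FILE *fp, void *obj);
    int write_raster     (FILE *fp, void *obj);
    int read_notes       (FILE *fp, void *obj);
    int write_notes      (FILE *fp, void *obj);

    struct object_methods {
        char  tag[8];                              /* object type handled */
        int (*read_object) (FILE *fp, void *obj);  /* method for reading  */
        int (*write_object)(FILE *fp, void *obj);  /* method for writing  */
    };

    struct object_methods repertoire[] = {
        { "header", read_fits_header, write_fits_header },
        { "image",  read_raster,      write_raster      },
        { "notes",  read_notes,       write_notes       },
        /* a new kind of object means one more row here */
    };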

4.6 Portability and Standardization

A recurring theme in several of the examples that we have just seen is the idea that people often want to share files. Sharing files means making sure that they are accessible on all of the different computers that they might turn up on, and that they are somehow compatible with all of the different programs that will access them. In this final section, we look at two complementary topics that affect the sharability of files: portability and standardization.

4.6.1 Factors Affecting Portability

Imagine that you work for a company that wishes to share simple data files such as our address file with some other business. You get together with the other business to agree on a common field and record format, and you discover that your business does all of its programming and computing in C on a Sun computer and the other business uses Turbo Pascal on an IBM PC. What sorts of issues would you expect to arise?

Differences among Operating Systems  In Chapter 2, in the section "Unexpected Characters in Files," we saw that MS-DOS adds an extra linefeed character every time it encounters a carriage return character, whereas on most other file systems this is not the case. This means that every time our address file has a byte with hex value 0x0d, whether or not that byte is meant to be a carriage return, the file is extended by an extra 0x0a byte.

This example illustrates the fact that the ultimate physical format of the same logical file can vary depending on differences among operating systems.

Differences among Languages  Earlier in this chapter, when discussing header records, we chose to make our C header 32 bytes, but we were forced to make our Pascal header 64 bytes. C allows us to mix and match fixed record lengths according to our needs, but Pascal requires that all records in a nontext file be the same size.

This illustrates a second factor impeding portability among files: The physical layout of files produced with different languages may be constrained by the way the languages let you define structures within a file.

Differences in Machine Architectures  Consider again the header record that we produce in the C version of our address file. The hex dump of the file (Fig. 4.13), which was generated using C on a Sun 3 computer, shows this header record in the first line:

    0000000   0020 0000 0000 0000 0000 0000 0000 0000

The first two bytes contain the number of records in the file, in this case 20 (hexadecimal), or 32 (decimal). If the same C program is compiled and executed on an IBM PC or a VAX, the hex dump of the header record will look like this:

    0000000   2000 0000 0000 0000 0000 0000 0000 0000

Why are the bytes reversed in this version of the program? The answer is that in both cases the numbers were written to the file exactly as they appeared in RAM, and the two different machines represent two-byte integers differently: the Sun stores the high-order byte, followed by the low-order byte; the IBM PC and VAX store the low-order byte, followed by the high-order byte.
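As a simple illustration (not code from the text), a program can sidestep this problem for the data it writes by moving multibyte numbers into the file one byte at a time in an agreed order, rather than copying them from RAM as is:

    /* Sketch: write a two-byte integer with the high-order byte first,
       whatever the byte order of the machine running the program. */
    #include <unistd.h>

    void write_short_high_first(int fd, unsigned short n)
    {
        unsigned char bytes[2];

        bytes[0] = (n >> 8) & 0xff;   /* high-order byte first */
        bytes[1] = n & 0xff;          /* low-order byte second */
        write(fd, bytes, 2);
    }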

This reverse order also applies to four-byte integers on these machines. For example, in our discussion of file dumps we saw that the hexadecimal value of 500,000,000 (decimal) is 1dcd6500. If you write this value out to a file on an IBM PC, or some other reverse-order machine, a hex dump of the file created looks like this:

    0000000   0065 cd1d

The problem of data representation is not restricted only to binary numbers. The way structures, such as C structs or Pascal records, are laid out in RAM can vary from machine to machine and compiler to compiler. For example, suppose you have a C program containing the following lines of code:

    struct {
        int  cost;
        char ident[4];
    } item;
    . . .
    write (fd, &item, sizeof(item));

and you want to write files using this code on two different machines, a Cray 2 and a Sun 3.

Because it likes to operate on 64-bit words, Cray's C compiler allocates a minimum of eight bytes for any element in a struct, so it allocates 16 bytes for the struct item. When it executes the write( ) statement, then, the Cray writes 16 bytes to the file. The same program compiled on a Sun 3 writes only eight bytes, as you probably would expect, and on most IBM PCs it writes six bytes: same exact program; same language; three different results.

Text is also encoded differently on different platforms. In this case the differences are primarily restricted to two different types of systems: those that use EBCDIC* and those that use ASCII. EBCDIC is a standard created by IBM, so machines that need to maintain compatibility with IBM must support EBCDIC. Most others support ASCII. A few support both. Hence, text written to a file from an EBCDIC-based machine may well not be readable by an ASCII-based machine.

Equally serious, when we go beyond simple English text, is the problem of representing different character sets from different national languages. This is an enormous problem for developers of text databases.

*EBCDIC stands for Extended Binary Coded Decimal Interchange Code.

4.6.2 Achieving Portability

Differences among languages, operating systems, and machine architectures represent three major problems when we need to generate portable files. Achieving portability means determining how to deal with these differences. And the differences are often not just differences between two platforms, for many different platforms could be involved.

The most important requirement for achieving portability is to recognize that it is not a trivial matter and to take steps ahead of time to insure it. Here are some guidelines.

Agree on a Standard Physical Record Format and Stay with It  A physical standard is one that is represented the same physically, no matter what language, machine, or operating system is used. FITS is a good example of a physical standard, for it specifies exactly the physical format of each header record, the keywords that are allowed, the order in which keywords may appear, and the bit pattern that must be used to represent the binary numbers that describe the image.

Unfortunately, once a standard is established, it is very tempting to "improve" on it by changing it in some way, thereby rendering it no longer a standard. If the standard is sufficiently extensible, this temptation can sometimes be avoided. FITS, for example, has been extended a few times over its lifetime to support data objects that were not anticipated in its original design, yet all additions have remained compatible with the original format.

One way to make sure that a standard has staying power is to make it simple enough that files can be written in the standard format from a wide range of machines, languages, and operating systems. FITS again exemplifies such a standard. FITS headers are ASCII 80-byte records in blocks of 36 records each, and FITS images are stored as one contiguous block of numbers, both very simple structures that are easy to read and write in most modern operating systems and languages.

Agree on a Standard Binary Encoding for Data Elements  The two most common types of basic data elements are text and numbers. In the case of text, ASCII and EBCDIC represent the most common encoding schemes, with ASCII standard on virtually all machines except IBM mainframes. Depending on the anticipated environment, one of these should be used to represent all text.*

The situation for binary numbers is a little cloudier. Although the number of different encoding schemes is not large, the likelihood of having to share data among machines that use different binary encodings can be quite high, especially when the same data is processed both on large mainframes and on smaller computers. Two standards efforts have helped diminish the problem, however: IEEE Standard formats, and External Data Representation (XDR).

IEEE has established standard format specifications for 32-bit, 64-bit, and 128-bit floating point numbers, and for 8-bit, 16-bit, and 32-bit integers. With a few notable exceptions (e.g., IBM mainframes, Cray, and Digital), most computer manufacturers have followed these guidelines in designing their machines. This effort goes a long way toward providing portable number encoding schemes.

XDR is an effort to go the rest of the way. XDR specifies not only a set of standard encodings for all files (the IEEE encodings), but provides for a set of routines for each machine for converting from its binary encoding when writing to a file, and vice versa (Fig. 4.19). Hence, when we want to store numbers in XDR, we can read or write them by replacing the read and write routines in our program with XDR routines. The XDR routines take care of the conversions.**

*Actually, there are different versions of both ASCII and EBCDIC. However, for most applications, and for the purposes of this text, it is sufficient to consider each as a single character set.

**XDR is used for more than just number conversions. It allows a C programmer to describe arbitrary data structures in a machine-independent fashion. XDR originated as a Sun protocol for transmitting data that is accessed by more than one type of machine. For further information, see Sun (1986 or later).

    XDR_float (&xdrs, &x)

FIGURE 4.19 XDR specifies a standard external data representation for numbers stored in a file. XDR routines are provided for converting to and from the XDR representation and the encoding scheme used on the host machine. Here a routine called XDR_float( ) translates a 32-bit floating point number (such as x = 234.5 in RAM) from its XDR representation on disk to that of the host machine.

Once again, FITS provides us with an excellent example: The binary numbers that constitute a FITS image must conform to the IEEE Standard. Any program written on a machine with XDR support can thus read and write portable FITS files.
write portable FITS files.

Number and Text Conversion  Sometimes the use of standard data encodings is not feasible. For example, suppose you are working primarily on IBM mainframes with software that deals with floating point numbers and text. If you choose to store your data in IEEE Standard formats, every time your program reads or writes a number or character it must translate the number from the IBM format to the corresponding IEEE format. This is not only time-consuming but can result in loss of accuracy. It is probably better in this case to store your data in native IBM format in your files.

What happens, then, when you want to move your files back and forth between your IBM and a VAX, which uses a different native format for numbers and generally uses ASCII for text? You need a way to convert from the IBM format to the VAX format and back. One solution is to write (or borrow) a program that translates IBM numbers and text to their VAX equivalents, and vice versa. This simple solution is illustrated in Fig. 4.20(a).

But what if, in addition to IBM and VAX computers, you find that your data is likely to be shared among many different platforms that use different numeric encodings? One way to solve this problem is to write a program to convert from each of the representations to every other representation. This solution, illustrated in Fig. 4.20(b), can get rather complicated. In general, if you have n different encoding schemes, you will need n(n - 1) different translators. (Why?) If n is large, this can be very messy. Not only do you need many translators, but you need to keep track, for each file, of where the file came from and/or where it is going in order to know which translator to use.

In this case, a better solution would probably be to agree on a standard intermediate format, such as XDR, and translate files into XDR whenever they are to be exported to a different platform. This solution is illustrated in Fig. 4.20(c). Not only does it cut down the number of translators from n(n - 1) to 2n, but it should be easy to find translators to convert from most platforms to and from XDR. One negative aspect of this solution is that it requires two conversions to go from any one platform to another, a cost that has to be weighed against the complexity of providing n(n - 1) translators.

File Structure Conversion  Suppose you are a doctor and you have X-ray raster images of a particular organ taken periodically over several minutes. You want to look at a certain image in the collection using a program that lets you zoom in and out and detect special features in the image. You have another program that lets you animate the collection of images, showing how it changes over several minutes. Finally, you want to annotate the images and store them in a special X-ray archive, and you have another program for doing that. What do you do if each of these three programs requires that your image be in a different format?

The conversion problems that apply to atomic data encodings also apply to file structures for more complex objects, like images, but at a different level. Whereas character and number encodings are tied closely to specific platforms, more complex objects and their representations just as often are tied to specific applications.

For example, there are many software packages that deal with images, and very little agreement about a file format for storing them. When we look at this software, we find different solutions to this problem:

- Require that the user supply images in a format that is compatible with the one used by the package. This places the responsibility on the user to convert from one format to another. For such situations, it may be preferable to provide utility programs that translate from one format to another and that are invoked whenever translating is needed.
- Process only images that adhere to some predefined standard format. This places the responsibility on a community of users and software developers for agreeing on and enforcing a standard. FITS is a good example of this approach.
- Include different sets of I/O methods capable of converting an image from several different formats into a standard RAM structure that the package can work with. This places the burden on the software developer to develop I/O methods for file object types that may be stored differently but for the purposes of an application are conceptually the same. You may recognize this approach as a variation on the concept of object-oriented access that we discussed earlier.

FIGURE 4.20 Direct conversion between n native machine formats requires n(n - 1) conversion routines, as illustrated in (a) and (b); conversion via an intermediate standard format requires 2n conversion routines, as illustrated in (c). (a) Converting between IBM and VAX native formats requires two conversion routines. (b) Converting directly between five different native formats requires 20 conversion routines. (c) Converting between five different native formats via an intermediate standard format (XDR) requires 10 conversion routines.

File System Differences  Finally, if you move files from one file system to another, chances are you will find differences in the way files are organized physically. For example, UNIX systems write files to tapes in 512-byte blocks, but non-UNIX systems often use different block sizes, such as 2,880 bytes: thirty-six 80-byte records. (Guess where the FITS blocking format comes from?) When transferring files between systems, you may need to deal with this problem.
with

and Portability

problem just described,

this

transferring

files

Recognizing problems such

UNIX

provides a utility called

intended primarily for copying tape data to and from

as the block-size
dd.

Although dd

UNIX systems,

be used to convert data from any physical source. The dd


following options, among others:

utility

it

is

can

provides the

Convert from one block size to another;


Convert fixed-length records to variable length, or vice versa;
Convert ASCII to EBCDIC, or vice versa;
Convert all characters to lowercase (or to uppercase); and
Swap every pair of bytes.

Of course, the greatest contribution UNIX makes to the problems discussed here is UNIX itself. By its simplicity and ubiquity, UNIX encourages the use of the same operating system, the same file system, the same views of devices, and the same general views of file organization, no matter what particular hardware platform you happen to be using.

For example, one of the authors works in an organization with a nationwide constituency that operates many different computers, including two Crays, a Connection Machine, and many Sun, Apple, IBM, Silicon Graphics, and Digital workstations. Because each runs some flavor of UNIX, they all incorporate precisely the same view of all external storage devices, they all use ASCII, and they all provide the same basic programming environment and file management utilities. Files are not perfectly portable within this environment, for reasons that we have covered in this chapter, but the availability of UNIX goes a long way toward facilitating the rapid and easy transfer of files among the applications, programming environments, and hardware systems that the organization supports.

SUMMARY

The lowest level of organization that we normally impose on a file is a stream of bytes. Unfortunately, by storing data in a file merely as a stream of bytes, we lose the ability to distinguish among the fundamental informational units of our data. We call these fundamental pieces of information fields. Fields are grouped together to form records. Recognizing fields and recognizing records requires that we impose structure on the data in the file.

There are many ways to separate one field from the next and one record from the next:

- Fix the length of each field or record.
- Begin each field or record with a count of the number of bytes that it contains.
- Use delimiters to mark the divisions between entities.

In the case of fields, another useful technique is to use a "keyword = value" form to identify fields. In the case of records, another useful technique is to use a second, index file that tells where each record begins.

One higher level of organization, in which records are grouped into blocks, is also often imposed on files. This level is imposed to improve I/O performance rather than our logical view of the file.

In this chapter we use the record structure that uses a length indicator at the beginning of each record to develop programs for writing and reading a simple file of variable-length records containing names and addresses of individuals. We use buffering to accumulate the data in an individual record before we know its length to write it to the file. Buffers are also useful in allowing us to read in a complete record at one time. We represent the length field of each record as a binary number or as a sequence of ASCII digits. In the former case, it is useful to use a file dump to examine the contents of our file.

Sometimes we identify individual records by their relative record numbers (RRNs) in a file. It is also common, however, to identify a record by a key whose value is based on some of the record's content. Key values must occur in, or be converted to, some predetermined canonical form if they are to be recognized accurately and unambiguously by programs. If every record's key value is distinct from all others, the key can be used to identify and locate the unique record in the file. Keys that are used in this way are called primary keys.

In this chapter we look at the technique of searching sequentially through a file looking for a record with a particular key. Sequential search can perform poorly for long files, but there are times when sequential searching is reasonable. Record blocking can be used to improve the I/O time for a sequential search substantially. Two useful UNIX utilities that process files sequentially are wc and grep.

In our discussion of ways to separate records, it is clear that some of the methods provide a mechanism for looking up or calculating the byte offset of the beginning of a record. This, in turn, opens up the possibility of accessing the record directly, by RRN, rather than sequentially.

The simplest record formats for permitting direct access by RRN involve the use of fixed-length records. When the data itself actually comes in fixed-size quantities (e.g., zip codes), fixed-length records can provide good performance and good space utilization. If there is a lot of variation in the amount and size of data in records, however, the use of fixed-length records can result in expensive waste of space. In such cases the designer should look carefully at the possibility of using variable-length records.

Sometimes it is helpful to keep track of general information about files, such as the number of records they contain. A header record, stored at the beginning of the file it pertains to, is a useful tool for storing this kind of information.

It is important to be aware of the difference between file access and file organization. We try to organize files in such a way that they give us the types of access we need for a particular application. For example, one of the advantages of a fixed-length record organization is that it allows access that is either sequential or direct.

In addition to the traditional view of a file as a more or less regular collection of fields and records, we present a more purely logical view of the contents of files in terms of abstract data models, a view that lets applications ignore the physical structure of files altogether.

This view is often more appropriate to data objects such as sound, images, and documents. We call files self-describing when they do not require an application to reveal their structure, but provide that information themselves. Another concept that deviates from the traditional view is metadata, in which the file contains data that describe the primary data in the file. FITS files, used for storing astronomical images, contain extensive headers with metadata.

The use of abstract data models, self-describing files, and metadata makes it possible to mix a variety of different types of data objects in one file. When this is the case, file access is more object oriented. Abstract data models also facilitate extensible files: files whose structures can be extended to accommodate new kinds of objects.

Portability becomes increasingly important as files are used in more heterogeneous computing environments. Differences among operating systems, languages, and machine architectures all lead to the need for portability. One important way to foster portability is standardization, which means agreeing on physical formats, encodings for data elements, and file structures.

If a standard does not exist and it becomes necessary to convert from one format to another, it is still often much simpler to have one standard format that all converters convert into and out of. UNIX provides a utility called dd that facilitates data conversion. The UNIX environment itself supports portability simply by being commonly available on a large number of platforms.

KEY TERMS

Block. A collection of records stored as a physically contiguous unit on secondary storage. In this chapter, we use record blocking to improve I/O performance during sequential searching.

Byte count field. A field at the beginning of a variable-length record that gives the number of bytes used to store the record. The use of a byte count field allows a program to transmit (or skip over) a variable-length record without having to deal with the record's internal structure.

Canonical form. A standard form for a key that can be derived, by the application of well-defined rules, from the particular, nonstandard form of the data found in a record's key field(s) or provided in a search request supplied by a user.

Delimiter. One or more characters used to separate fields and records in a file.

Direct access. A file accessing mode that involves jumping to the exact location of a record. Direct access to a fixed-length record is usually accomplished by using its relative record number (RRN), computing its byte offset, and then seeking to the first byte of the record.

Extensibility. A characteristic of some file organizations that makes it possible to extend the types of objects that the format can accommodate without having to redesign the format. For example, tagged file formats lend themselves to extensibility, for they allow the addition of new tags for new data objects and associated new methods for accessing the objects.

Field. The smallest logically meaningful unit of information in a file. A record in a file is usually made up of several fields.

File-access method. The approach used to locate information in a file. In general, the two alternatives are sequential access and direct access.

File organization method. The combination of conceptual and physical structures used to distinguish one record from another and one field from another. An example of one kind of file organization is fixed-length records containing variable numbers of variable-length delimited fields.

Fixed-length record. A file organization in which all records have the same length. Records are padded with blanks, nulls, or other characters so they extend to the fixed length. Since all the records have the same length, it is possible to calculate the beginning position of any record, making direct access possible.

Header record. A record placed at the beginning of a file that is used to store information about the file contents and the file organization.

Key. An expression derived from one or more of the fields within a record that can be used to locate that record. The fields used to build the key are sometimes called the key fields. Keyed access provides a way of performing content-based retrieval of records, rather than retrieval based merely on a record's position.

Metadata. Data in a file that is not the primary data, but describes the primary data in a file. Metadata can be incorporated into any file whose primary data requires supporting information. If a file is going to be shared by many users, some of whom might not otherwise have easy access to its metadata, it may be most convenient to store the metadata in the file itself. A common place to store metadata in a file is the header record.

Object-oriented file access. A form of file access in which applications access data objects in terms of the applications' in-RAM view of the objects. Separate methods associated with the objects are responsible for translating to and from the physical format of the object, letting the application concentrate on the task at hand.

Portability. That characteristic of files that describes how amenable they are to access on a variety of different machines, via a variety of different operating systems, languages, and applications.

Primary key. A key that uniquely identifies each record and that is used as the primary method of accessing the records.

Record. A collection of related fields. For example, the name, address, etc. of an individual in a mailing list file would probably make up one record.

Relative record number (RRN). An index giving the position of a record relative to the beginning of its file. If a file has fixed-length records, the RRN can be used to calculate the byte offset of a record so the record can be accessed directly.

Self-describing files. Files that contain information such as the number of records in the file and formal descriptions of the file's record structure, which can be used by software in determining how to access the file. A file's header is a good place for this information.

Sequential access. Sequential access to a file means reading the file from the beginning and continuing until you have read in everything that you need. The alternative is direct access.

Sequential search. A method of searching a file by reading the file from the beginning and continuing until the desired record has been found.

Stream of bytes. Term describing the lowest-level view of a file. If we begin with the basic stream-of-bytes view of a file, we can then impose our own higher levels of order on the file, including field, record, and block structures.

Variable-length record. A file organization in which the records have no predetermined length. They are just as long as they need to be, hence making better use of space than fixed-length records do. Unfortunately, we cannot calculate the byte offset of a variable-length record by knowing only its relative record number.

EXERCISES
1.

Find situations for which each of the four

the text

might be appropriate.

field structures

described in

Do the same for each of the record structures

described.
2.

Discuss the appropriateness of using the following characters to delimit

fields or records: carriage return, linefeed, space,

comma,

period, colon,

EXERCISES

Can you

escape.

think of situations in which you might want to use

different delimiters for different fields?

Suppose you want to change the programs in section 4.1 to include


phone number field in each record. What changes need to be made?
3.

4. Suppose you need to keep a file in which every record has both fixedand variable-length fields. For example, suppose you want to create a file of
employee records, using fixed-length fields for each employee's ID
(primary key), sex, birthdate, and department, and using variable-length
fields for each name and address. What advantages might there be to using
such a structure? Should we put the variable-length portion first or last?
Either approach is possible; how can each be implemented?
5.

One

record structure not described in

labeled record structure each field that

describing

ZP

and

its

is

this

chapter

represented

is

is

called labeled. In a

preceded by

a label

LN, FN, AD, CT, ST,


fixed-length fields for a name and

contents. For example, if the labels

are used to describe the six

address record,

it

might appear

as follows:

LNAmesbbbbbbFNJohnbbbbbbAD123 Map 1 ebbbbbbCTSt

i 1 1

Under what conditions might

even desirable, record

this

be

a reasonable,

water STDKZP74075bbbb

structure?
6.

Define the terms stream of bytes, stream offields, and stream of records.

7. Find out what basic file structures are


programming language that you are currently

available

to

you

in

the

using. For example, does

your language recognize a sequence-of-bytes structure? Does it recognize


of text? Record blocking? For those types of structures that your
language does not recognize, describe how you might implement them
using structures that your language does recognize.
lines

8. Report on the basic field and record structures available in PL/I or COBOL.

9. Compare the use of ASCII characters to represent everything in a file with the use of binary and ASCII data mixed together.

10. If you list the contents of a file containing both binary and ASCII characters on your terminal screen, what results can you expect? What happens when you list a completely binary file on your screen? (Warning: If you actually try this, do so with a very small file. You could lock up or reconfigure your terminal, or even log yourself off!)

11. If a key in a record is already in canonical form and the key is the first field of the record, it is possible to search for a record by key without ever separating out the key field from the rest of the fields. Explain.

1985) that primary keys should be


unchanging, unambiguous, and unique." These concepts are
interrelated since, for example, a key that contains data runs a greater risk
12.

has been suggested (Sweet,

It

"dataless,

of changing than a dataless key. Discuss the importance of each of these


concepts, and show by example how their absence can cause problems. The
primary key used in our example file violates at least one of the criteria.
How might you redesign the file (and possibly its corresponding information content) so primary keys satisfy these criteria?
13. How many comparisons would be required on average to find a record using sequential search in a 10,000-record disk file? If the record is not in the file, how many comparisons are required? If the file is blocked so 20 records are stored per block, how many disk accesses are required on average? What if only one record is stored per block?

14. In our evaluation of performance for sequential search, we assume that every read results in a seek. How do the assumptions change on a single-user machine with access to a magnetic disk? How do these changed assumptions affect the analysis of sequential searching?

15. Look up the differences between the UNIX commands grep, egrep, and fgrep. Why are they different? What motivates the differences?

16. Give a formula for finding the byte offset of a fixed-length record in which the RRN of the first record is 1 rather than 0.

17. Why is a variable-length record structure unworkable for the update program? Does it help if we have an index that points to the beginning of each variable-length record?

18. The update program lets the user change records, but not delete records. How must the file structure and access procedures be modified to allow for deletion if we do not care about reusing the space from deleted records? How do the file structures and procedures change if we do want to reuse the space?
19. In our discussion of the uses of relative record numbers (RRNs), we suggest that you can create a file in which there is a direct correspondence between a primary key, such as membership number, and RRN, so we can find a person's record by knowing just the name or membership number. What kinds of difficulties can you envision with this simple correspondence between membership number and RRN? What happens if we want to delete a name? What happens if we change the information in a record in a variable-length record file and the new record is longer?

20. The following file dump describes the first few bytes from a file of the type produced by the first version of writrec, but the right-hand column is not filled in. How long is the first record? What are its contents?

0000000 00264475 6D707C46 7265647C 38323120
0000020 4B6C7567 657C4861 636B6572 7C50417C
0000040 36353533 357C2E2E 48657861 64656369
21. Assume that we have a variable-length record file with long records (greater than 1,000 bytes each, on the average). Assume that we are looking for a record with a particular RRN. Describe the benefits of using the contents of a byte count field to skip sequentially from record to record to find the one we want. This is called skip sequential processing. Use your knowledge of system buffering to describe why this is useful only for long records. If the records are sorted in order by key and blocked, what information do you have to place at the start of each block to permit even faster skip sequential processing?

22. Suppose you have a fixed-length record with fixed-length fields, and the sum of the field lengths is 30 bytes. A record with a length of 30 bytes would hold them all. If we intend to store the records on a sectored disk with 512-byte sectors (see Chapter 3), we might decide to pad the record out to 32 bytes so we can place an integral number of records in a sector. Why would we want to do this?

23. Why is it important to distinguish between file access and file organization?

24. What is an abstract data model? Why did the early file processing programs not deal with abstract data models? What are the advantages of using abstract data models in applications? In what way does the UNIX concept of standard input and standard output conform to the notion of an abstract data model? (See "Physical and Logical Files in UNIX" in Chapter 2.)

25. What is metadata?

26. In the FITS header in Fig. 4.15, some metadata provides information about the file's structure, and some provides information about the scientific context in which the corresponding image was recorded. Give three examples of each.

27. In the FITS header in Fig. 4.15, there must be enough information for a program to determine how to read the entire file. Assuming that the size of the block containing the header is a multiple of 2,880 bytes, how large is the file? What proportion of the file contains header information?

28. In the discussion of field organization, we list the "keyword = value" construct as one possible type of field organization. How is this notion applied in tagged file structures? How does tagged file structure support object-oriented file access? How do tagged file formats support extensibility?

29. List three factors that affect portability in files.

30. List three ways that portability can be achieved in files.

31. What is XDR? XDR is actually much more extensive than what we described in this chapter. If you have access to XDR documentation (see "Further Readings" at the end of this chapter), look up XDR and list the ways that it supports portability.

32. In Fig. 4.2, we see two possible record structures for our address file, one based on C and one based on Pascal. Discuss portability problems that might arise from using these record structures in a heterogeneous computing environment. (Hint: Some compilers allocate space for character fields starting on word boundaries, and others do not.)

Programming Exercises

33. Rewrite writstrm so it uses delimiters as field separators. The output of the new version of writstrm should be readable by readstrm.c or readstrm.pas.

34. Create versions of writrec and readrec that use the following fixed-field lengths rather than delimiters.

Last name:   15 characters
First name:  15 characters
Address:     30 characters
City:        20 characters
State:        2 characters
Zip:          5 characters

35. Write the program described in the preceding problem so it uses blocks. Make it store five records per block.

36. Implement the program find.

37. Rewrite the program find so it can find a record on the basis of its position in the file. For example, if requested to find the 547th record in a file, it would read through the first 546 records, then print the contents of the 547th record. Use skip sequential search (see exercise 21) to avoid reading the contents of unwanted records.

38. Write a program similar to find, but with the following differences. Instead of getting record keys from the keyboard, the program reads them from a separate transaction file that contains only the keys of the records to be extracted. Instead of printing the records on the screen, it writes them out to a separate output file. First, assume that the records are in no particular order. Then assume that both the main file and the transaction file are sorted by key. In the latter case, how can you make your program more efficient than find?
39. Make any or all of the following alterations to update.pas or update.c.
a. Let the user identify the record to be changed by name, rather than RRN.
b. Let the user change individual fields without having to change an entire record.
c. Let the user choose to view the entire file.

40. Modify update.c or update.pas to signal the user when a record exceeds the fixed-record length. The modification should allow the user to bring the record down to an acceptable size and input it again. What are some other modifications that would make the program more robust?

41. Change update.c or update.pas to a batch program that reads a transaction file in which each transaction record contains an RRN of a record that is to be updated, followed by the new contents of the record, and then makes the changes in a batch run. Although not necessary, it might be desirable to sort the transaction file by RRN. Why?

42. Write a program that reads a file and outputs the file contents as a file dump. The dump should have a format similar to the one used in the examples in this chapter. The program should accept the name of the input file on the command line. Output should be to standard output (terminal screen).

43. Develop a set of rules for translating the dates August 7, 1949, Aug. 7, 1949, 8-7-49, 08-07-49, 8/7/49, and other, similar variations into a common canonical form. Write a function that accepts a string containing a date in one of these forms and returns the canonical form, according to your rules. Be sure to document the limitations of your rules and function.

44. Write a program to read in a FITS file and print
a. The size of the image (e.g., 256 by 256)
b. The title of the image
c. The telescope used to make the image
d. The date the image file was created
e. The average pixel value in the image (use BSCALE and BZERO)

FURTHER READINGS

Many textbooks cover basic material on field and record structure design, but only a few go into the options and design considerations in much detail. Teorey and Fry (1982) and Wiederhold (1983) are two possible sources. Hanson's (1982) chapter, "Choice of File Organization," is excellent but is more meaningful after you read the material in the later chapters of this text. You can learn a lot about alternative types of file organization and access by studying descriptions of options available in certain languages and file management systems. PL/I offers a particularly rich set of alternatives, and Pollack and Sterling (1980) describe them thoroughly.

Sweet (1985) is a short but stimulating article on key field design. A number of interesting algorithms for improving performance in sequential searches are described in Gonnet (1984) and, of course, Knuth (1973b). Lapin (1987) provides a detailed coverage of portability in UNIX and C programming. For our coverage of XDR, we used the documentation in Sun (1986).

Our primary source of information on FITS is not formally printed text, but online materials. A good paper defining the original FITS format is Wells (1981). The FITS image and FITS header shown in this chapter, as well as the documentation of how FITS works, can (at the time of writing, at least) be found on an anonymous ftp server at the INTERNET address 128.183.10.4.

C Programs

The programs listed in the following pages correspond to the programs discussed in the text. The programs are contained in the following files.

writstrm.c    Writes out name and address information as a stream of consecutive bytes.

readstrm.c    Reads a stream file as input and prints it to the screen.

writrec.c     Writes a variable-length record file that uses a byte count at the beginning of each record to give its length.

readrec.c     Reads through a file, record by record, displaying the fields from each of the records on the screen.

getrf.c       Contains support functions for reading individual records or fields. These functions are needed by programs in readrec.c and find.c.

find.c        Searches sequentially through a file for a record with a particular key.

makekey.c     Combines first and last names and converts them to a key in canonical form. Calls strtrim( ) and ucase( ), found in strfuncs.c.

strfuncs.c    Contains two string support functions: strtrim( ) trims the blanks from the ends of strings; ucase( ) converts alphabetic characters to uppercase.

update.c      Allows new records to be added to a file or old records to be changed.

Fileio.h

All of the programs include a header file called fileio.h, which contains some useful definitions. Some of these are system dependent. If the programs were to be run on a UNIX system, fileio.h might look like this:

/* fileio.h ... header file containing file I/O definitions */

#include <stdio.h>
#include <fcntl.h>

#define PMODE      0755
#define DELIM_STR  "|"
#define DELIM_CHR  '|'

#define out_str(fd,s)          write((fd),(s),strlen(s)); \
                               write((fd),DELIM_STR,1)

#define fld_to_recbuff(rb,s)   strcat((rb),(s)); \
                               strcat((rb),DELIM_STR)

#define MAX_REC_SIZE 512

Writstrm.c

/* writstrm.c ...
   creates name and address file that is strictly a stream of
   bytes (no delimiters, counts, or other information to
   distinguish fields and records).

   A simple modification to the out_str macro:

       #define out_str(fd,s)   write((fd),(s),strlen(s)); \
                               write((fd),DELIM_STR,1);

   changes the program so that it creates delimited fields.
*/

#include "fileio.h"
#define out_str(fd,s)   write((fd),(s),strlen(s))

main()
{
    char first[30], last[30], address[30], city[20];
    char state[15], zip[9];
    char filename[15];
    int  fd;

    printf("Enter the name of the file you wish to create: ");
    gets(filename);
    if ((fd = creat(filename, PMODE)) < 0) {
        printf("file opening error --- program stopped\n");
        exit(1);
    }

    printf("\n\nType in a last name (surname), or <CR> to exit\n>>>");
    gets(last);
    while (strlen(last) > 0) {
        printf("\nFirst Name:");
        gets(first);
        printf("   Address:");
        gets(address);
        printf("      City:");
        gets(city);
        printf("     State:");
        gets(state);
        printf("       Zip:");
        gets(zip);

        /* output the strings to the buffer and then to the file */
        out_str(fd, last);
        out_str(fd, first);
        out_str(fd, address);
        out_str(fd, city);
        out_str(fd, state);
        out_str(fd, zip);

        /* prepare for next entry */
        printf("\n\nType in a last name (surname), or <CR> to exit\n>>>");
        gets(last);
    }

    /* close the file before leaving */
    close(fd);
}
Readstrm.c

/* readstrm.c ...
   reads a stream of delimited fields
*/

#include "fileio.h"

int readfield(int fd, char s[]);

main()
{
    int  fd, n;
    char s[30];
    char filename[15];
    int  fld_count;

    printf("Enter name of file to read: ");
    gets(filename);
    if ((fd = open(filename, O_RDONLY)) < 0) {
        printf("file opening error --- program stopped\n");
        exit(1);
    }

    /* main program loop -- calls readfield( ) for as long
       as the function succeeds                             */
    fld_count = 0;
    while ((n = readfield(fd, s)) > 0)
        printf("\tfield # %3d: %s\n", ++fld_count, s);

    close(fd);
}

int readfield(int fd, char s[])
{
    int  i;
    char c;

    i = 0;
    while (read(fd, &c, 1) > 0 && c != DELIM_CHR)
        s[i++] = c;
    s[i] = '\0';          /* append null to end string */
    return (i);
}
Writrec.c

/* writrec.c ...
   creates name and address file using a fixed length (2-byte)
   record length field ahead of each record
*/

#include "fileio.h"

char recbuff[MAX_REC_SIZE + 1];
char *prompt[] = {
    "Enter Last Name -- or <CR> to exit: ",
    "    First name: ",
    "       Address: ",
    "          City: ",
    "         State: ",
    "           Zip: ",
    ""                 /* null string to terminate the prompt loop */
};

main()
{
    int   fd, i;
    short rec_lgth;
    char  response[50];
    char  filename[15];

    printf("Enter the name of the file you wish to create: ");
    gets(filename);
    if ((fd = creat(filename, PMODE)) < 0) {
        printf("file opening error --- program stopped\n");
        exit(1);
    }

    printf("\n\n%s", prompt[0]);
    gets(response);
    while (strlen(response) > 0) {
        recbuff[0] = '\0';
        fld_to_recbuff(recbuff, response);
        for (i = 1; *prompt[i] != '\0'; i++) {
            printf("%s", prompt[i]);
            gets(response);
            fld_to_recbuff(recbuff, response);
        }

        /* write out the record length and buffer contents */
        rec_lgth = strlen(recbuff);
        write(fd, &rec_lgth, sizeof(rec_lgth));
        write(fd, recbuff, rec_lgth);

        /* prepare for next entry */
        printf("\n\n%s", prompt[0]);
        gets(response);
    }

    /* close the file before leaving */
    close(fd);
}

/* question:
   How does the termination condition work in the for loop:
       for (i=1; *prompt[i] != '\0'; i++)
   What does the '\0' refer to?  Why do we need the "*"?
*/
Readrec.c

/* readrec.c ...
   reads through a file, record by record, displaying the
   fields from each of the records on the screen.
*/

#include "fileio.h"

main()
{
    int   fd, rec_count, fld_count, scan_pos;
    short rec_lgth;
    char  filename[15];
    char  recbuff[MAX_REC_SIZE + 1];
    char  field[MAX_REC_SIZE + 1];

    printf("Enter name of file to read: ");
    gets(filename);
    if ((fd = open(filename, O_RDONLY)) < 0) {
        printf("file opening error --- program stopped\n");
        exit(1);
    }

    rec_count = 0;
    scan_pos  = 0;
    while ((rec_lgth = get_rec(fd, recbuff)) > 0) {
        printf("Record %d\n", ++rec_count);
        fld_count = 0;
        while ((scan_pos = get_fld(field, recbuff, scan_pos, rec_lgth)) > 0)
            printf("\tField %d: %s\n", ++fld_count, field);
    }
    close(fd);
}

/* question -- why can we assign 0 to scan_pos just once, outside
   of the while loop for records?                                  */
Getrf.c
/* getrf.c ...
   Two functions used by programs in readrec.c and find.c:

   get_rec( )   reads a variable length record from file fd
                into the character array recbuff.
   get_fld( )   moves a field from recbuff into the character
                array field, inserting a '\0' to make it a
                string.
*/

#include "fileio.h"

get_rec(int fd, char recbuff[])
{
    short rec_lgth;

    if (read(fd, &rec_lgth, 2) == 0)            /* get record length */
        return (0);                             /* return 0 if EOF   */
    rec_lgth = read(fd, recbuff, rec_lgth);     /* read record       */
    return (rec_lgth);
}

get_fld(char field[], char recbuff[], short scan_pos, short rec_lgth)
{
    short fpos = 0;                 /* position in "field" array */

    if (scan_pos == rec_lgth)       /* if no more fields to read, */
        return (0);                 /* return scan_pos of 0.      */

    /* scanning loop */
    while (scan_pos < rec_lgth &&
           (field[fpos++] = recbuff[scan_pos++]) != DELIM_CHR)
        ;

    if (field[fpos-1] == DELIM_CHR)   /* if last character is a field   */
        field[--fpos] = '\0';         /* delimiter, replace with null   */
    else
        field[fpos] = '\0';           /* otherwise, just ensure that
                                         the field is null-terminated   */
    return (scan_pos);   /* return position of start of next field */
}


Find.c
/* find.c ...
   searches sequentially through a file for a record with a
   particular key.
*/

#include "fileio.h"
#define TRUE  1
#define FALSE 0

main()
{
    int   fd, scan_pos;
    short rec_lgth;
    int   matched;
    char  search_key[30], key_found[30];
    char  last[30], first[30];
    char  filename[15];
    char  recbuff[MAX_REC_SIZE + 1];
    char  field[MAX_REC_SIZE + 1];

    printf("Enter name of file to search: ");
    gets(filename);
    if ((fd = open(filename, O_RDONLY)) < 0) {
        printf("file opening error --- program stopped\n");
        exit(1);
    }

    /* get search key */
    printf("\n\nEnter last name: ");
    gets(last);
    printf("\nEnter first name: ");
    gets(first);
    makekey(last, first, search_key);

    matched = FALSE;
    while (!matched && (rec_lgth = get_rec(fd, recbuff)) > 0) {
        scan_pos = 0;
        scan_pos = get_fld(last, recbuff, scan_pos, rec_lgth);
        scan_pos = get_fld(first, recbuff, scan_pos, rec_lgth);
        makekey(last, first, key_found);
        if (strcmp(key_found, search_key) == 0)
            matched = TRUE;
    }

    /* if record found, print the fields */
    if (matched) {
        printf("\n\nRecord found:\n\n");
        scan_pos = 0;
        /* break out the fields */
        while ((scan_pos = get_fld(field, recbuff, scan_pos, rec_lgth)) > 0)
            printf("\t%s\n", field);
    }
    else
        printf("\n\nRecord not found.\n");
}

/* questions:
   - why does scan_pos get set to zero inside the while loop here?
   - what would happen if we wrote the loop that reads records
     like this:  while ((rec_lgth = get_rec(fd,recbuff)) > 0 && !matched)
*/

Makekey.c

/* makekey(last, first, s) ...
   function to make a key from the first and last names passed
   through the function's arguments.  Returns the key in
   canonical form through the address passed through the
   argument s.  Calling routine is responsible for ensuring
   that s is large enough to hold the return string.
   Value returned through the function name is the length of
   the string returned through s.
*/

makekey(char last[], char first[], char s[])
{
    int lenl, lenf;

    lenl = strtrim(last);     /* trim the last name              */
    strcpy(s, last);          /* place it in the return string   */
    s[lenl++] = ' ';          /* append a blank at the end       */
    s[lenl]   = '\0';
    lenf = strtrim(first);    /* trim the first name             */
    strcat(s, first);         /* append it to the string         */
    ucase(s, s);              /* convert everything to uppercase */
    return (lenl + lenf);
}


Strfuncs.c

/* strfuncs.c ...
   module containing the following functions:

   strtrim(s)    trims blanks from the end of the (null-terminated)
                 string referenced by the string address s.  When
                 done, the parameter s points to the trimmed string.
                 The function returns the length of the trimmed
                 string.

   ucase(si,so)  converts all lowercase alphabetic characters in
                 the string at address si into uppercase characters,
                 returning the converted string through the address
                 so.
*/

strtrim(char s[])
{
    int i;

    for (i = strlen(s) - 1; i >= 0 && s[i] == ' '; i--)
        ;
    /* now that the blanks are trimmed, reaffix null on the end
       to form a string */
    s[++i] = '\0';
    return (i);
}

ucase(char si[], char so[])
{
    while (*so++ = (*si >= 'a' && *si <= 'z') ? *si & 0x5f : *si)
        si++;
}
Update.c

/* update.c ...
   program to open or create a fixed length record file for
   updating.  Records may be added or changed.  Records to be
   changed must be accessed by relative record number.
*/

#include "fileio.h"
#define REC_LGTH 64

static char *prompt[] = {"     Last Name: ",
                         "    First name: ",
                         "       Address: ",
                         "          City: ",
                         "         State: ",
                         "           Zip: ",
                         ""};

static int fd;
static struct {
    short rec_count;
    char  fill[30];
} head;

static menu();
static ask_info(char recbuff[]);
static ask_rrn();
static read_and_show();
static change();

main()
{
    int  menu_choice, rrn, byte_pos;
    char filename[15];
    long lseek();
    char recbuff[MAX_REC_SIZE + 1];      /* buffer to hold a record */

    printf("Enter the name of the file: ");
    gets(filename);
    if ((fd = open(filename, O_RDWR)) < 0) {   /* if OPEN fails       */
        fd = creat(filename, PMODE);           /* then CREAT          */
        head.rec_count = 0;                    /* initialize header   */
        write(fd, &head, sizeof(head));        /* write header rec    */
    }
    else   /* existing file opened -- read in header */
        read(fd, &head, sizeof(head));

    /* main program loop -- call menu and then jump to options */
    while ((menu_choice = menu()) < 3) {
        switch (menu_choice) {
          case 1:                              /* add a new record */
            printf("Input the information for the new record --\n\n");
            ask_info(recbuff);
            byte_pos = head.rec_count * REC_LGTH + sizeof(head);
            lseek(fd, (long) byte_pos, 0);
            write(fd, recbuff, REC_LGTH);
            head.rec_count++;
            break;
          case 2:                              /* update existing record */
            rrn = ask_rrn();
            /* if rrn is too big, print error message ... */
            if (rrn >= head.rec_count) {
                printf("Record Number is too large");
                printf(" ... returning to menu ...");
                break;
            }
            /* otherwise, seek to the record ... */
            byte_pos = rrn * REC_LGTH + sizeof(head);
            lseek(fd, (long) byte_pos, 0);
            /* ... display it and ask for changes */
            read_and_show();
            if (change()) {
                printf("\n\nInput the revised Values --\n\n");
                ask_info(recbuff);
                lseek(fd, (long) byte_pos, 0);
                write(fd, recbuff, REC_LGTH);
            }
            break;
        }   /* end switch */
    }       /* end while  */

    /* rewrite correct record count to header before leaving */
    lseek(fd, 0L, 0);
    write(fd, &head, sizeof(head));
    close(fd);
}

/* menu( ) ...
   local function to ask user for next operation.
   Returns numeric value of user response          */
static menu()
{
    int  choice;
    char response[10];

    printf("\n\n\n\n          FILE UPDATING PROGRAM\n");
    printf("\n\nYou May Choose to:\n\n");
    printf("\t1.  Add a record to the end of the file\n");
    printf("\t2.  Retrieve a record for Updating\n");
    printf("\t3.  Leave the Program\n\n");
    printf("Enter the number of your choice: ");
    gets(response);
    choice = atoi(response);
    return (choice);
}

/* ask_info( ) ...
   local function to accept input of name and address fields,
   writing them to the buffer passed as a parameter            */
static ask_info(char recbuff[])
{
    int  i, field_count;
    char response[50];

    /* clear the record buffer */
    for (i = 0; i < REC_LGTH; recbuff[i++] = '\0')
        ;
    /* get the fields */
    for (i = 0; *prompt[i] != '\0'; i++) {
        printf("%s", prompt[i]);
        gets(response);
        fld_to_recbuff(recbuff, response);
    }
}

/* ask_rrn( ) ...
   local function to ask for the relative record number of the
   record that is to be updated.                                */
static ask_rrn()
{
    int  rrn;
    char response[10];

    printf("\n\nInput the Relative Record Number of the Record that\n");
    printf("\tyou want to update: ");
    gets(response);
    rrn = atoi(response);
    return (rrn);
}

/* read_and_show( ) ...
   local function to read and display a record.  Note that this
   function does not include a seek -- reading starts at the
   current position in the file                                  */
static read_and_show()
{
    char recbuff[MAX_REC_SIZE + 1];
    char field[MAX_REC_SIZE + 1];
    int  scan_pos, data_lgth;

    read(fd, recbuff, REC_LGTH);
    printf("\n\n\n\nExisting Record Contents\n");
    recbuff[REC_LGTH] = '\0';   /* ensure that record ends with null */
    data_lgth = strlen(recbuff);
    scan_pos = 0;
    while ((scan_pos = get_fld(field, recbuff, scan_pos, data_lgth)) > 0)
        printf("\t%s\n", field);
}

/* change( ) ...
   local function to ask user whether or not he wants to change
   the record.  Returns 1 if the answer is yes, 0 otherwise      */
static change()
{
    char response[10];

    printf("\n\nDo you want to change this record?\n");
    printf("    Answer Y or N, followed by <CR> ==>");
    gets(response);
    ucase(response, response);
    return ((response[0] == 'Y') ? 1 : 0);
}

Pascal Programs

The Pascal programs listed in the following pages correspond to the programs discussed in the text. Each program is organized into one or more files, as follows.

writstrm.pas  Writes out name and address information as a stream of consecutive bytes.

readstrm.pas  Reads a stream file as input and prints it to the screen.

writrec.pas   Writes a variable length record file that uses a byte count at the beginning of each record to give its length.

readrec.pas   Reads through a file, record by record, displaying the fields from each of the records on the screen.

get.prc       Supports functions for reading individual records or fields. These functions are needed by the program in readrec.pas.

find.pas      Searches sequentially through a file for a record with a particular key.

update.pas    Allows new records to be added to a file, or old records to be changed.

stod.prc      Support function for update.pas, which converts a variable of type strng to a variable of type datarec.

In addition to these files, there is a file called tools.prc, which contains the tools for operating on variables of type strng. A listing of tools.prc is contained in an appendix at the end of the textbook.

We have added line numbers to some of these Pascal listings to assist the reader in finding specific program statements. The files that contain Pascal functions or procedures but do not contain main programs are given the extension .prc, as in get.prc and stod.prc.

Writstrm.pas

Some things to note about writstrm.pas:

The comment {$B-} on line 6 is a directive to the Turbo Pascal compiler, instructing it to handle keyboard input as a standard Pascal file. Without this directive we would not be able to handle the len_str() function properly in the WHILE loop on line 36.

The comment {$I tools.prc} on line 24 is also a directive to the Turbo Pascal compiler, instructing it to include the file tools.prc in the compilation. The procedures read_str, len_str, and fwrite_str are in the file tools.prc.

Although Turbo Pascal supports a special string type, we choose not to use that type here to come closer to conforming to standard Pascal. Instead, we create our own strng type, which is a packed array [0..MAX_REC_SIZE] of char. The length of the strng is stored in the zeroth byte of the array as a character value. If X is the character value in the zeroth byte of the array, then ORD(X) is the length of the string.

The assign statement on line 31 is nonstandard. It is a Turbo Pascal procedure, which, in this case, assigns filename to outfile, so all further operations on outfile will operate on the disk file.

PROGRAM writstrm (INPUT, OUTPUT);

{ writes out name and address information as a stream of
  consecutive bytes }

{$B-}  { Directive to the Turbo Pascal compiler, instructing it to
         handle keyboard input as a standard Pascal file }

CONST
    DELIM_CHR    = '|';
    MAX_REC_SIZE = 255;

TYPE
    strng    = packed array [0..MAX_REC_SIZE] of char;
    inp_list = (last, first, address, city, state, zip);
    filetype = packed array [1..40] of char;

VAR
    response  : array [inp_list] of strng;
    resp_type : inp_list;
    filename  : filetype;
    outfile   : text;

{$I tools.prc}  { Another directive, instructing the compiler to include
                  the file tools.prc }

BEGIN {main}
    write('Enter the name of the file: ');
    readln(filename);
    assign(outfile, filename);
    rewrite(outfile);

    write('Type in a last name, or press <CR> to exit: ');
    read_str(response[last]);
    while (len_str(response[last]) > 0) DO
    BEGIN
        { get all the input for one person }
        write('  First Name: ');
        read_str(response[first]);
        write('     Address: ');
        read_str(response[address]);
        write('        City: ');
        read_str(response[city]);
        write('       State: ');
        read_str(response[state]);
        write('         Zip: ');
        read_str(response[zip]);

        { write the responses to the file }
        for resp_type := last TO zip DO
            fwrite_str(outfile, response[resp_type]);

        { start the next round of input }
        write('Type in a last name, or press <CR> to exit: ');
        read_str(response[last])
    END;
    close(outfile)
END.

Readstrm.pas

PROGRAM readstrm (INPUT, OUTPUT);

{ A program that reads a stream file (fields separated by
  delimiters) as input and prints it to the screen }

CONST
    DELIM_CHR    = '|';
    MAX_REC_SIZE = 255;

TYPE
    strng    = packed array [0..MAX_REC_SIZE] of char;
    filetype = packed array [1..40] of char;

VAR
    filename  : filetype;
    infile    : text;
    fld_count : integer;
    fld_len   : integer;
    str       : strng;

{$I tools.prc}

FUNCTION readfield (VAR infile : text; VAR str : strng) : integer;

{ Function readfield reads characters from file infile until it
  reaches end of file or a '|'.  Readfield puts the characters in
  str and returns the length of str }

VAR
    i  : integer;
    ch : char;
BEGIN
    i  := 0;
    ch := ' ';
    while (not EOF(infile)) and (ch <> DELIM_CHR) DO
    BEGIN
        read(infile, ch);
        i := i + 1;
        str[i] := ch
    END;
    i := i - 1;
    str[0] := CHR(i);
    readfield := i
END;

BEGIN {MAIN}
    write('Enter the name of the file that you wish to open: ');
    readln(filename);
    assign(infile, filename);
    reset(infile);

    fld_count := 0;
    fld_len := readfield(infile, str);
    while (fld_len > 0) DO
    BEGIN
        fld_count := fld_count + 1;
        write('   field # ', fld_count:2, ': ');
        write_str(str);                   { write_str() is in tools.prc }
        fld_len := readfield(infile, str)
    END;
    close(infile)
END.

Writrec.pas

Note about writrec.pas: After writing the rec_lgth to outfile on line 69, we write a space to the file. This is because in Pascal values to be read into integer variables must be separated by spaces, tabs, or end-of-line markers.

PROGRAM writrec (INPUT, OUTPUT);

{$B-}

CONST
    DELIM_CHR    = '|';
    MAX_REC_SIZE = 255;

TYPE
    strng    = packed array [0..MAX_REC_SIZE] of char;
    filetype = packed array [1..40] of char;

VAR
    filename : filetype;
    outfile  : text;
    response : strng;
    buffer   : strng;
    rec_lgth : integer;

{$I tools.prc}

PROCEDURE fld_to_buffer (VAR buff : strng; s : strng);

{ This procedure concatenates s and a delimiter to end of buff }

VAR
    d_str : strng;
BEGIN
    cat_str(buff, s);
    d_str[0] := CHR(1);
    d_str[1] := DELIM_CHR;
    cat_str(buff, d_str)
END;

BEGIN {main}
    write('Enter the name of the file you wish to create: ');
    readln(filename);
    assign(outfile, filename);
    rewrite(outfile);

    write('Enter Last Name -- or <CR> to exit: ');
    read_str(response);
    while (len_str(response) > 0) DO
    BEGIN
        buffer[0] := CHR(0);     { Set length of string in buffer to 0 }
        fld_to_buffer(buffer, response);

        write('  First name: ');
        read_str(response);
        fld_to_buffer(buffer, response);
        write('     Address: ');
        read_str(response);
        fld_to_buffer(buffer, response);
        write('        City: ');
        read_str(response);
        fld_to_buffer(buffer, response);
        write('       State: ');
        read_str(response);
        fld_to_buffer(buffer, response);
        write('         Zip: ');
        read_str(response);
        fld_to_buffer(buffer, response);

        { write out the record length and buffer contents }
        rec_lgth := len_str(buffer);
        write(outfile, rec_lgth);
        write(outfile, ' ');
        fwrite_str(outfile, buffer);

        { prepare for next entry }
        write('Enter Last Name -- or <CR> to exit: ');
        read_str(response)
    END;
    close(outfile)
END.

Readrec.pas

PROGRAM readrec (INPUT, OUTPUT);

{ This program reads through a file, record by record, displaying
  the fields from each of the records on the screen. }

{$B-}

CONST
    input_size   = 255;
    DELIM_CHR    = '|';
    MAX_REC_SIZE = 255;

TYPE
    strng    = packed array [0..input_size] of char;
    filetype = packed array [1..40] of char;

VAR
    filename  : filetype;
    outfile   : text;
    rec_count : integer;
    scan_pos  : integer;
    rec_lgth  : integer;
    fld_count : integer;
    buffer    : strng;
    field     : strng;

{$I tools.prc}
{$I get.prc}

BEGIN {main}
    write('Enter name of file to read: ');
    readln(filename);
    assign(outfile, filename);
    reset(outfile);

    rec_count := 0;
    scan_pos  := 0;
    rec_lgth := get_rec(outfile, buffer);
    while rec_lgth > 0 DO
    BEGIN
        writeln('Record ', rec_count);
        rec_count := rec_count + 1;
        fld_count := 0;
        scan_pos := get_fld(field, buffer, scan_pos, rec_lgth);
        while scan_pos > 0 DO
        BEGIN
            write('   Field ', fld_count, ': ');
            write_str(field);
            fld_count := fld_count + 1;
            scan_pos := get_fld(field, buffer, scan_pos, rec_lgth)
        END;
        rec_lgth := get_rec(outfile, buffer)
    END;
    close(outfile)
END.
Get.prc

FUNCTION get_rec (VAR fd : text; VAR buffer : strng) : integer;

{ A function that reads a record and its length from file fd.
  The function returns the length of the record.  If EOF is
  encountered get_rec() returns 0 }

VAR
    rec_lgth : integer;
    space    : char;
BEGIN
    if EOF(fd) then
        get_rec := 0
    else
    BEGIN
        read(fd, rec_lgth);
        read(fd, space);
        fread_str(fd, buffer, rec_lgth);
        get_rec := rec_lgth
    END
END;

FUNCTION get_fld (VAR field : strng; VAR buffer : strng;
                  VAR scanpos : integer; rec_lgth : integer) : integer;

{ A function that starts reading at scanpos and reads characters
  from the buffer until it reaches a delimiter or the end of the
  record.  It returns scanpos for use on the next call. }

VAR
    fpos : integer;
BEGIN
    if scanpos = rec_lgth then
        get_fld := 0
    else
    BEGIN
        fpos := 1;
        scanpos := scanpos + 1;
        field[fpos] := buffer[scanpos];
        while (field[fpos] <> DELIM_CHR) and (scanpos < rec_lgth) DO
        BEGIN
            fpos := fpos + 1;
            scanpos := scanpos + 1;
            field[fpos] := buffer[scanpos]
        END;
        if field[fpos] = DELIM_CHR then
            field[0] := CHR(fpos - 1)
        else
            field[0] := CHR(fpos);
        get_fld := scanpos
    END
END;

Find.pas

PROGRAM find (INPUT, OUTPUT);

{ This program reads through a file, record by record, looking
  for a record with a particular key.  If a match occurs, then
  all the fields in the record are displayed.  Otherwise a message
  is displayed indicating that the record was not found. }

{$B-}

CONST
    MAX_REC_SIZE = 255;
    DELIM_CHR    = '|';

TYPE
    strng    = packed array [0..MAX_REC_SIZE] of char;
    filetype = packed array [1..40] of char;

VAR
    filename   : filetype;
    outfile    : text;
    last       : strng;
    first      : strng;
    search_key : strng;
    length     : integer;
    matched    : boolean;
    rec_lgth   : integer;
    buffer     : strng;
    scan_pos   : integer;
    key_found  : strng;
    field      : strng;

{$I tools.prc}
{$I get.prc}

BEGIN {main}
    write('Enter name of file to search: ');
    readln(filename);
    assign(outfile, filename);
    reset(outfile);

    write('Enter last name: ');
    read_str(last);
    write('Enter first name: ');
    read_str(first);
    makekey(last, first, search_key);

    matched := FALSE;
    rec_lgth := get_rec(outfile, buffer);
    while ((not matched) and (rec_lgth > 0)) DO
    BEGIN
        scan_pos := 0;
        scan_pos := get_fld(last, buffer, scan_pos, rec_lgth);
        scan_pos := get_fld(first, buffer, scan_pos, rec_lgth);
        makekey(last, first, key_found);
        if cmp_str(key_found, search_key) = 0 then
            matched := TRUE
        else
            rec_lgth := get_rec(outfile, buffer)
    END;
    close(outfile);

    { if record found, print the fields }
    if matched then
    BEGIN
        writeln('Record found:');
        scan_pos := 0;
        { break out the fields }
        scan_pos := get_fld(field, buffer, scan_pos, rec_lgth);
        while scan_pos > 0 DO
        BEGIN
            write_str(field);
            scan_pos := get_fld(field, buffer, scan_pos, rec_lgth)
        END
    END
    else
        writeln('Record not found.')
END.

Update.pas

Some things to note about update.pas:

In the procedure ask_info(), the name and address fields are read in as strngs, and procedure fld_to_buffer() writes the fields to strbuff (also of type strng). Writing strbuff to outfile would result in a type mismatch, since outfile is a file of type datarec. However, the procedure stod(), located in stod.prc, converts a variable of type strng to a variable of type datarec to write the buffer to the file. The calls to stod() are located on lines 210 and 237.

The seek() statements on lines 212, 229, 239, and 250 are not standard; they are features of Turbo Pascal.

PROGRAM update (INPUT, OUTPUT);

{$B-}

{ program to open or create a fixed length record file for
  updating.  Records may be added or changed.  Records to be
  changed must be accessed by relative record number }

CONST
    MAX_REC_SIZE = 255;
    REC_LGTH     = 64;
    DELIM_CHR    = '|';

TYPE
    strng    = packed array [0..MAX_REC_SIZE] of char;
    filetype = packed array [1..40] of char;
    datarec  = RECORD
                   len  : integer;
                   data : packed array [1..REC_LGTH] of char
               END;

VAR
    filename    : filetype;
    outfile     : file of datarec;
    response    : char;
    menu_choice : integer;
    strbuff     : strng;
    bytepos     : integer;
    head        : datarec;
    rrn         : integer;
    drecbuff    : datarec;
    i           : integer;
    rec_count   : integer;

{$I tools.prc}
{$I stod.prc}
{$I get.prc}

PROCEDURE fld_to_buffer (VAR buff : strng; s : strng);

{ fld_to_buffer concatenates strng s and a delimiter to the
  end of buff }

VAR
    d_str : strng;
BEGIN
    cat_str(buff, s);
    d_str[0] := CHR(1);
    d_str[1] := DELIM_CHR;
    cat_str(buff, d_str)
END;

FUNCTION menu : integer;

{ local function to ask user for next operation.  Returns numeric
  value of user response }

VAR
    choice : integer;
BEGIN
    writeln;
    writeln('          FILE UPDATING PROGRAM');
    writeln;
    writeln('You May Choose to:');
    writeln;
    writeln('   1.  Add a record to the end of the file');
    writeln('   2.  Retrieve a record for updating');
    writeln('   3.  Leave the program');
    writeln;
    write('Enter the number of your choice: ');
    readln(choice);
    writeln;
    menu := choice
END;

PROCEDURE ask_info (VAR strbuff : strng);

{ local procedure to accept input of name and address fields,
  writing them to the buffer passed as a parameter }

VAR
    response : strng;
BEGIN
    clear_str(strbuff);            { clear the record buffer }

    { get the fields }
    write('    Last Name: ');
    read_str(response);
    fld_to_buffer(strbuff, response);
    write('   First Name: ');
    read_str(response);
    fld_to_buffer(strbuff, response);
    write('      Address: ');
    read_str(response);
    fld_to_buffer(strbuff, response);
    write('         City: ');
    read_str(response);
    fld_to_buffer(strbuff, response);
    write('        State: ');
    read_str(response);
    fld_to_buffer(strbuff, response);
    write('          Zip: ');
    read_str(response);
    fld_to_buffer(strbuff, response);
    writeln
END;

FUNCTION ask_rrn : integer;

{ function to ask for the relative record number of the record
  that is to be updated. }

VAR
    rrn : integer;
BEGIN
    writeln('Input the relative record number of the record that');
    write('   you want to update: ');
    readln(rrn);
    writeln;
    ask_rrn := rrn
END;

PROCEDURE read_and_show;

{ procedure to read and display a record.  This procedure does not
  include a seek -- reading starts at the current file position }

VAR
    scan_pos  : integer;
    drecbuff  : datarec;
    i         : integer;
    data_lgth : integer;
    field     : strng;
    strbuff   : strng;
BEGIN
    scan_pos := 0;
    read(outfile, drecbuff);

    { convert drecbuff to type strng }
    strbuff[0] := CHR(drecbuff.len);
    for i := 1 to drecbuff.len DO
        strbuff[i] := drecbuff.data[i];

    writeln('Existing Record Contents');
    writeln;

    data_lgth := len_str(strbuff);
    scan_pos := get_fld(field, strbuff, scan_pos, data_lgth);
    while scan_pos > 0 DO
    BEGIN
        write_str(field);
        scan_pos := get_fld(field, strbuff, scan_pos, data_lgth)
    END
END;

FUNCTION change : integer;

{ function to ask the user whether or not to change the
  record.  Returns 1 if the answer is yes, 0 otherwise. }

VAR
    response : char;
BEGIN
    writeln('Do you want to change this record?');
    write('   Answer Y or N, followed by <CR> ==>');
    readln(response);
    writeln;
    if (response = 'Y') or (response = 'y') then
        change := 1
    else
        change := 0
END;

BEGIN {main}
    write('Enter the name of the file: ');
    readln(filename);
    assign(outfile, filename);

    write('Does this file already exist? (respond Y or N): ');
    readln(response);
    writeln;
    if (response = 'Y') OR (response = 'y') then
    BEGIN
        reset(outfile);                  { open outfile            }
        read(outfile, head);             { get header              }
        rec_count := head.len            { read in record count    }
    END
    else
    BEGIN
        rewrite(outfile);                { create outfile          }
        rec_count := 0;                  { initialize record count }
        head.len := rec_count;           { place in header record  }
        for i := 1 to REC_LGTH DO        { set header data to nulls }
            head.data[i] := CHR(0);
        write(outfile, head)             { write header rec        }
    END;

    { main program loop -- call menu and then jump to options }
    menu_choice := menu;
    while menu_choice < 3 DO
    BEGIN
        CASE menu_choice OF
          1 : BEGIN    { add a new record }
                writeln('Input the information for the new record --');
                writeln;
                ask_info(strbuff);
                stod(drecbuff, strbuff);   { convert strbuff to type datarec }
                rrn := rec_count + 1;
                seek(outfile, rrn);
                write(outfile, drecbuff);
                rec_count := rec_count + 1
              END;
          2 : BEGIN    { update existing record }
                rrn := ask_rrn;
                { if rrn is too big, print error message ... }
                if (rrn > rec_count) or (rrn < 1) then
                BEGIN
                    write('Record Number is out of range');
                    writeln(' ... returning to menu ...')
                END
                else
                BEGIN
                    { otherwise, seek to the record ... }
                    seek(outfile, rrn);
                    { display it and ask for changes ... }
                    read_and_show;
                    if change = 1 then
                    BEGIN
                        writeln('Input the revised Values: ');
                        ask_info(strbuff);
                        stod(drecbuff, strbuff);  { convert strbuff to type
                                                    datarec }
                        seek(outfile, rrn);
                        write(outfile, drecbuff)
                    END
                END
              END
        END; { CASE }
        menu_choice := menu
    END; { while }

    { rewrite correct record count to header before leaving }
    head.len := rec_count;
    seek(outfile, 0);
    write(outfile, head);
    close(outfile)
END.

Stod.prc

PROCEDURE stod (VAR drecbuff : datarec; strbuff : strng);

{ A procedure that converts a variable of type strng to a
  variable of type datarec }

VAR
    i : integer;
BEGIN
    drecbuff.len := min(REC_LGTH, len_str(strbuff));
    for i := 1 to drecbuff.len DO
        drecbuff.data[i] := strbuff[i];

    { Clear the rest of the buffer }
    while i < REC_LGTH DO
    BEGIN
        i := i + 1;
        drecbuff.data[i] := ' '
    END
END;

Organizing Files for Performance

CHAPTER OBJECTIVES

• Look at several approaches to data compression.
• Look at storage compaction as a simple way of reusing space in a file.
• Develop a procedure for deleting fixed-length records that allows vacated file space to be reused dynamically.
• Illustrate the use of linked lists and stacks to manage an avail list.
• Consider several approaches to the problem of deleting variable-length records.
• Introduce the concepts associated with the terms internal fragmentation and external fragmentation.
• Outline some placement strategies associated with the reuse of space in a variable-length record file.
• Provide an introduction to the idea underlying a binary search.
• Undertake an examination of the limitations of binary searching.
• Develop a keysort procedure for sorting larger files; investigate the costs associated with keysort.
• Introduce the concept of a pinned record.
CHAPTER OUTLINE

5.1 Data Compression
    5.1.1 Using a Different Notation
    5.1.2 Suppressing Repeating Sequences
    5.1.3 Assigning Variable-length Codes
    5.1.4 Irreversible Compression Techniques
    5.1.5 Compression in UNIX
5.2 Reclaiming Space in Files
    5.2.1 Record Deletion and Storage Compaction
    5.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically
    5.2.3 Deleting Variable-length Records
    5.2.4 Storage Fragmentation
    5.2.5 Placement Strategies
5.3 Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching
    5.3.1 Finding Things in Simple Field and Record Files
    5.3.2 Search by Guessing: Binary Search
    5.3.3 Binary Search versus Sequential Search
    5.3.4 Sorting a Disk File in RAM
    5.3.5 The Limitations of Binary Searching and Internal Sorting
5.4 Keysorting
    5.4.1 Description of the Method
    5.4.2 Limitations of the Keysort Method
    5.4.3 Another Solution: Why Bother to Write the File Back?
    5.4.4 Pinned Records

We have already seen how important it is for the file system designer to consider how a file is to be accessed when deciding on how to create fields and records and other file structures. In this chapter we continue to focus on file organization, but the motivation is a little different. We look at ways to organize, or in some cases reorganize, files in direct response to the need to improve performance.

In the first section we look at how we organize files to make them smaller. Compression techniques let us make files smaller by encoding the basic information in the file.

Next we look at ways to reclaim unused space in files to improve performance. Compaction is a batch process that we can use to purge holes of unused space from a file that has undergone many deletions and updates. Then we investigate dynamic ways to maintain performance by reclaiming space made available by deletions and updates of records during the life of a file.

In the third section we examine the problem of reorganizing files by sorting them to support simple binary searching. Then, in an effort to find a better sorting method, we begin a conceptual line of thought that will continue throughout the rest of this text: We find a way to improve file performance by creating an external structure through which we can access the file.

5.1 Data Compression
In this section we look at some ways to make files smaller. There are many reasons for making files smaller. Smaller files

• Use less storage, resulting in cost savings;
• Can be transmitted faster, decreasing access time or, alternatively, allowing the same access time with a lower and cheaper bandwidth; and
• Can be processed faster sequentially.

Data compression involves encoding the information in a file in such a way that it takes up less space. Many different techniques are available for compressing data. Some are very general and some are designed only for specific kinds of data, such as speech, pictures, text, or instrument data. The variety of data compression techniques is so large that we can only touch on the topic here, with a few examples.
5.1.1 Using a Different Notation

Remember our address file from Chapter 4? It had several fixed-length fields, including "state," "zip code," and "phone number." Fixed-length fields such as these are good candidates for compression. For instance, the "state" field in the address file required two ASCII bytes, 16 bits. How many bits are really needed for this field? Since there are only 50 states, we could represent all possible states with only six bits. (Why?) Thus, we could encode all state names in a single one-byte field, resulting in a space savings of one byte, or 50%, per occurrence of the state field.

This type of compression technique, in which we decrease the number of bits by finding a more compact notation,* is one of many compression techniques classified as redundancy reduction. The 10 bits that we were able to throw away were redundant in the sense that having 16 bits instead of 6 provided no extra information.

"'"Note that the original two-letter notation

tation tor the full state

name.

we

used for "state"

is

itself a

more compact no-

86

ORGANIZING FILES FOR PERFORMANCE

What are the costs of this compression scheme? In this case, there are many:

• By using a pure binary encoding, we have made the file unreadable by humans.
• We incur some cost in encoding time whenever we add a new state-name field to our file, and a similar cost for decoding when we need to get a readable version of state name from the file.
• We must also now incorporate the encoding and/or decoding modules in all software that will process our address file, increasing the complexity of the software.

With so many costs, is this kind of compression worth it? We can answer this only in the context of a particular application. If the file is already fairly small, if the file is often accessed by many different pieces of software, and if some of the software that will access the file cannot deal with binary data (e.g., an editor), then this form of compression is a bad idea. On the other hand, if the file contains several million records and is generally processed by one program, compression is probably a very good idea. Since the encoding and decoding algorithms for this kind of compression are extremely simple, the savings in access time is likely to exceed any processing time required for encoding or decoding.
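To make the mechanics concrete, here is a minimal sketch, in C, of how such a compact notation could be implemented with a simple table lookup. The table contents and the function names are our own illustrative assumptions rather than code from the address file programs in this chapter; a real table would, of course, list all 50 states.

/* Hypothetical sketch: encode a two-letter state abbreviation as a
   single one-byte code (its index in a table), and decode it again.
   The table shown here is abbreviated; a full one would hold all 50
   entries, which still fit comfortably in six bits.               */

#include <string.h>

static char *state_table[] = { "AL", "AK", "AZ", "AR", "CA", "OK", "TX" };
#define N_STATES 7            /* would be 50 with the full table */

/* return the one-byte code for a state, or -1 if it is not found */
int encode_state(char *abbrev)
{
    int i;
    for (i = 0; i < N_STATES; i++)
        if (strncmp(state_table[i], abbrev, 2) == 0)
            return i;
    return -1;
}

/* recover the readable two-letter abbreviation from the code */
char *decode_state(int code)
{
    return (code >= 0 && code < N_STATES) ? state_table[code] : "??";
}

Writing a record would then store the one-byte code returned by encode_state in place of the two ASCII characters, and any program needing a readable state name would call decode_state.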

5.1.2 Suppressing Repeating Sequences

Imagine an 8-bit image of the sky that has been processed so only objects above a certain brightness are identified and all other regions of the image are set to some background color represented by the pixel value 0. (See Fig. 5.1.)

Sparse arrays of this sort are very good candidates for compression of a sort called run-length encoding, which in this example works as follows. First, we choose one special, unused byte value to indicate that a run-length code follows. Then, the run-length encoding algorithm goes like this:

• Read through the pixels that make up the image, copying the pixel values to the file in sequence, except where the same pixel value occurs more than once in succession.
• Where the same value occurs more than once in succession, substitute the following three bytes, in order:
    The special run-length code indicator;
    The pixel value that is repeated; and
    The number of times that the value is repeated (up to 256 times).

FIGURE 5.1 The empty space in this astronomical image is represented by repeated sequences of the same value and is thus a good candidate for compression. (This FITS image shows a radio continuum structure around the spiral galaxy NGC 891 as observed with the Westerbork Synthesis radio telescope in The Netherlands.)

For example, suppose we wish to compress an image using run-length encoding, and we find that we can omit the byte 0xff from the representation of the image. We would choose the byte 0xff as our run-length code indicator. How would we encode the following sequence of hexadecimal byte values?

22 23 24 24 24 24 24 24 24 25 26 26 26 26 26 26 25 24

The first three pixels are to be copied in sequence. The runs of 24 and 26 are both run-length encoded. The remaining pixels are copied in sequence. The resulting sequence is

22 23 ff 24 07 25 ff 26 06 25 24

Run-length encoding is another example of redundancy reduction. (Why?) It can be applied to many kinds of data, including text, instrument data, and sparse matrices. Like the compact notation approach, the run-length encoding algorithm is a simple one whose associated costs rarely affect performance appreciably.

Unlike compact notation, run-length encoding does not guarantee any particular amount of space savings. A "busy" image with a lot of variation will not benefit appreciably from run-length encoding. Indeed, under some circumstances, the aforementioned algorithm could result in a "compressed" image that is larger than the original image. (Why? Can you prevent this?)
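Here is a minimal sketch, in C, of the run-length encoding procedure just described. The function name and buffer conventions are our own assumptions; only the use of 0xff as the run-length code indicator comes from the example above, and for simplicity the count byte is capped at 255 rather than 256.

/* Hypothetical sketch of run-length encoding as described above:
   a run of an identical byte value is replaced by the three bytes
   <indicator> <value> <count>.  Assumes the indicator value 0xff
   never occurs in the input.  Returns the number of bytes written
   to out[].                                                       */

#define RUN_CODE 0xff

int rle_encode(unsigned char in[], int n, unsigned char out[])
{
    int i = 0, j = 0;

    while (i < n) {
        int run = 1;
        while (i + run < n && in[i + run] == in[i] && run < 255)
            run++;                     /* measure the run length       */
        if (run > 1) {                 /* encode the run as three bytes */
            out[j++] = RUN_CODE;
            out[j++] = in[i];
            out[j++] = (unsigned char) run;
        } else                         /* single value: copy it through */
            out[j++] = in[i];
        i += run;
    }
    return j;
}

Applied to the eighteen-byte sequence in the example above, this sketch produces exactly the eleven-byte encoded sequence shown.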

5.1.3 Assigning Variable-length Codes

Suppose you have two different symbols to use in an encoding scheme: a dot (".") and a dash ("-"). You have to assign combinations of dots and dashes to letters of the alphabet. If you are very clever, you might determine the most frequently occurring letters of the alphabet (e and t) and use a single dot for one and a single dash for the other. Other letters of the alphabet will be assigned two or more symbols, with the more frequently occurring letters getting fewer symbols.

Sound familiar? You may recognize this scheme as the oldest and most common of the variable-length codes, the Morse code. Variable-length codes, in general, are based on the principle that some values occur more frequently than others, so the codes for those values should take the least amount of space. Variable-length codes are another form of redundancy reduction.

A variation on the compact notation technique, the Morse code can be implemented using a table lookup, where the table never changes. In contrast, since many sets of data values do not exhibit a predictable frequency distribution, more modern variable-length coding techniques dynamically build the tables that describe the encoding scheme. One of the most successful of these is the Huffman code, which determines the probabilities of each value occurring in the data set, and then builds a binary tree in which the search path for each value represents the code for that value. More frequently occurring values are given shorter search paths in the tree. This tree is then turned into a table, much like a Morse code table, that can be used to encode and decode the data.

For example, suppose we have a data set containing only the seven letters shown in Fig. 5.2, and each letter occurs with the probability indicated. The third row in the figure shows the Huffman codes that would be assigned to the letters. Based on Fig. 5.2, the string "abde" would be encoded as "101000000001."

FIGURE 5.2 Example showing the Huffman encoding for a set of seven letters, assuming certain probabilities. (From Lynch, 1985.)

Letter:        a     b     c     d     e     f     g
Probability:   0.4   0.1   0.1   0.1   0.1   0.1   0.1
Code:          1     010   011   0000  0001  0010  0011
DATA COMPRESSION

In the example, the letter a occurs

others, so

it

number of

bits

case as

many

much more

assigned the one-bit code

is

1.

as four bits are required.

This

is

minimum

letters is three, yet in this

necessary trade-off to insure

that the distinct codes can be stored together, without delimiters

them, and

89

often than any of the

Notice that the

needed to represent these seven

between

be recognized.

still
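As a concrete illustration of applying such a code table, here is a minimal C sketch that encodes a string using the fixed table of Fig. 5.2. The names are ours, and the table is simply copied from the figure; building the table automatically is the tree-construction step described above, which this sketch does not attempt.

#include <stdio.h>
#include <string.h>

/* Code table taken directly from Fig. 5.2, for the letters a through g. */
static const char *huffman_code[] = {
    "1", "010", "011", "0000", "0001", "0010", "0011"
};

/* Append the code for each letter of 'text' to 'out'. */
void encode(const char *text, char *out)
{
    out[0] = '\0';
    for (; *text != '\0'; text++) {
        if (*text >= 'a' && *text <= 'g')
            strcat(out, huffman_code[*text - 'a']);
    }
}

int main(void)
{
    char bits[256];
    encode("abde", bits);
    printf("%s\n", bits);   /* prints 101000000001, as in the text */
    return 0;
}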

5.1.4 Irreversible Compression Techniques

The techniques we have discussed so far preserve all information in the original data. In effect, they take advantage of the fact that the data, in its original form, contains redundant information that can be removed and then reinserted at a later time. Another type of compression, irreversible compression, is based on the assumption that some information can be sacrificed.*

An example of irreversible compression would be shrinking a raster image from, say, 400-by-400 pixels to 100-by-100 pixels. The new image contains one pixel for every 16 pixels in the original image, and there is no way, in general, to determine what the original pixels were from the one new pixel.

Irreversible compression is less common in data files than reversible compression, but there are times when the information that is lost is of little or no value. For example, speech compression is often done by voice coding, a technique that transmits a parameterized description of speech, which can be synthesized at the receiving end with varying amounts of distortion.

*Irreversible compression is sometimes called "entropy reduction" to emphasize that the average information (entropy) is reduced.

5.1.5 Compression in UNIX

Both Berkeley and System V UNIX provide compression routines that are heavily used and quite effective. System V has routines called pack and unpack, which use Huffman codes on a byte-by-byte basis. Typically, pack achieves 25 to 40% reduction on text files, but appreciably less on binary files that have a more uniform distribution of byte values. When pack compresses a file, it automatically appends a ".z" to the end of the packed file, signalling to any future user that the file has been compressed using the standard compression algorithm.

Berkeley UNIX has routines called compress and uncompress, which use an effective dynamic method called Lempel-Ziv (Welch, 1984). Except for using different compression schemes, compress and uncompress behave almost the same as pack and unpack.* Compress appends a ".Z" to the end of files it has compressed.

Since these routines are readily available on UNIX systems and are very effective general-purpose routines, it is wise to use them whenever there are not compelling reasons to use other techniques.

*Many implementations of System V UNIX also support compress and uncompress as Berkeley extensions.

5.2 Reclaiming Space in Files

Suppose a record in a variable-length record file is modified in such a way that the new record is longer than the original record. What do you do with the extra data? You could append it to the end of the file and put a pointer from the original record space to the extension of the record. Or you could rewrite the whole record at the end of the file (unless the file needs to be sorted), leaving a hole at the original location of the record. Each solution has a drawback: In the former case, the job of processing the record is more awkward and slower than it was originally; in the latter case, the file contains wasted space.

In this section we take a close look at the way file organization deteriorates as a file is modified. In general, modifications can take any one of three forms:

- Record addition;
- Record updating; and
- Record deletion.

If the only kind of change to a file is record addition, there is no deterioration of the kind we cover in this chapter. It is only when variable-length records are updated, or when either fixed- or variable-length records are deleted, that maintenance issues become complicated and interesting. Since record updating can always be treated as a record deletion followed by a record addition, our focus is on the effects of record deletion. When a record has been deleted, we want to reuse the space.

5.2.1 Record Deletion and Storage Compaction

Storage compaction makes files smaller by looking for places in a file where there is no data at all, and then recovering this space. Since empty spaces occur in files when we delete records, we begin our discussion of compaction with a look at record deletion.

Any record-deletion strategy must provide some way for us to recognize records as deleted. A simple and usually workable approach is to place a special mark in each deleted record. For example, in the name and address file developed in Chapter 4, we might place an asterisk as the first field in a deleted record. Figures 5.3(a) and 5.3(b) show a name and address file similar to the one in Chapter 4 before and after the second record is marked as deleted. (The dots at the ends of the records represent padding between the last field and the end of each record.)

Once we are able to recognize a record as deleted, the next question is how to reuse the space from the record. Approaches to this problem that rely on storage compaction do nothing at all to reuse the space for a while. The records are simply marked as deleted and left in the file for a period of time. Programs using the file must include logic that causes them to ignore records that are marked as deleted. One nice side effect of this approach is that it is usually possible to allow the user to "undelete" a record with very little effort. This is particularly easy if you keep the deleted mark in a special field, rather than destroy some of the original data, as in our example.

The reclamation of the space from the deleted records happens all at once. After deleted records have accumulated for some time, a special program is used to reconstruct the file with all the deleted records squeezed out (Fig. 5.3c). If there is enough space, the simplest way to do this is through a file copy program that skips over the deleted records. It is also possible, though more complicated and time-consuming, to do the compaction in place. Either of these approaches can be used with both fixed- and variable-length records.

FIGURE 5.3 Storage requirements of sample file using 64-byte fixed-length records. (a) Before deleting the second record. (b) After deleting the second record. (c) After compaction -- the second record is gone.

Ames|John|123 Maple|Stillwater|OK|74075|.....................
Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
Brown|Martha|625 Kimbark|Des Moines|IA|50311|................
(a)

Ames|John|123 Maple|Stillwater|OK|74075|.....................
*|rrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
Brown|Martha|625 Kimbark|Des Moines|IA|50311|................
(b)

Ames|John|123 Maple|Stillwater|OK|74075|.....................
Brown|Martha|625 Kimbark|Des Moines|IA|50311|................
(c)

The decision about how often to run the storage compaction program can be based on either the number of deleted records or on the calendar. In accounting programs, for example, it often makes sense to run a compaction procedure on certain files at the end of the fiscal year or at some other point associated with closing the books.
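A minimal sketch in C of the file-copy style of compaction, assuming 64-byte fixed-length records marked with an asterisk in the first byte, as in Fig. 5.3. The names and record size are our own choices, and handling of a header record is omitted.

#include <stdio.h>

#define REC_SIZE 64   /* fixed-length records, as in Fig. 5.3 */

/* Copy infile to outfile, dropping records whose first byte is '*'. */
int compact(const char *infile, const char *outfile)
{
    FILE *in  = fopen(infile, "rb");
    FILE *out = fopen(outfile, "wb");
    char rec[REC_SIZE];

    if (in == NULL || out == NULL)
        return -1;
    while (fread(rec, REC_SIZE, 1, in) == 1) {
        if (rec[0] != '*')                    /* skip deleted records */
            fwrite(rec, REC_SIZE, 1, out);
    }
    fclose(in);
    fclose(out);
    return 0;
}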

5.2.2 Deleting Fixed-length Records for Reclaiming Space Dynamically

Storage compaction is the simplest and most widely used of the storage reclamation methods we discuss. There are some applications, however, that are too volatile and interactive for storage compaction to be useful. In these situations we want to reuse the space from deleted records as soon as possible. We begin our discussion of such dynamic storage reclamation with a second look at fixed-length record deletion, since fixed-length records make the reclamation problem much simpler.

In general, to provide a mechanism for record deletion with subsequent reutilization of the freed space, we need to be able to guarantee two things:

- That deleted records are marked in some special way; and
- That we can find the space that deleted records once occupied so we can reuse that space when we add records.

We have already identified a method of meeting the first requirement: We mark records as deleted by putting a field containing an asterisk at the beginning of deleted records.

If you are working with fixed-length records and are willing to search sequentially through a file before adding a record, you can always provide the second guarantee if you have provided the first. Space reutilization can take the form of looking through the file, record by record, until a deleted record is found. If the program reaches the end of the file without finding a deleted record, then the new record can be appended at the end.

Unfortunately, this approach makes adding records an intolerably slow process if the program is an interactive one and the user has to sit at the terminal and wait as the record addition takes place. To make record reuse happen more quickly, we need

- A way to know immediately if there are empty slots in the file; and
- A way to jump directly to one of those slots if they exist.

Linked Lists   The use of a linked list for stringing together all of the available records can meet both of these needs. A linked list is a data structure in which each element or node contains some kind of reference to its successor in the list. (See Fig. 5.4.)

FIGURE 5.4 A linked list.

If you have a head reference to the first node in the list, you can move through the list by looking at each node, and then at the node's pointer field, so you know where the next node is located. When you finally encounter a pointer field with some special, predetermined end-of-list value, you stop the traversal of the list. In Fig. 5.4 we use a -1 in the pointer field to mark the end of the list.

When a list is made up of deleted records that have become available space within the file, the list is usually called an avail list. When inserting a new record into a fixed-length record file, any one available record is just as good as any other. There is no reason to prefer one open slot over another since all the slots are the same size. It follows that there is no reason for ordering the avail list in any particular way. (As we see later, this situation changes for variable-length records.)

Stacks   The simplest way to handle a list is to treat it as a stack. A stack is a list in which all insertions and removals of nodes take place at one end of the list. So, if we have an avail list managed as a stack that contains relative record numbers (RRNs) 5 and 2, and then add RRN 3, it looks like this before and after the addition of the new node:

    Head pointer --> 5 --> 2              Head pointer --> 3 --> 5 --> 2
      (before adding RRN 3)                 (after adding RRN 3)

When a new node is added to the top or front of a stack, we say that it is pushed onto the stack. If the next thing that happens is a request for some available space, the request is filled by taking RRN 3 from the avail list.

This is called popping the stack. The list returns to a state in which it contains only records 5 and 2.

Linking and Stacking Deleted Records   Now we can meet the two criteria for rapid access to reusable space from deleted records. We need

- A way to know immediately if there are empty slots in the file; and
- A way to jump directly to one of those slots if they exist.

Placing the deleted records on a stack meets both criteria. If the pointer to the top of the stack contains the end-of-list value, then we know that there are not any empty slots and that we have to add new records by appending them to the end of the file. If the pointer to the stack top contains a valid node reference, then we know not only that a reusable slot is available, but also exactly where to find it.

Where do we keep the stack? Is it a separate list, perhaps maintained in a separate file, or is it somehow embedded within the data file? Once again, we need to be careful to distinguish between physical and conceptual structures. The deleted, available records are not actually moved anywhere when they are pushed onto the stack. They stay right where we need them, located in the file. The stacking and linking is done by arranging and rearranging the links used to make one available record slot point to the next. Since we are working with fixed-length records in a disk file, rather than with memory addresses, the pointing is not done with pointer variables in the formal sense, but through relative record numbers (RRNs).

Suppose we are working with a fixed-length record file that once contained seven records (RRNs 0-6). Furthermore, suppose that records 3 and 5 have been deleted, in that order, and that deleted records are marked by replacing the first field with an asterisk. We can then use the second field of a deleted record to hold the link to the next record on the avail list. Leaving out the details of the valid, in-use records, Fig. 5.5(a) shows how the file might look.

Record 5 is the first record on the avail list (top of the stack) since it is the record that is most recently deleted. Following the linked list, we see that record 5 points to record 3. Since the link field for record 3 contains -1, which is our end-of-list marker, we know that record 3 is the last slot available for reuse.

Figure 5.5(b) shows the same file after record 1 is also deleted. Note that the contents of all the other records on the avail list remain unchanged. Treating the list as a stack results in a minimal amount of list reorganization when we push and pop records to and from the list.

If we now add a new name to the file, it is placed in record 1, since RRN 1 is the first available record. The avail list would return to the configuration shown in Fig. 5.5(a). Since there are still two record slots on the avail list, we could add two more names to the file without increasing the size of the file. After that, however, the avail list would be empty (Fig. 5.5c). If yet another name is added to the file, the program knows that the avail list is empty and that the addition of the name requires the addition of a new record at the end of the file.

FIGURE 5.5 Sample file showing linked lists of deleted records. (a) After deletion of records 3 and 5, in that order. (b) After deletion of records 3, 5, and 1, in that order. (c) After insertion of three new records.

(a)  List head (first available record) --> 5
     RRN:  0        1      2      3     4        5     6
           Edwards  Bates  Wills  *-1   Masters  *3    Chavez

(b)  List head (first available record) --> 1
     RRN:  0        1      2      3     4        5     6
           Edwards  *5     Wills  *-1   Masters  *3    Chavez

(c)  List head (first available record) --> -1
     RRN:  0        1            2      3            4        5            6
           Edwards  1st new rec  Wills  3rd new rec  Masters  2nd new rec  Chavez

Implementing Fixed-length Record Deletion   Implementing mechanisms that place deleted records on a linked avail list and that treat the avail list as a stack is relatively straightforward. We need a suitable place to keep the RRN of the first available record on the avail list. Since this is information that is specific to the data file, it can be carried in a header record at the start of the file.

When we delete a record we must be able to mark the record as deleted, and then place it on the avail list. A simple way to do this is to place an * (or some other special mark) at the beginning of the record as a deletion mark, followed by the RRN of the next record on the avail list.

Once we have a list of available records within a file, we can reuse the space previously occupied by deleted records. For this we would write a single function that returns either (1) the RRN of a reusable record slot, or (2) the RRN of the next record to be appended if no reusable slots are available.
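A sketch of what these two operations might look like in C follows. The names, the 64-byte record size, the binary link field (the figures show the link as a character field), and the use of the first REC_SIZE bytes of the file as the header are our own illustrative assumptions; the file is assumed to be open for update ("r+b").

#include <stdio.h>
#include <string.h>

#define REC_SIZE 64   /* record i begins after the header, at byte (i + 1) * REC_SIZE */

/* The header holds the RRN of the first available slot; -1 means the list is empty. */
static long read_head(FILE *fp)
{
    long head;
    fseek(fp, 0L, SEEK_SET);
    fread(&head, sizeof head, 1, fp);
    return head;
}

static void write_head(FILE *fp, long head)
{
    fseek(fp, 0L, SEEK_SET);
    fwrite(&head, sizeof head, 1, fp);
}

/* Mark the record as deleted and push its RRN onto the avail list. */
void delete_record(FILE *fp, long rrn)
{
    char rec[REC_SIZE];
    long old_head = read_head(fp);

    memset(rec, 0, REC_SIZE);
    rec[0] = '*';                                  /* deletion mark          */
    memcpy(rec + 1, &old_head, sizeof old_head);   /* link to next on list   */
    fseek(fp, (rrn + 1) * (long)REC_SIZE, SEEK_SET);
    fwrite(rec, REC_SIZE, 1, fp);
    write_head(fp, rrn);                           /* new top of the stack   */
}

/* Return the RRN to use for a new record: pop the avail list if it is not
   empty, otherwise return the RRN just past the end of the file.           */
long get_avail_rrn(FILE *fp, long rec_count)
{
    char rec[REC_SIZE];
    long head = read_head(fp);
    long next;

    if (head == -1)
        return rec_count;                /* append at the end of the file */
    fseek(fp, (head + 1) * (long)REC_SIZE, SEEK_SET);
    fread(rec, REC_SIZE, 1, fp);
    memcpy(&next, rec + 1, sizeof next);
    write_head(fp, next);                /* pop: next slot becomes the top */
    return head;
}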

5.2.3 Deleting Variable-length Records

Now that we have a mechanism for handling an avail list of available space once records are deleted, let's apply this mechanism to the more complex problem of reusing space from deleted variable-length records. We have seen that to support record reuse through an avail list, we need

- A way to link the deleted records together into a list (i.e., a place to put a link field);
- An algorithm for adding newly deleted records to the avail list; and
- An algorithm for finding and removing records from the avail list when we are ready to use them.

An Avail List of Variable-length Records   What kind of file structure do we need to support an avail list of variable-length records? Since we will want to delete whole records and then place records on an avail list, we need a structure in which the record is a clearly defined entity. The file structure in which we define the length of each record by placing a byte count of the record contents at the beginning of each record will serve us well in this regard.

We can handle the contents of a deleted variable-length record just as we did with fixed-length records. That is, we can place a single asterisk in the first field, followed by a binary link field pointing to the next deleted record on the avail list. The avail list itself can be organized just as it was with fixed-length records, but with one difference: We cannot use relative record numbers (RRNs) for links. Since we cannot compute the byte offset of variable-length records from their RRNs, the links must contain the byte offsets themselves.

To illustrate, suppose we begin with a variable-length record file containing the three records for Ames, Morrison, and Brown introduced earlier. Figure 5.6(a) shows what the file looks like (minus the header) before any deletions, and Fig. 5.6(b) shows what it looks like after the deletion of the second record. The periods in the deleted record signify discarded characters.

FIGURE 5.6 A sample file for illustrating variable-length record deletion. (a) Original sample file stored in variable-length format with byte count (header record not included). (b) Sample file after deletion of the second record (periods show discarded characters).

HEAD.FIRST_AVAIL: -1
40 Ames|John|123 Maple|Stillwater|OK|74075| 64 Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820| 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
(a)

HEAD.FIRST_AVAIL: 43
40 Ames|John|123 Maple|Stillwater|OK|74075| 64 * -1 ............................................ 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
(b)

Adding and Removing Records   Let's address the questions of adding and removing records to and from the list together, since they are clearly related. With fixed-length records we could access the avail list as a stack because one member of the avail list is just as usable as any other. That is not true when the record slots on the avail list differ in size, as they do in a variable-length record file. We now have an extra condition that must be met before we can reuse a record: The record must be the right size. For the moment we define right size as "big enough." Later we find that it is sometimes useful to be more particular about the meaning of right size.

It is possible, even likely, that we need to search through the avail list for a record slot that is the right size. We can't just pop the stack and expect the first available record to be big enough. Finding a proper slot on the avail list now means traversing the list until a record slot is found that is big enough to hold the new record that is to be inserted.

For example, suppose the avail list contains the deleted record slots shown in Fig. 5.7(a), and a record that requires 55 bytes is to be added. Since the avail list is not empty, we traverse the records whose sizes are 47 (too small), 38 (too small), and 72 (big enough). Having found a slot big enough to hold our record, we remove it from the avail list by creating a new link that jumps over the record (Fig. 5.7b). If we had reached the end of the avail list before finding a record that was large enough, we would have appended the new record at the end of the file.

FIGURE 5.7 Removal of a record from an avail list with variable-length records. (a) Before removal. (b) After removal.
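A sketch of this first-fit traversal in C, working on an in-memory image of the avail list, might look as follows. The node structure and names are ours; in the file itself the links are byte offsets, as described above.

#include <stddef.h>

/* One entry on the avail list: the size of the deleted slot, its byte
   offset in the file, and a link to the next deleted slot.             */
struct avail_node {
    long size;
    long offset;
    struct avail_node *next;
};

/* First fit: return the first slot big enough for 'needed' bytes and
   unlink it from the list ("create a new link that jumps over it").
   Returns NULL if no slot on the list is large enough.                 */
struct avail_node *get_slot(struct avail_node **head, long needed)
{
    struct avail_node **link = head;

    while (*link != NULL) {
        if ((*link)->size >= needed) {        /* big enough            */
            struct avail_node *slot = *link;
            *link = slot->next;               /* jump over the slot    */
            return slot;
        }
        link = &(*link)->next;                /* too small; keep going */
    }
    return NULL;                              /* append at end of file */
}

With the slot sizes of Fig. 5.7(a) -- 47, 38, and 72 -- and needed equal to 55, the function skips the first two nodes and unlinks the 72-byte slot.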

Since this procedure for finding a reusable record looks through the entire avail list if necessary, we do not need a sophisticated method for putting newly deleted records onto the list. If a record of the right size is somewhere on the list, our get-available-record procedure eventually finds it. It follows that we can continue to push new members onto the front of the list, just as we do with fixed-length records.

Development of algorithms for adding and removing avail list records is left to you as part of the exercises found at the end of this chapter.

5.2.4 Storage Fragmentation

Let's look again at the fixed-length record version of our three-record file (Fig. 5.8). The dots at the ends of the records represent characters we use as padding between the last field and the end of the records. The padding is wasted space; it is part of the cost of using fixed-length records. Wasted space within a record is called internal fragmentation.

FIGURE 5.8 Storage requirements of sample file using 64-byte fixed-length records.

Ames|John|123 Maple|Stillwater|OK|74075|.....................
Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820|
Brown|Martha|625 Kimbark|Des Moines|IA|50311|................

Clearly, we want to minimize internal fragmentation. If we are working with fixed-length records, we attempt such minimization by choosing a record length that is as close as possible to what we need for each record. But unless the actual data is fixed in length, we have to put up with a certain amount of internal fragmentation in a fixed-length record file.

One of the attractions of variable-length records is that they minimize wasted space by doing away with internal fragmentation. The space set aside for each record is exactly as long as it needs to be. Compare the fixed-length example with the one in Fig. 5.9, which uses the variable-length record structure -- a byte count followed by delimited data fields. The only space (other than the delimiters) that is not used for holding data in each record is the count field. If we assume that this field uses two bytes, this amounts to only six bytes for the three-record file. The fixed-length record file wastes 24 bytes in the very first record.

FIGURE 5.9 Storage requirements of sample file using variable-length records with a count field.

40 Ames|John|123 Maple|Stillwater|OK|74075| 64 Morrison|Sebastian|9035 South Hillcrest|Forest Village|OK|74820| 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|

But before we start congratulating ourselves for solving the problem of wasted space due to internal fragmentation, we should consider what happens in a variable-length record file after a record is deleted and replaced with a shorter record. If the shorter record takes less space than the original record, internal fragmentation results. Figure 5.10 shows how the problem could occur with our sample file when the second record in the file is deleted and the following record is added:

Ham|Al|28 Elm|Ada|OK|70332|

FIGURE 5.10 Illustration of fragmentation with variable-length records. (a) After deletion of the second record (unused characters in the deleted record are replaced by periods). (b) After the subsequent addition of the record for Al Ham.

HEAD.FIRST_AVAIL: 43
40 Ames|John|123 Maple|Stillwater|OK|74075| 64 * -1 ......................................... 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
(a)

HEAD.FIRST_AVAIL: -1
40 Ames|John|123 Maple|Stillwater|OK|74075| 64 Ham|Al|28 Elm|Ada|OK|70332|.................. 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|
(b)

It appears that escaping internal fragmentation is not so easy. The slot vacated by the deleted record is 37 bytes larger than is needed for the new record. Since we treat the extra 37 bytes as part of the new record, they are not on the avail list and are therefore unusable. But instead of keeping the 64-byte record slot intact, suppose we break it into two parts: one part to hold the new Ham record, and the other to be placed back on the avail list. Since we would take only as much space as necessary for the Ham record, there would be no internal fragmentation.

Figure 5.11 shows what our file looks like if we use this approach to insert the record for Al Ham. We steal the space for the Ham record from the end of the 64-byte slot and leave the first 35 bytes of the slot on the avail list. (The available space is 35 rather than 37 bytes because we need two bytes to form the size field for the new Ham record.) The 35 bytes still on the avail list can be used to hold yet another record.

FIGURE 5.11 Combatting internal fragmentation by putting the unused part of the deleted slot back on the avail list.

HEAD.FIRST_AVAIL: 43
40 Ames|John|123 Maple|Stillwater|OK|74075| 35 * -1 ............................ 26 Ham|Al|28 Elm|Ada|OK|70332| 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|

Figure 5.12 shows the effect of inserting the following 25-byte record:

Lee|Ed|Rt 2|Ada|OK|74820|

As we would expect, the new record is carved out of the 35-byte record that is on the avail list. The data portion of the new record requires 25 bytes, and then we need two more bytes for another size field. This leaves eight bytes in the record still on the avail list.

FIGURE 5.12 Addition of the second record into the slot originally occupied by a single deleted record.

HEAD.FIRST_AVAIL: 43
40 Ames|John|123 Maple|Stillwater|OK|74075| 8 * -1 ... 25 Lee|Ed|Rt 2|Ada|OK|74820| 26 Ham|Al|28 Elm|Ada|OK|70332| 45 Brown|Martha|625 Kimbark|Des Moines|IA|50311|

What are the chances of finding a record that can make use of these eight bytes? Our guess would be that the probability is close to zero. These eight bytes are not usable, even though they are not trapped inside any other record. This is an example of external fragmentation. The space is actually on the avail list rather than being locked inside some other record, but is too fragmented to be reused.

There are some interesting ways to combat external fragmentation. One way, which we discussed at the beginning of this chapter, is storage compaction. We could simply regenerate the file when external fragmentation becomes intolerable. Two other approaches are as follows:

- If two record slots on the avail list are physically adjacent, combine them to make a single, larger record slot. This is called coalescing the holes in the storage space.
- Try to minimize fragmentation before it happens by adopting a placement strategy that the program can use as it selects a record slot from the avail list.

Coalescing holes presents some interesting problems. The avail list is not kept in physical record order; if there are two deleted records that are physically adjacent, there is no reason to presume that they are linked adjacent to each other on the avail list. Exercise 15 at the end of this chapter provides a discussion of this problem along with a framework for developing a solution.

The development of better placement strategies, however, is a different matter. It is a topic that warrants separate discussion, since the choice among alternative strategies is not as obvious as it might seem at first glance.

5.2.5 Placement Strategies

Earlier we discussed ways to add and remove variable-length records from an avail list. We add records by treating the avail list as a stack, putting deleted records at the front. When we need to remove a record slot from the avail list (to add a record to the file), we look through the list, starting at the beginning, until we either find a record slot that is big enough or reach the end of the list.

This is called a first-fit placement strategy. The least possible amount of work is expended when we place newly available space on the list, and we are not very particular about the closeness of fit as we look for a record slot to hold a new record. We accept the first available record slot that will do the job, regardless of whether the slot is 10 times bigger than what is needed or whether it is a perfect fit.

We could, of course, develop a more orderly approach for placing records on the avail list, keeping them in either ascending or descending sequence by size. Rather than always putting the newly deleted records at the front of the list, these approaches involve moving through the list, looking for the place to insert the record to maintain the desired sequence.

If we order the avail list in ascending order by size, what is the effect on the closeness of fit of the records that are retrieved from the list? Since the retrieval procedure searches sequentially through the avail list until it encounters a record that is big enough to hold the new record, the first record encountered is the smallest record that will do the job. The fit between the available slot and the new record's needs would be as close as we can make it. This is called a best-fit placement strategy.
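The only change from first fit is in how a freed slot is put back on the list. A small sketch in C, again using an avail_node structure of our own devising (the same one as in the earlier first-fit sketch), of inserting a slot so that the list stays in ascending order by size:

/* Same illustrative node structure as in the earlier sketch. */
struct avail_node {
    long size;
    long offset;
    struct avail_node *next;
};

/* Insert a freed slot so the avail list stays in ascending order by size;
   the first-fit traversal shown earlier then returns the smallest slot
   that is big enough -- that is, the best fit.                            */
void put_slot_best_fit(struct avail_node **head, struct avail_node *slot)
{
    struct avail_node **link = head;

    while (*link != NULL && (*link)->size < slot->size)
        link = &(*link)->next;       /* walk until a larger slot is found */
    slot->next = *link;
    *link = slot;
}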
A best-fit strategy is intuitively appealing. There is, of course, a price to be paid for obtaining this fit. We end up having to search through at least a part of the list not only when we get records from the list, but also when we put newly deleted records on the list. In a real-time environment the extra processing time could be significant.

A less obvious disadvantage of the best-fit strategy is related to the idea of finding the best possible fit: The free area left over after inserting a new record into a slot is as small as possible. Often this remaining space is too small to be useful, resulting in external fragmentation. Furthermore, the slots that are least likely to be useful are the ones that will be placed toward the beginning of the list, making first-fit searches increasingly long as time goes on.

These problems suggest an alternative strategy: What if we arrange the avail list so it is in descending order by size? Then the largest record slot on the avail list would always be at the head of the list. Since the procedure that retrieves records starts its search at the beginning of the avail list, it always returns the largest available record slot if it returns any slot at all. This is known as a worst-fit placement strategy. The amount of space in the record slot beyond what is actually needed is as large as possible.

A worst-fit strategy does not, at least initially, sound very appealing. But consider the following:

- The procedure for removing records can be simplified so it looks only at the first element of the avail list. If the first record slot is not large enough to do the job, none of the others will be.
- By extracting the space we need from the largest available slot, we are assured that the unused portion of the slot is as large as possible, decreasing the likelihood of external fragmentation.

What can you conclude from all of this? It should be clear that no one placement strategy is superior for all circumstances. The best you can do is formulate a series of general observations and then, given a particular design situation, try to select the strategy that seems most appropriate. Here are some suggestions. The judgment will have to be yours.

- Placement strategies make sense only with regard to volatile, variable-length record files. With fixed-length records, placement is simply not an issue.
- If space is lost due to internal fragmentation, then the choice is between first fit and best fit. A worst-fit strategy truly makes internal fragmentation worse.
- If the space is lost due to external fragmentation, then one should give careful consideration to a worst-fit strategy.

5.3 Finding Things Quickly: An Introduction to Internal Sorting and Binary Searching

This text begins with a discussion of the cost of accessing secondary storage. You may remember that the magnitude of the difference between accessing RAM and seeking information on a fixed disk is such that, if we magnify the time for a RAM access to 20 seconds, a similarly magnified disk access would take 58 days.

So far we have not had to pay much attention to this cost. This section, then, marks a kind of turning point. Once we move from fundamental organizational issues to the matter of searching a file for a particular piece of information, the cost of a seek becomes a major factor in determining our approach. And what is true for searching is all the more true for sorting. If you have studied sorting algorithms, you know that even a good sort involves making many comparisons. If each of these comparisons involves a seek, the sort is agonizingly slow.

Our discussion of sorting and searching, then, goes beyond simply getting the job done. We develop approaches that minimize the number of disk accesses and that therefore minimize the amount of time expended. This concern with minimizing the number of seeks continues to be a major focus throughout the rest of this text. This is just the beginning of a quest for ways to order and find things quickly.

5.3.1 Finding Things in Simple Field and Record Files

All of the programs we have written up to this point, despite any other strengths they offer, share a major failing: The only way to retrieve or find a record with any degree of rapidity is to look for it by relative record number (RRN). If the file has fixed-length records, knowing the RRN lets us compute the record's byte offset and jump to it using direct access.

But what if we do not know the byte offset or RRN of the record we want? How likely is it that a question about this file would take the form, "What is the record stored in RRN 23?" Not very likely, of course. We are much more likely to know the identity of a record by its key, and the question is more likely to take the form, "What is the record for Bill Kelly?"

Given the methods of organization developed so far, access by key implies a sequential search. What if there is no record containing the requested key? Then we would have to look through the entire file. What if we suspect that there might be more than one record that contains the key, and we want to find them all? Once again, we would be doomed to looking at every record in the file. Clearly, we need to find a better way to handle keyed access. Fortunately, there are many better ways.

5.3.2 Search by Guessing: Binary Search

Suppose we are looking for a record for Bill Kelly in a file of 1,000 fixed-length records, and suppose the file is sorted so the records appear in ascending order by key. We start by comparing KELLY BILL (the canonical form of the search key) with the middle key in the file, which is the key whose RRN is 500. The result of the comparison tells us which half of the file contains Bill Kelly's record. Next, we compare KELLY BILL with the middle key among records in the selected half of the file to find out which quarter of the file Bill Kelly's record is in. This process is repeated until either Bill Kelly's record is found or we have narrowed the number of potential records to zero.

This kind of searching is called binary searching. An algorithm for binary searching is shown in Fig. 5.13. Binary searching takes at most 10 comparisons to find Bill Kelly's record, if it is in the file, or to determine that it is not in the file. Compare this with a sequential search for the record. If there are 1,000 records, then it takes at most 1,000 comparisons to find a given record (or establish that it is not present); on the average, 500 comparisons are needed.

5.3.3 Binary Search versus Sequential Search

In general, a binary search of a file with n records takes at most ⌊log n⌋ + 1 comparisons* and on average approximately ⌊log n⌋ + 1/2 comparisons. A binary search is therefore said to be O(log n). In contrast, you may recall that a sequential search of the same file requires at most n comparisons, and on average 1/2 n, which is to say that a sequential search is O(n).

*In this text, log x refers to the logarithm function to the base 2. When any other base is intended, it is so indicated.

FIGURE 5.13 The bin_search() function in pseudocode.

/* function to perform a binary search in the file associated with the
   logical name INPUT. Assumes that INPUT contains RECORD_COUNT records.
   Searches for the key KEY_SOUGHT. Returns RRN of record containing
   key if the key is found; otherwise returns -1                        */

FUNCTION: bin_search(INPUT, KEY_SOUGHT, RECORD_COUNT)

    LOW  := 0                     /* initialize lower bound for searching   */
    HIGH := RECORD_COUNT - 1      /* initialize upper bound -- we subtract 1
                                     from the count since RRNs start from 0 */

    while (LOW <= HIGH)
        GUESS := (LOW + HIGH) / 2              /* find midpoint            */
        read record with RRN of GUESS
        place canonical form of key from record GUESS into KEY_FOUND
        if (KEY_SOUGHT < KEY_FOUND)
            HIGH := GUESS - 1                  /* GUESS is too high, so    */
                                               /* reduce upper bound       */
        else if (KEY_SOUGHT > KEY_FOUND)
            LOW := GUESS + 1                   /* GUESS is too low, so     */
                                               /* increase lower bound     */
        else
            return (GUESS)                     /* match -- return the RRN  */
    endwhile

    return (-1)           /* if loop completes, then key was not found */

end FUNCTION
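A C rendering of the logic of Fig. 5.13, searching a sorted file of fixed-length records directly, might look like the following sketch. The 64-byte record size, the assumption that the leading bytes of each record hold an already-canonical key, the absence of a header record, and all of the names are our own illustrative choices.

#include <stdio.h>
#include <string.h>

#define REC_SIZE 64
#define KEY_SIZE 20   /* leading KEY_SIZE bytes of each record hold the key */

/* Binary search a sorted file of fixed-length records for 'key'.
   Returns the RRN of the matching record, or -1 if it is not found.  */
long bin_search(FILE *fp, const char *key, long record_count)
{
    char rec[REC_SIZE];
    long low = 0, high = record_count - 1;

    while (low <= high) {
        long guess = (low + high) / 2;
        int cmp;

        fseek(fp, guess * (long)REC_SIZE, SEEK_SET);
        fread(rec, REC_SIZE, 1, fp);
        cmp = strncmp(key, rec, KEY_SIZE);
        if (cmp < 0)
            high = guess - 1;      /* guess is too high */
        else if (cmp > 0)
            low = guess + 1;       /* guess is too low  */
        else
            return guess;          /* match             */
    }
    return -1;                     /* key not found     */
}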


The difference between a binary search and a sequential search becomes even more dramatic as we increase the size of the file to be searched. If we double the number of records in the file, we double the number of comparisons required for sequential search; when binary search is used, doubling the file size adds only one more guess to our worst case. This makes sense, since we know that each guess eliminates half of the possible choices. So, if we tried to find Bill Kelly's record in a file of 2,000 records, it would take at most

1 + ⌊log 2,000⌋ = 11 comparisons,

whereas a sequential search would average

1/2 n = 1,000 comparisons,

and could take up to 2,000 comparisons.

Binary searching is clearly a more attractive way to find things than is sequential searching. But, as you might expect, there is a price to be paid before we can use binary searching: Binary searching works only when the list of records is ordered in terms of the key we are using in the search. So, to make use of binary searching, we have to be able to sort a list on the basis of a key.

Sorting is a very important part of file processing. Next, we look at some simple approaches to sorting files in RAM, at the same time introducing some important new concepts in file structure design. In Chapter 7 we take a second look at sorting, when we deal with some tough problems that occur when files are too large to sort in RAM.

5.3.4 Sorting a Disk File in RAM

Consider the operation of any internal sorting algorithm with which you are familiar. The algorithm requires multiple passes over the list that is to be sorted, comparing and reorganizing the elements. Some of the items in the list are moved a long distance from their original positions in the list. If such an algorithm were applied directly to data stored on a disk, it is clear that there would be a lot of jumping around, seeking, and rereading of data. This would be a very slow operation -- unthinkably slow.

If the entire contents of the file can be held in RAM, a very attractive alternative is to read the entire file from the disk into memory, and then do the sorting there, using an internal sort. We still have to access the data on the disk, but this way we can access it sequentially, sector after sector, without having to incur the cost of a lot of seeking and the cost of multiple passes over the disk.
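A sketch of this read-sort-write pattern in C follows; the record size, the leading-key layout, and all of the names are our own assumptions.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REC_SIZE 64
#define KEY_SIZE 20

static int by_key(const void *a, const void *b)
{
    return strncmp((const char *)a, (const char *)b, KEY_SIZE);
}

/* Read the whole file sequentially, sort the records in RAM, and
   write them back out sequentially.                                */
int sort_file(const char *infile, const char *outfile, long rec_count)
{
    FILE *in  = fopen(infile, "rb");
    FILE *out = fopen(outfile, "wb");
    char *recs = malloc((size_t)rec_count * REC_SIZE);

    if (in == NULL || out == NULL || recs == NULL)
        return -1;
    fread(recs, REC_SIZE, (size_t)rec_count, in);       /* one sequential read  */
    qsort(recs, (size_t)rec_count, REC_SIZE, by_key);    /* internal sort in RAM */
    fwrite(recs, REC_SIZE, (size_t)rec_count, out);      /* one sequential write */
    free(recs);
    fclose(in);
    fclose(out);
    return 0;
}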
This is one instance of a general class of solutions to the problem of
minimizing disk usage: Force your disk access into a sequential mode,
performing the more complex, direct accesses in RAM.
Unfortunately, it is often not possible to use this simple kind of
solution, but when you can, you should take advantage of it. In the case of
sorting, internal sorts are increasingly viable as the amount of RAM space increases. A good illustration of an internal sort is the UNIX sort utility, which sorts files in RAM if it can find enough space. This utility is described in Chapter 7.

5.3.5 The Limitations of Binary Searching and Internal Sorting

Let's look at three problems associated with our "sort, then binary search" approach to finding things.

Problem 1: Binary Searching Requires More than One or Two Accesses   In the average case, a binary search requires approximately ⌊log n⌋ + 1/2 comparisons. If each comparison requires a disk access, a series of binary searches on a list of 1,000 items requires, on the average, 9.5 accesses per request. If the list is expanded to 100,000 items, the average search length extends to 16.5 accesses. Although this is a tremendous improvement over the cost of a sequential search for the key, it is also true that 16 accesses, or even 9 or 10 accesses, is not a negligible cost. The cost of this seeking is particularly noticeable, and objectionable, if we are doing a large enough number of repeated accesses by key.

When we access records by relative record number (RRN) rather than by key, we are able to retrieve a record with a single access. That is an order of magnitude of improvement over the 10 or more accesses that binary searching requires with even a moderately large file. Ideally, we would like to approach RRN retrieval performance, while still maintaining the advantages of access by key. In the following chapter, on the use of index structures, we begin to look at ways to move toward this ideal.

Problem 2: Keeping a File Sorted Is Very Expensive   Our ability to use a binary search has a price attached to it: We must keep the file in sorted order by key. Suppose we are working with a file to which we add records as often as we search for existing records. If we leave the file in unsorted order, doing sequential searches for records, then on the average each search requires reading through half the file. Each record addition, however, is very fast, since it involves nothing more than jumping to the end of the file and writing a record.

If, as an alternative, we keep the file in sorted order, we can cut down substantially on the cost of searching, reducing it to a handful of accesses. But we encounter difficulty when we add a record, since we want to keep all the records in sorted order. Inserting a new record into the file requires, on the average, that we not only read through half the records, but that we also shift the records to open up the space required for the insertion. We are actually doing more work than if we simply do sequential searches on an unsorted file.

The costs of maintaining a file that can be accessed through binary searching are not always as large as in this example involving frequent record addition. For example, it is often the case that searching is required much more frequently than record addition. In such a circumstance, the benefits of faster retrieval can more than offset the costs of keeping the file sorted.

As another example, there are many applications in which record additions can be accumulated in a transaction file and made in a batch mode. By sorting the list of new records before adding them to the main file, it is possible to merge them with the existing records. As we see in Chapter 7, such merging is a sequential process, passing only once over each record in the file.

So, despite its problems, there are situations in which binary searching appears to be a useful strategy. However, knowing the costs of binary searching also lets us see what the requirements will be for better solutions to the problem of finding things by key. Better solutions will have to meet at least one of the following conditions:

- They will not involve reordering of the records in the file when a new record is added; and
- They will be associated with data structures that allow for substantially more rapid, efficient reordering of the file.

In the chapters that follow we develop approaches that fall into each of these categories. Solutions of the first type can involve the use of simple indexes. They can also involve hashing. Solutions of the second type can involve the use of tree structures, such as a B-tree, to keep the file in order.

Problem 3: An Internal Sort Works Only on Small Files   Our ability to use binary searching is limited by our ability to sort the file. An internal sort works only if we can read the entire contents of a file into the computer's electronic memory. If the file is so large that we cannot do that, we need a different kind of sort.

In the following section we develop a variation on internal sorting called a keysort. Like internal sorting, keysort is limited in terms of how large a file it can sort, but its limit is larger. More importantly, our work on keysort begins to illuminate a new approach to the problem of finding things that will allow us to avoid the sorting of records in a file.

5.4 Keysorting
Keysort, sometimes referred to as tag sort, is based on the idea that when we sort a file in RAM the only things that we really need to sort are the record keys; therefore, we do not need to read the whole file into RAM during the sorting process. Instead, we read the keys from the file into RAM, sort them, and then rearrange the records in the file according to the new ordering of the keys.

Since keysort never reads the complete set of records into memory, it can sort larger files than a regular internal sort, given the same amount of RAM.
5.4.1 Description of the Method

To keep things simple, we assume that we are dealing with a fixed-length record file of the kind developed in Chapter 4, with a count of the number of records stored in a header record. We begin by reading the keys into an array of identically sized character fields, with each row of the array containing a key. We call this array KEYNODES[], and we call the key field KEYNODES[].KEY. Figure 5.14 illustrates the relationship between the array KEYNODES[] and the actual file at the time that the keysort procedure begins.

There must, of course, be some way of relating the keys back to the records from which they have been extracted. Consequently, each node of the array KEYNODES[] has a second field KEYNODES[].RRN that contains the RRN of the record associated with the corresponding key.

The actual sorting process simply sorts the KEYNODES[] array according to the KEY field. This produces an arrangement like that shown in Fig. 5.15. The elements of KEYNODES[] are now sequenced in such a way that the first element has the RRN of the record that should be moved to the first position in the file, the second element identifies the record that should be second, and so forth.

FIGURE 5.14 Conceptual view of KEYNODES array to be used in RAM by internal sort routine, and record array on secondary store.

    KEYNODES array (in RAM)             Records (on secondary store)
    KEY                 RRN
    HARRISON SUSAN       1              Harrison|Susan|387 Eastern...
    KELLOG BILL          2              Kellog|Bill|17 Maple...
    HARRIS MARGARET      3              Harris|Margaret|4343 West...
    BELL ROBERT          4              Bell|Robert|8912 Hill...

FIGURE 5.15 Conceptual view of KEYNODES array and file after sorting keys in RAM.

    KEYNODES array (in RAM)             Records (on secondary store)
    KEY                 RRN
    BELL ROBERT          4              Harrison|Susan|387 Eastern...
    HARRIS MARGARET      3              Kellog|Bill|17 Maple...
    HARRISON SUSAN       1              Harris|Margaret|4343 West...
    KELLOG BILL          2              Bell|Robert|8912 Hill...

Once KEYNODES[] is sorted, we are ready to reorganize the file according to this new ordering. This process can be described as follows:

for i := 1 to number of records
    Seek in the input file to the record whose RRN is KEYNODES[i].RRN.
    Read this record into a buffer in RAM.
    Write the contents of the buffer out to the output file.

Figure 5.16 outlines the keysort procedure in pseudocode. This procedure works much the same way that a normal internal sort would work, but with two important differences:

- Rather than read entire records into a RAM array, we simply read each record into a temporary buffer, extract the key, and then discard it; and
- When we are writing the records out in sorted order, we have to read them in a second time, since they are not all stored in RAM.

FIGURE 5.16 Pseudocode for keysort.

PROGRAM: keysort

    open input file as IN_FILE
    create output file as OUT_FILE

    read header record from IN_FILE and write a copy to OUT_FILE
    REC_COUNT := record count from header record

    /* read in records; set up KEYNODES array */
    for i := 1 to REC_COUNT
        read record from IN_FILE into BUFFER
        extract canonical key and place it in KEYNODES[i].KEY
        KEYNODES[i].RRN := i

    /* sort KEYNODES[].KEY, thereby ordering RRNs correspondingly */
    sort(KEYNODES, REC_COUNT)

    /* read in records according to sorted order, and write them
       out in this order                                          */
    for i := 1 to REC_COUNT
        seek in IN_FILE to record with RRN of KEYNODES[i].RRN
        read the record into BUFFER from IN_FILE
        write BUFFER contents to OUT_FILE

    close IN_FILE and OUT_FILE

end PROGRAM
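In C, the KEYNODES[] array and the in-RAM sorting step of Fig. 5.16 might be sketched as follows. The record and key sizes and the names are our own choices, RRNs are numbered from 0 here, the input file is assumed to be positioned at its first data record, and the second pass that rewrites the file follows the pseudocode above directly, so it is omitted.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define REC_SIZE 64
#define KEY_SIZE 20

struct keynode {
    char key[KEY_SIZE + 1];   /* canonical key extracted from the record */
    long rrn;                 /* RRN of the record the key came from     */
};

static int by_key(const void *a, const void *b)
{
    return strcmp(((const struct keynode *)a)->key,
                  ((const struct keynode *)b)->key);
}

/* Build KEYNODES[] by reading each record once, keeping only its key,
   then sort the array in RAM.                                          */
struct keynode *build_keynodes(FILE *in, long rec_count)
{
    struct keynode *keynodes = malloc((size_t)rec_count * sizeof *keynodes);
    char buffer[REC_SIZE];
    long i;

    if (keynodes == NULL)
        return NULL;
    for (i = 0; i < rec_count; i++) {
        fread(buffer, REC_SIZE, 1, in);               /* sequential read      */
        memcpy(keynodes[i].key, buffer, KEY_SIZE);    /* extract the key ...  */
        keynodes[i].key[KEY_SIZE] = '\0';
        keynodes[i].rrn = i;                          /* ... remember its RRN */
    }
    qsort(keynodes, (size_t)rec_count, sizeof *keynodes, by_key);
    return keynodes;
}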

5.4.2 Limitations of the Keysort Method

At first glance, keysorting appears to be an obvious improvement over sorts performed entirely in RAM; it might even appear to be a case of getting something for nothing. We know that sorting is an expensive operation and that we want to do it in RAM. Keysorting allows us to achieve this objective without having to hold the entire file in RAM at once.

But, while reading about the operation of writing the records out in sorted order, even a casual reader probably senses a cloud on this apparently bright horizon. In keysort we need to read in the records a second time before we can write out the new sorted file. Doing something twice is never desirable. But the problem is worse than that.

Look carefully at the for loop that reads in the records before writing them out to the new file. You can see that we are not reading through the input file sequentially. Instead, we are working in sorted order, moving from the sorted KEYNODES[] to the RRNs of the records. Since we have to seek to each record and read it in before writing it back out, creating the sorted file requires as many random seeks into the input file as there are records. As we have noted a number of times, there is an enormous difference between the time required to read all the records in a file sequentially and the time required to read those same records if we must seek to each record separately.

What is worse, we are performing all of these accesses in alternation with write statements to the output file. So, even the writing of the output file, which would otherwise appear to be sequential, in most cases involves seeking. The disk drive must move the head back and forth between the two files as it reads and writes.

The getting-something-for-nothing aspect of keysort has suddenly evaporated. Even though keysort does the hard work of sorting in RAM, it turns out that creating a sorted version of the file from the map supplied by the KEYNODES[] array is not at all a trivial matter when the only copies of the records are kept on secondary store.

5.4.3 Another Solution: Why Bother to Write the File Back?

The fundamental idea behind keysort is an attractive one: Why work with an entire record when the only parts of interest, as far as sorting and searching are concerned, are the fields used to form the key? There is a compelling parsimony behind this idea, and it makes keysorting look promising. The promise fades only when we run into the problem of rearranging all the records in the file so they reflect the new, sorted order.

It is interesting to ask whether we can avoid this problem by simply not bothering with the task that is giving us trouble: What if we just skip the time-consuming business of writing out a sorted version of the file? What if, instead, we simply write out a copy of the array of canonical key nodes? If we do without writing the records back in sorted order, writing out the contents of our KEYNODES[] array instead, we will have written a program that outputs an index to the original file. The relationship between the two files is illustrated in Fig. 5.17.

This is an instance of one of our favorite categories of solutions to computer science problems: If some part of a process begins to look like a bottleneck, consider skipping it altogether. Can you do without it? Instead of creating a new, sorted copy of the file to use for searching, we have created a second kind of file, an index file, that is to be used in conjunction with the original file. If we are looking for a particular record, we do our binary search on the index file, then use the RRN stored in the index file record to find the corresponding record in the original file.

There is much to say about the use of index files, enough to fill several chapters. The next chapter is about the various ways we can use simple indexes, which is the kind of index we illustrate here. In later chapters we talk about different ways of organizing the index to provide more flexible access and easier maintenance.

FIGURE 5.17 Relationship between the index file and the data file.

    Index file                   Original file
    BELL ROBERT       4          Harrison|Susan|387 Eastern...
    HARRIS MARGARET   3          Kellogg|Bill|17 Maple...
    HARRISON SUSAN    1          Harris|Margaret|4343 West...
    KELLOGG BILL      2          Bell|Robert|8912 Hill...

5.4.4 Pinned Records

In section 5.2 we discussed the problem of updating and maintaining files. Much of that discussion revolved around the problems of deleting records and keeping track of the space vacated by deleted records so it can be reused. An avail list of deleted record slots is created by linking all of the available slots together. This linking is done by writing a link field into each deleted record that points to the next deleted record. This link field gives very specific information about the exact physical location of the next available record.

When a file contains such references to the physical locations of records, we say that these records are pinned. You can gain an appreciation for this particular choice of terminology if you consider the effects of sorting one of these files containing an avail list of deleted records. A pinned record is one that cannot be moved. Other records in the same file, or in some other file (such as an index file), contain references to the physical location of the record. If the record is moved, these references no longer lead to the record; they become what are called dangling pointers, pointers leading to incorrect, meaningless locations in the file.

Clearly, the use of pinned records in a file can make sorting more difficult and sometimes impossible. But what if we want to support rapid access by key, while still reusing the space made available by record deletion? One solution is to use an index file to keep the sorted order of the records, while keeping the actual data file in its original order. Once again, the problem of finding things leads to the suggestion that we need to take a close look at the use of indexes, which, in turn, leads us to the next chapter.

SUMMARY

In this chapter we look at ways to organize or reorganize files to improve performance in some way.

Data compression methods are used to make files smaller by re-encoding data that goes into a file. Smaller files use less storage, take less time to transmit, and can often be processed faster sequentially.

The notation used for representing information can often be made more compact. For instance, if a two-byte field in a record can take on only 50 values, the field can be encoded using only 6 bits instead of 16. Another form of compression called run-length encoding encodes sequences of repeating values, rather than writing all of the values in the file.

A third form of compression assigns variable-length codes to values depending on how frequently the values occur. Values that occur often are given shorter codes, so they take up less space. Huffman codes are an example of variable-length codes.

Some compression techniques are irreversible in that they lose information in the encoding process. The UNIX utilities compress, uncompress, pack, and unpack provide good compression in UNIX.
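For readers who want to see the run-length idea in code, here is a small C sketch. The escape-byte format (an assumed RUN_FLAG byte, then the repeated value, then a count) is only illustrative; it is not necessarily the exact scheme described in the text, and it assumes the flag byte does not occur in the input data.

    #define RUN_FLAG 0xFF   /* assumed escape byte marking an encoded run */

    /* Encode len bytes from in[] into out[]; runs of three or more identical
       bytes become RUN_FLAG, value, count. Returns the number of bytes written. */
    int rle_encode(const unsigned char *in, int len, unsigned char *out)
    {
        int i = 0, n = 0;

        while (i < len) {
            int run = 1, j;
            while (i + run < len && in[i + run] == in[i] && run < 255)
                run++;
            if (run >= 3) {                       /* long run: encode it        */
                out[n++] = RUN_FLAG;
                out[n++] = in[i];
                out[n++] = (unsigned char) run;
            } else {                              /* short run: copy literally  */
                for (j = 0; j < run; j++)
                    out[n++] = in[i];
            }
            i += run;
        }
        return n;
    }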


A second way to save space in a file is to recover space in the file after it has undergone changes. A volatile file, one that undergoes many changes, can deteriorate very rapidly unless measures are taken to adjust the file organization to the changes. One result of making changes to files is storage fragmentation.

Internal fragmentation occurs when there is wasted space within a record. In a fixed-length record file, internal fragmentation can result when variable-length records are stored in fixed slots. It can also occur in a variable-length record file when one record is replaced by another record of a smaller size. External fragmentation occurs when holes of unused space between records are created, normally because of record deletions.

There are a number of ways to combat fragmentation. The simplest is storage compaction, which squeezes out unused space caused by external fragmentation by sliding all of the undeleted records together. Compaction is generally done in a batch mode.

Fragmentation can be dealt with dynamically by reclaiming deleted space when records are added. The need to keep track of the space to be reused makes this approach more complex than compaction.

We begin with the problem of deleting fixed-length records. Since finding the first field of a fixed-length record is very easy, deleting a record can be accomplished by placing a special mark in the first field. Since all records in a fixed-length record file are the same size, the reuse of deleted records need not be complicated. The solution we adopt consists of collecting all the available record slots into an avail list. The avail list is created by stringing together all the deleted records to form a linked list of deleted record spaces.

In a fixed-length record file, any one record slot is just as usable as any other slot; they are interchangeable. Consequently, the simplest way to maintain the linked avail list is to treat it as a stack. Newly available records are added to the avail list by pushing them onto the front of the list; record slots are removed from the avail list by popping them from the front of the list.
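A minimal C sketch of this stack discipline, assuming fixed-length slots addressed by RRN, an avail-list head stored in the file header, and a deleted slot whose first bytes hold a deletion mark followed by the RRN of the next available slot. The layout and names are illustrative, not the book's own code; the file is assumed open for update ("r+b").

    #include <stdio.h>

    #define REC_SIZE    64                     /* assumed fixed record length     */
    #define HEADER_SIZE ((long) sizeof(long))  /* header holds RRN of avail head  */
    #define END_OF_LIST (-1L)

    /* Push a newly deleted slot onto the front of the avail list. */
    void push_avail(FILE *fp, long rrn)
    {
        long head;

        fseek(fp, 0L, SEEK_SET);               /* read current head from header   */
        fread(&head, sizeof(long), 1, fp);

        fseek(fp, HEADER_SIZE + rrn * REC_SIZE, SEEK_SET);
        fwrite("*", 1, 1, fp);                 /* mark the slot as deleted        */
        fwrite(&head, sizeof(long), 1, fp);    /* link to the old head            */

        fseek(fp, 0L, SEEK_SET);               /* this slot becomes the new head  */
        fwrite(&rrn, sizeof(long), 1, fp);
    }

    /* Pop a slot for reuse; returns its RRN, or END_OF_LIST if the list is empty. */
    long pop_avail(FILE *fp)
    {
        long head, next;

        fseek(fp, 0L, SEEK_SET);
        fread(&head, sizeof(long), 1, fp);
        if (head == END_OF_LIST)
            return END_OF_LIST;

        fseek(fp, HEADER_SIZE + head * REC_SIZE + 1, SEEK_SET);   /* skip the mark */
        fread(&next, sizeof(long), 1, fp);

        fseek(fp, 0L, SEEK_SET);
        fwrite(&next, sizeof(long), 1, fp);
        return head;
    }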

Next, we consider the matter of deleting variable-length records. We still form a linked list of available record slots, but with variable-length records we need to be sure that a record slot is the right size to hold the new record. Our initial definition of right size is simply in terms of being big enough. Consequently, we need a procedure that can search through the avail list until it finds a record slot that is big enough to hold the new record. Given such a function, and a complementary function that places newly deleted records on the avail list, we can implement a system that deletes and reuses variable-length records.
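A first-fit search of that kind can be sketched as follows, using an in-memory image of the avail list; the avail_slot structure and function name are assumptions for the example, not the text's implementation.

    #include <stddef.h>

    struct avail_slot {
        long  offset;                 /* byte offset of the deleted slot in the file */
        int   size;                   /* size of the slot in bytes                   */
        struct avail_slot *next;      /* next slot on the avail list                 */
    };

    /* First fit: return the first slot big enough to hold a record of
       'needed' bytes, unlinking it from the list; NULL if none is found. */
    struct avail_slot *first_fit(struct avail_slot **head, int needed)
    {
        struct avail_slot **link = head;

        while (*link != NULL) {
            if ((*link)->size >= needed) {
                struct avail_slot *found = *link;
                *link = found->next;          /* unlink the chosen slot */
                return found;
            }
            link = &(*link)->next;
        }
        return NULL;                          /* no slot is big enough  */
    }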

We then consider the amount and nature of fragmentation that develops inside a file due to record deletion and reuse. Fragmentation can happen internally if the space is lost because it is locked up inside a record. We develop a procedure that breaks a single, large, variable-length record slot into two or more smaller ones, using exactly as much space as is needed for a new record, leaving the remainder on the avail list. We see that, although this could decrease the amount of wasted space, eventually the remaining fragments are too small to be useful. When this happens, the space is lost to external fragmentation.

There are a number of things that one can do to minimize external fragmentation. They include (1) compacting the file in a batch mode when the level of fragmentation becomes excessive; (2) coalescing adjacent record slots on the avail list to make larger, more generally useful slots; and (3) adopting a placement strategy to select slots for reuse in a way that minimizes fragmentation. Development of algorithms for coalescing holes is left as part of the exercises at the end of this chapter. Placement strategies need more careful discussion.


The placement strategy used up to this point by the variable-length record deletion and reuse procedures is a first-fit strategy. This strategy is simply, "If the record slot is big enough, use it." By keeping the avail list in sorted order, it is easy to implement either of two other placement strategies:

Best fit, in which a new record is placed in the smallest slot that is still big enough to hold it. This is an attractive strategy for variable-length record files in which the fragmentation is internal. It involves more overhead than other placement strategies.

Worst fit, in which a new record is placed in the largest record slot available. The idea is to have the left-over portion of the slot be as large as possible.

There is no firm rule for selecting a placement strategy; the best one can do is use informed judgment based on a number of guidelines.
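As a contrast with the first-fit scan sketched earlier, a best-fit selection might look like the following; again the avail_slot structure is assumed for illustration rather than taken from the text.

    #include <stddef.h>

    struct avail_slot {
        long  offset;
        int   size;
        struct avail_slot *next;
    };

    /* Best fit: scan the entire avail list and remember the smallest slot
       that is still big enough. Returns NULL if no slot can hold the record;
       the caller unlinks and reuses the returned slot.                      */
    struct avail_slot *best_fit(struct avail_slot *head, int needed)
    {
        struct avail_slot *best = NULL;
        struct avail_slot *p;

        for (p = head; p != NULL; p = p->next)
            if (p->size >= needed && (best == NULL || p->size < best->size))
                best = p;
        return best;
    }

A worst-fit variant simply flips the size comparison, keeping the largest adequate slot instead of the smallest.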
In the third major section of this chapter, we look at ways to find things quickly in a file through the use of a key. In preceding chapters it was not possible to access a record rapidly without knowing its physical location or relative record number. Now we explore some of the problems and opportunities associated with keyed direct access.

This chapter develops only one method of finding records by key: binary searching. Binary searching requires O(log n) comparisons to find a record in a file with n records, and hence is far superior to sequential searching. Since binary searching works only on a sorted file, a sorting procedure is an absolute necessity. The problem of sorting is complicated by the fact that we are sorting files on secondary storage rather than vectors in RAM. We need to develop a sorting procedure that does not require seeking back and forth over the file.

Three disadvantages are associated with sorting and binary searching as developed up to this point:

Binary searching is an enormous improvement over sequential searching, but it still usually requires more than one or two accesses per record. The need for fewer disk accesses becomes especially acute in applications where a large number of records are to be accessed by key.

The requirement that the file be kept in sorted order can be expensive. For active files to which records are added frequently, the cost of keeping the file in sorted order can outweigh the benefits of binary searching.

A RAM sort can be used only on relatively small files. This limits the size of the files that we could organize for binary searching, given our sorting tools.
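To make the per-record cost concrete, here is a C sketch of binary searching a sorted file of fixed-length records whose key occupies the first bytes of each record; every probe costs a seek. The record layout and names are assumptions for illustration, and this is not the text's bin_search() function.

    #include <stdio.h>
    #include <string.h>

    #define REC_SIZE 64     /* assumed fixed record length            */
    #define KEY_SIZE 12     /* key stored at the front of each record */

    /* Return the RRN of the record whose key matches, or -1 if not found. */
    long file_bin_search(FILE *fp, const char *key, long num_recs)
    {
        char buf[REC_SIZE];
        long low = 0, high = num_recs - 1;

        while (low <= high) {
            long mid = (low + high) / 2;
            int  cmp;

            fseek(fp, mid * (long) REC_SIZE, SEEK_SET);   /* one seek per probe */
            if (fread(buf, REC_SIZE, 1, fp) != 1)
                return -1;
            cmp = strncmp(key, buf, KEY_SIZE);
            if (cmp == 0)
                return mid;
            if (cmp < 0)
                high = mid - 1;
            else
                low = mid + 1;
        }
        return -1;
    }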

The third problem can be solved partially by developing more powerful sorting procedures, such as a keysort. This approach to sorting resembles a RAM sort in most respects, but does not use RAM to hold the entire file. Instead, it reads in only the keys from the records, sorts the keys, and then uses the sorted list of keys to rearrange the records on secondary storage so they are in sorted order.

The disadvantage to a keysort is that rearranging a file of n records requires n random seeks out to the original file, which can take much more time than does a sequential reading of the same number of records. The inquiry into keysorting is not wasted, however. Keysorting naturally leads to the suggestion that we merely write the sorted list of keys off to secondary storage, setting aside the expensive matter of rearranging the file. This list of keys, coupled with RRN tags pointing back to the original records, is an example of an index. We look at indexing more closely in Chapter 6.

This chapter closes with a discussion of another, potentially hidden, cost of sorting and searching. Pinned records are records that are referenced elsewhere (in the same file or in some other file) according to their physical position in the file. Sorting and binary searching cannot be applied to a file containing pinned records, since the sorting, by definition, is likely to change the physical position of the record. Such a change causes other references to this record to become inaccurate, creating the problem of dangling pointers.

KEY TERMS

Avail list. A list of the space, freed through record deletion, that is available for holding new records. In the examples considered in this chapter, this list of space took the form of a linked list of deleted records.

Best fit. A placement strategy for selecting the space on the avail list used to hold a new record. Best-fit placement finds the available record slot that is closest in size to what is needed to hold the new record.

Binary search. A binary search algorithm locates a key in a sorted list by repeatedly selecting the middle element of the list, dividing the list in half, and forming a new, smaller list from the half that contains the key. This process is continued until the selected element is the key that is sought.

Coalescence. If two deleted, available records are physically adjacent, they can be combined to form a single, larger available record space. This process of combining smaller available spaces into a larger one is known as coalescing holes. Coalescence is a way to counteract the problem of external fragmentation.

Compaction. A way of getting rid of all external fragmentation by sliding all the records together so there is no space lost between them.

Data compression. Encoding information in a file in such a way as to take up less space.

External fragmentation. A form of fragmentation that occurs in a file when there is unused space outside or between individual records.

First fit. A placement strategy for selecting a space from the avail list. First-fit placement selects the first available record slot large enough to hold the new record.

Fragmentation. The unused space within a file. The space can be locked within individual records (internal fragmentation) or outside or between individual records (external fragmentation).

Huffman code. A variable-length code in which the lengths of the codes are based on their probability of occurrence.

Internal fragmentation. A form of fragmentation that occurs when space is wasted in a file because it is locked up, unused, inside of records. Fixed-length record structures often result in internal fragmentation.

Irreversible compression. Compression in which information is lost.

Keysort. A method of sorting a file that does not require holding the entire file in memory. Only the keys are held in memory, along with pointers that tie these keys to the records in the file from which they are extracted. The keys are sorted, and the sorted list of keys is used to construct a new version of the file that has the records in sorted order. The primary advantage of a keysort is that it requires less RAM than does a RAM sort. The disadvantage is that the process of constructing the new file requires a lot of seeking for records.

Linked list. A collection of nodes that have been organized into a specific sequence by means of references placed in each node that point to a single successor node. The logical order of a linked list is often different than the actual physical order of the nodes in the computer's memory.

Pinned record. A record is pinned when there are other records or file structures that refer to it by its physical location. It is pinned in the sense that we are not free to alter the physical location of the record: doing so destroys the validity of the physical references to the record. These references become useless dangling pointers.

Placement strategy. As used in this chapter, a placement strategy is a mechanism for selecting the space on the avail list that is to be used to hold a new record added to the file.

Redundancy reduction. Any form of compression that does not lose information.

Run-length encoding. A compression method in which runs of repeated codes are replaced by a count of the number of repetitions of the code, followed by the code that is repeated.

Stack. A kind of list in which all additions and deletions take place at the same end.
Variable-length encoding. Any encoding scheme in which the codes are of different lengths. More frequently occurring codes are given shorter lengths than are less frequently occurring codes. Huffman encoding is an example of variable-length encoding.

Worst fit. A placement strategy for selecting a space from the avail list. Worst-fit placement selects the largest record slot, regardless of how small the new record is. Insofar as this leaves the largest possible record slot for reuse, worst fit can sometimes help minimize external fragmentation.

EXERCISES

1. In our discussion of compression, we show how we can compress the "state name" field from 16 bits to 6 bits, yet we say that this gives us a space savings of 50%, rather than 62.5%, as we would expect. Why is this so? What other measures might we take to achieve the full 62.5% savings?

2. What is redundancy reduction? Why is run-length encoding an example of redundancy reduction?

3. What is the maximum run length that can be handled in the run-length encoding described in the text? If much longer runs were common, how might you handle them?

4. Encode each of the following using run-length encoding. Discuss the results, and indicate how you might improve the algorithm.
(a) 01 01 01 01 01 01 01 01 01 01 02 03 03 03 03 03 03 03 04 05 06 06 07
(b) 01 02 02 03 03 04 05 06 06 05 05 04 04

5. From Fig. 5.2, determine the Huffman code for the sequence "daeab".

6. What is the difference between internal and external fragmentation? How can compaction affect the amount of internal fragmentation in a file? What about external fragmentation?


7. In-place compaction purges deleted records from a file without creating a separate new file. What are the advantages and disadvantages of in-place compaction compared to compaction in which a separate compacted file is created?

8. Why is a worst-fit placement strategy a bad choice if there is significant loss of space due to internal fragmentation?

9. Conceive of an inexpensive way to keep a continuous record of the amount of fragmentation in a file. This fragmentation measure could be used to trigger the batch processes used to reduce fragmentation.

10. Suppose a file must remain sorted. How does this affect the range of placement strategies available?

11. Develop a pseudocode description of a procedure for performing in-place compaction in a variable-length record file that contains size fields at the start of each record.

12. Consider the process of updating rather than deleting a variable-length record. Outline a procedure for handling such updating, accounting for the update possibly resulting in either a longer or shorter record.

13. In section 5.4, we raised the question of where to keep the stack containing the list of available records. Should it be a separate list, perhaps maintained in a separate file, or should it be embedded within the data file? We choose the latter organization for our implementation. What advantages and disadvantages are there to the second approach? What other kinds of file structures can you think of to facilitate various kinds of record deletion?
14. In some files, each record has a delete bit that is set to 1 to indicate that the record is deleted. This bit can also be used to indicate that a record is inactive rather than deleted. What is required to reactivate an inactive record? Could reactivation be done with the deletion procedures we have used?

15. In this chapter we outlined three general approaches to the problem of minimizing storage fragmentation: (a) implementation of a placement strategy; (b) coalescing of holes; and (c) compaction. Assuming an interactive programming environment, which of these strategies would be used "on the fly," as records are added and deleted? Which strategies would be used as batch processes that could be run periodically?

16. Why do placement strategies make sense only with variable-length record files?

17. Compare the average case performance of binary search with sequential search for records, assuming
(a) That the records being sought are guaranteed to be in the file;
(b) That half of the time the records being sought are not in the file; and
(c) That half of the time the records being sought are not in the file and that missing records must be inserted.
Make a table showing your performance comparisons for files of 1,000, 2,000, 4,000, 8,000, and 16,000 records.

18. If the records in exercise 17 are blocked with 20 records per block, how does this affect the performance of the binary and sequential searches?

19. An internal sort works only with files small enough to fit in RAM. Some computing systems provide users with an almost unlimited amount of RAM with a memory management technique called virtual storage. Discuss the use of internal sorting to sort large files on systems that use virtual storage.

20. Our discussion of keysorting covers the considerable expense associated with the process of actually creating the sorted output file, given the sorted vector of pointers to the canonical key nodes. The expense revolves around two primary areas of difficulty:
(a) Having to jump around in the input file, performing many seeks to retrieve the records in their new, sorted order; and
(b) Writing the output file at the same time we are reading the input file; jumping back and forth between the files can involve seeking.
Design an approach to this problem that uses buffers to hold a number of records, therefore mitigating these difficulties. If your solution is to be viable, obviously the buffers must use less RAM than would a sort taking place entirely within electronic memory.

Programming Exercises

21. Rewrite the program update.c or update.pas so it can delete and add records to a fixed-length record file using one of the replacement procedures discussed in this chapter.

22. Write a program similar to the one described in the preceding exercise, but that works with variable-length record files.

23. Develop a pseudocode description of a variable-length record deletion procedure that checks to see if the newly deleted record is contiguous with any other deleted records. If there is contiguity, coalesce the records to make a single, larger available record slot. Some things to consider as you address this problem are as follows:
a. The avail list does not keep records arranged in physical order; the next record on the avail list is not necessarily the next deleted record in the physical file. Is it possible to merge these two views of the avail list, the physical order and the logical order, into a single list? If you do this, what placement strategy will you use?
b. Physical adjacency can include records that precede as well as follow the newly deleted record. How will you look for a deleted record that precedes the newly deleted record?
c. Maintaining two views of the list of deleted records implies that as you discover physically adjacent records you have to rearrange links to update the nonphysical avail list. What additional complications would we encounter if we were combining the coalescing of holes with a best-fit or worst-fit strategy?

24. Implement the bin_search() function in either C or Pascal. Write a driver program named search to test the function bin_search(). Assume that the files are created with the update program developed in Chapter 4, and then sorted. Include enough debug information in the search driver and bin_search() function to watch the binary searching logic as it makes successive guesses about where to place the new record.

25. Modify the bin_search() function so if the key is not in the file, it returns the relative record number that the key would occupy were it in the file. The function should also continue to indicate whether the key was found or not.

26. Rewrite the search driver from exercise 24 so it uses the new bin_search() function developed in exercise 25. If the sought-after key is in the file, the program should display the record contents. If the key is not found, the program should display a list of the keys that surround the position that the key would have occupied. You should be able to move backward or forward through this list at will. Given this modification, you do not have to remember an entire key to retrieve it. If, for example, you know that you are looking for someone named Smith, but cannot remember the person's first name, this new program lets you jump to the area where all the Smith records are stored. You can then scroll back and forth through the keys until you recognize the right first name.

27. Write an internal sort that can sort a variable-length record file of the kind produced by the writrec programs in Chapter 4.

FURTHER READINGS

A thorough treatment of data compression techniques can be found in Lynch (1985). The Lempel-Ziv method is described in Welch (1984). Huffman encoding is covered in many data structures texts, and also in Knuth (1973a).

Somewhat surprisingly, the literature concerning storage fragmentation and reuse often does not consider these issues from the standpoint of secondary storage. Typically, storage fragmentation, placement strategies, coalescing of holes, and garbage collection are considered in the context of reusing space within electronic random access memory (RAM). As you read this literature with the idea of applying the concepts to secondary storage, it is necessary to evaluate each strategy in light of the cost of accessing secondary storage. Some strategies that are attractive when used in electronic RAM are too expensive on secondary storage.

Discussions about space management in RAM are usually found under the heading "Dynamic Storage Allocation." Knuth (1973a) provides a good, though technical, overview of the fundamental concerns associated with dynamic storage allocation, including placement strategies. Much of Knuth's discussion is reworked and made more approachable by Tremblay and Sorenson (1984). Standish (1980) provides a more complete overview of the entire subject, reviewing much of the important literature on the subject.

This chapter only touches the surface of issues relating to searching and sorting files. A large part of the remainder of this text is devoted to exploring the issues in more detail, so one source for further reading is the present text. But there is much more that has been written about even the relatively simple issues raised in this chapter. The classic reference on sorting and searching is Knuth (1973b). Knuth provides an excellent discussion of the limitations of keysort methods. He also develops a very complete discussion of binary searching, clearly bringing out the analogy between binary searching and the use of binary trees. Baase (1978) provides a clear, understandable analysis of binary search performance.

Indexing

CHAPTER OBJECTIVES

Introduce concepts of indexing that have broad applications in the design of file systems.
Introduce the use of a simple linear index to provide rapid access to records in an entry-sequenced, variable-length record file.
Investigate the implications of the use of indexes for file maintenance.
Describe the use of indexes to provide access to records by more than one key.
Introduce the idea of an inverted list, illustrating Boolean operations on lists.
Discuss the issue of when to bind an index key to an address in the data file.
Introduce and investigate the implications of self-indexing files.

CHAPTER OUTLINE

6.1 What Is an Index?
6.2 A Simple Index with an Entry-Sequenced File
6.3 Basic Operations on an Indexed, Entry-Sequenced File
6.4 Indexes That Are Too Large to Hold in Memory
6.5 Indexing to Provide Access by Multiple Keys
6.6 Retrieval Using Combinations of Secondary Keys
6.7 Improving the Secondary Index Structure: Inverted Lists
    6.7.1 A First Attempt at a Solution
    6.7.2 A Better Solution: Linking the List of References
6.8 Selective Indexes
6.9 Binding

6.1 What Is an Index?

The last few pages of many books contain an index. Such an index is a table containing a list of topics (keys) and numbers of pages where the topics can be found (reference fields).
be found (reference fields).
All indexes are based on the same basic concept: keys and reference fields. The types of indexes we examine in this chapter are called simple indexes because they are represented using simple arrays of structures that contain the keys and reference fields. In later chapters we look at indexing schemes that use more complex data structures, especially trees. In this chapter, however, we want to emphasize that indexes can be very simple and still provide powerful tools for file processing.

The index to a book provides a way to find a topic quickly. If you have ever had to use a book without a good index, you already know that an index is a desirable alternative to scanning through the book sequentially to find a topic. In general, indexing is another way to handle the problem that we explored in Chapter 5: An index is a way to find things.

Consider what would happen if we tried to apply the previous chapter's methods, sorting and binary searching, to the problem of finding things in a book. Rearranging all the words in the book so they were in alphabetical order certainly would make finding any particular term easier but would obviously have disastrous effects on the meaning of the book. In a sense, the terms in the book are pinned records. This is an absurd example, but it clearly underscores the power and importance of the index as a conceptual tool. Since it works by indirection, an index lets you impose order on a file without actually rearranging the file. This not only keeps us from disturbing pinned records, but also makes matters such as record addition much less expensive than they are with a sorted file.

Take, as another example, the problem of finding books in a library. We want to be able to locate books by a specific author, by their titles, or by subject areas. One way of achieving this is to have three copies of each book and three separate library buildings. All of the books in one building would be sorted by author's name, another building would contain books arranged by title, and the third would have them ordered by subject. Again, this is an absurd example, but one that underscores another important advantage of indexing. Instead of using multiple arrangements, a library uses a card catalog. The card catalog is actually a set of three indexes, each using a different key field, and all of them using the same catalog number as a reference field. Another use of indexing, then, is to provide multiple access paths to a file.

We also find that indexing gives us keyed access to variable-length record files. Let's begin our discussion of indexing by exploring this problem of access to variable-length records and the simple solution that indexing provides.

6.2 A Simple Index with an Entry-Sequenced File

Suppose we own an extensive collection of musical recordings and we want to keep track of the collection through the use of computer files. For each recording, we keep the information shown in Fig. 6.1. The data file records are variable length. Figure 6.2 illustrates such a collection of data records. We refer to this data record file as Datafile.

There are a number of approaches that could be used to create a variable-length record file to hold these records; the record addresses used in Fig. 6.2 suggest that each record be preceded by a size field that permits skip sequential access and easier file maintenance. This is the structure we use.

FIGURE 6.1 Contents of a data record: identification number, title, composer or composers, artist or artists, label (publisher).

FIGURE 6.2 Sample contents of Datafile. (Assume there is a header record that uses the first 32 bytes.)

Rec. addr.  Label  ID number  Title                     Composer(s)       Artist(s)
32          LON    2312       Romeo and Juliet          Prokofiev         Maazel
77          RCA    2626       Quartet in C Sharp Minor  Beethoven         Julliard
132         WAR    23699      Touchstone                Corea             Corea
167         ANG    3795       Symphony No. 9            Beethoven         Giulini
211         COL    38358      Nebraska                  Springsteen       Springsteen
256         DG     18807      Symphony No. 9            Beethoven         Karajan
300         MER    75016      Coq d'or Suite            Rimsky-Korsakov   Leinsdorf
353         COL    31809      Symphony No. 9            Dvorak            Bernstein
396         DG     139201     Violin Concerto           Beethoven         Ferras
442         FF     245        Good News                 Sweet Honey in the Rock   Sweet Honey in the Rock

Suppose we formed a primary key for these records consisting of the initials for the record company label combined with the record company's ID number.

This will make a good primary key since it should provide a unique key for each entry in the file. We call this key the Label ID. The canonical form for the Label ID consists of the uppercase form of the Label field followed immediately by the ASCII representation of the ID number. For example, LON2312.

How could we organize the file to provide rapid keyed access to individual records? Could we sort the file and then use binary searching? Unfortunately, binary searching depends on being able to jump to the middle record in the file. This is not possible in a variable-length record file because direct access by relative record number is not possible: there is no way to know where the middle record is in any group of records.

An alternative to sorting is to construct an index for the file. Figure 6.3 illustrates such an index. On the right is the data file containing information about our collection of recordings, with one variable-length data record per recording. Only four fields are shown (Label, ID number, Title, and Composer), but it is easy to imagine the other information filling out each record.

On the left is the index file, each record of which contains a key (left justified, blank filled) corresponding to a certain Label ID in the data file. Each key is associated with a reference field giving the address of the first byte of the corresponding data record.
ANG3795, for example, corresponds to the reference field containing the number 167, meaning that the record containing full information on the recording with Label ID ANG3795 can be found starting at byte number 167 in the record file.

FIGURE 6.3 Sample index with corresponding data file.

Indexfile (key, reference field):
ANG3795    167
COL31809   353
COL38358   211
DG139201   396
DG18807    256
FF245      442
LON2312     32
MER75016   300
RCA2626     77
WAR23699   132

Datafile (address of record, actual data record):
32   LON 2312 Romeo and Juliet Prokofiev
77   RCA 2626 Quartet in C Sharp Minor Beethoven
132  WAR 23699 Touchstone Corea
167  ANG 3795 Symphony No. 9 Beethoven
211  COL 38358 Nebraska Springsteen
256  DG 18807 Symphony No. 9 Beethoven
300  MER 75016 Coq d'or Suite Rimsky-Korsakov
353  COL 31809 Symphony No. 9 Dvorak
396  DG 139201 Violin Concerto Beethoven
442  FF 245 Good News Sweet Honey In The Rock

The structure of the index file is very simple. It is a fixed-length record file in which each record has two fixed-length fields: a key field and a byte-offset field. There is one record in the index file for every record in the data file.

Note also that the index is sorted, whereas the data file is not. Consequently, although ANG3795 is the first entry in the index, it is not necessarily the first entry in the data file. In fact, the data file is entry sequenced, which means that the records occur in the order that they are entered into the file. As we see soon, the use of an entry-sequenced file can make record addition and file maintenance much simpler than is the case with a data file that is kept sorted by some key.
PROCEDURE retrieve_record(KEY)
    find position of KEY in Indexfile    /* probably using binary search */
    look up the BYTE_OFFSET of the corresponding record in Datafile
    use SEEK() and the byte_offset to move to the data record
    read the record from Datafile
end PROCEDURE

FIGURE 6.4 Retrieve_record(): a procedure to retrieve a single record from Datafile through Indexfile.

Using the index to provide access to the data file by Label ID is a simple matter. The steps needed to retrieve a single record with key KEY from Datafile are shown in the procedure retrieve_record() in Fig. 6.4.

Although this retrieval strategy is relatively straightforward, it contains some features that deserve comment:

We are now dealing with two files: the index file and the data file.

The index file is considerably easier to work with than the data file because it uses fixed-length records (which is why we can search it with a binary search) and because it is likely to be much smaller than the data file.

By requiring that the index file have fixed-length records, we impose a limit on the sizes of our keys. In this example we assume that the primary key field is long enough to retain every key's unique identity. The use of a small, fixed key field in the index could cause problems if a key's uniqueness is truncated away as it is placed in the fixed index field.

In the example, the index carries no information other than the keys and the reference fields, but this need not be the case. We could, for example, keep the length of each Datafile record in Indexfile.
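Putting these pieces together, a retrieval function along the lines of Fig. 6.4 could be sketched in C as follows. The index_rec structure, the in-memory index array, and the function names are assumptions made for the example rather than the text's actual code.

    #include <stdio.h>
    #include <string.h>

    #define KEY_SIZE 12

    struct index_rec {
        char key[KEY_SIZE + 1];   /* canonical Label ID, blank filled  */
        long byte_offset;         /* starting byte of the data record  */
    };

    /* Binary search the in-memory index for key; return its position or -1. */
    int find_key(struct index_rec index[], int count, const char *key)
    {
        int low = 0, high = count - 1;

        while (low <= high) {
            int mid = (low + high) / 2;
            int cmp = strcmp(key, index[mid].key);
            if (cmp == 0) return mid;
            if (cmp < 0)  high = mid - 1;
            else          low = mid + 1;
        }
        return -1;
    }

    /* Seek to the start of the data record for key; the caller then reads the
       variable-length record from datafile. Returns 0 on success, -1 otherwise. */
    int retrieve_record(FILE *datafile, struct index_rec index[], int count,
                        const char *key)
    {
        int pos = find_key(index, count, key);

        if (pos < 0)
            return -1;                               /* key not in the index */
        return fseek(datafile, index[pos].byte_offset, SEEK_SET);
    }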

6.3 Basic Operations on an Indexed, Entry-Sequenced File

We have noted that the process of keeping files sorted to permit binary searching for records can be very expensive. One of the great advantages of using a simple index with an entry-sequenced data file is that record addition can take place much more quickly than with a sorted data file as long as the index is small enough to be held entirely in memory. If the index record length is short, this is not a difficult condition to meet for small files consisting of no more than a few thousand records. For the moment our discussions assume that the condition is met and that the index is read from secondary storage into an array of structures called INDEX[]. Later we consider what should be done when the index is too large to fit into memory.

Keeping the index in memory as the program runs also lets us find records by key more quickly with an indexed file than with a sorted one since the binary searching can be performed entirely in memory. Once the byte offset for the data record is found, then a single seek is required to retrieve the record. The use of a sorted data file, on the other hand, requires a seek for each step of the binary search.

The support and maintenance of an entry-sequenced file that is coupled with a simple index requires the development of procedures to handle a number of different tasks. Besides the retrieve_record() algorithm described previously, other procedures used to find things by means of the index include the following:

Create the original empty index and data files;
Load the index file into memory before using it;
Rewrite the index file from memory after using it;
Add records to the data file and index;
Delete records from the data file; and
Update records in the data file.

Creating the Files. Both the index file and the data file are created as empty files, with header records and nothing else. This can be accomplished quite easily by creating the files and writing headers to both files.

Loading the Index into Memory. We assume that the index file is small enough to fit into primary memory, so we define an array INDEX[] to hold the index records. Each array element has the structure of an index record. Loading the index file into memory, then, is simply a matter of reading in and saving the index header record and then reading the records from the index file into the INDEX[] array. Since this will be a sequential read, and since the records are short, the procedure should be written so it reads a large number of index records at once, rather than one record at a time.

Rewriting the Index File from Memory. When processing of an indexed file is completed, it is necessary to rewrite INDEX[] back into the index file if the array has been changed in any way. In Fig. 6.5, the procedure rewrite_index() describes the steps for doing this.
PROCEDURE rewrite_index()
    check a status flag that tells whether the INDEX[] array
        has been changed in any way
    if there were changes, then
        open the index file as a new empty file
        update the header record and rewrite the header
        write the index out to the newly created file
        close the index file
end PROCEDURE

FIGURE 6.5 The rewrite_index() procedure.
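The complementary load step, reading the index file into the INDEX[] array before it is used, might be sketched as follows in C. The header layout (a changed-flag byte followed by a record count), the array size, and the names are assumptions made for the example, not the text's own file format.

    #include <stdio.h>

    #define KEY_SIZE 12
    #define MAX_KEYS 1000                 /* assumed upper limit on index size */

    struct index_rec {
        char key[KEY_SIZE + 1];
        long byte_offset;
    };

    struct index_rec INDEX[MAX_KEYS];

    /* Load the index into memory. Returns the number of records read, or -1
       if the file cannot be opened or its changed flag is still set (meaning
       the index on disk is out of date and must be reconstructed).           */
    long load_index(const char *index_name)
    {
        FILE *idx = fopen(index_name, "rb");
        char  changed;
        long  count, got;

        if (idx == NULL)
            return -1;
        fread(&changed, sizeof(char), 1, idx);    /* header: status flag        */
        fread(&count, sizeof(long), 1, idx);      /* header: number of records  */
        if (changed || count > MAX_KEYS) {
            fclose(idx);
            return -1;                            /* index needs reconstruction */
        }
        got = (long) fread(INDEX, sizeof(struct index_rec), (size_t) count, idx);
        fclose(idx);
        return got;
    }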

It is important to consider what happens if this rewriting of the index does not take place, or takes place incompletely. Programs do not always run to completion. A program designer needs to guard against power failures, against the operator turning the machine off at the wrong time, and other such disasters.

One of the serious dangers associated with reading an index into memory and then writing it out when the program is over is that the copy of the index on disk will be out of date and incorrect if the program is interrupted. It is imperative that a program contain at least the following two safeguards to protect against this kind of error:

There should be a mechanism that permits the program to know when the index is out of date. One possibility involves setting a status flag as soon as the copy of the index in memory is changed. This status flag could be written into the header record of the index file on disk as soon as the index is read into memory, and then subsequently cleared when the index is rewritten. All programs could check the status flag before using an index. If the flag is found to be set, then the program would know that the index is out of date.

If a program detects that an index is out of date, the program must have access to a procedure that reconstructs the index from the data file. This should happen automatically, taking place before any attempt is made to use the index.
Record Addition. Adding a new record to the data file requires that we also add a record to the index file. Adding to the data file itself is easy. The exact procedure depends, of course, on the kind of variable-length file organization being used. In any case, when we add a data record we should know the starting byte_offset of the location at which we wrote the record. This information, along with the canonical form of the record's key, must be placed in the INDEX[] array.

Since the INDEX[] array is kept in sorted order by key, insertion of the new index record probably requires some rearrangement of the index. In a way, the situation is similar to the one we face as we add records to a sorted data file. We have to shift or slide all the records that have keys that come in order after the key of the record we are inserting. The shifting opens up a space for the new record. The big difference between the work we have to do on the index records and the work required for a sorted data file is that the INDEX[] array is contained wholly in memory. All of the index rearrangement can be done without any file access.
Record Deletion. In Chapter 5 we describe a number of approaches to deleting records in variable-length record files that allow for the reuse of the space occupied by these records. These approaches are completely viable for our data file since, unlike a sorted data file, the records in this file need not be moved around to maintain an ordering on the file. This is one of the great advantages of an indexed file organization: We have rapid access to individual records by key without disturbing pinned records. In fact, the indexing itself pins all the records.

Of course, when we delete a record from the data file we must also delete the corresponding entry from our index file. Since the index is contained in an array during program execution, deleting the index record and shifting the other records to close up the space may not be an overly expensive operation. Alternatively, we could simply mark the index record as deleted, just as we might mark the corresponding data record.

Record Updating. Record updating falls into two categories:

The update changes the value of the key field. This kind of update can bring about a reordering of the index file as well as the data file. Conceptually, the easiest way to think of this kind of change is as a deletion followed by an addition. This delete/add approach can be implemented while still providing the program user with the view that he or she is merely changing a record.

The update does not affect the key field. This second kind of update does not require rearrangement of the index file, but may well involve reordering of the data file. If the record size is unchanged or decreased by the update, the record can be written directly into its old space, but if the record size is increased by the update, a new slot for the record will have to be found. In the latter case the starting address of the rewritten record must replace the old address in the byte_offset field of the corresponding index record.

6.4 Indexes That Are Too Large to Hold in Memory

The methods we have been discussing, and, unfortunately, many of the advantages associated with them, are tied to the assumption that the index file is small enough to be loaded into memory in its entirety. If the index is too large for this approach to be practical, then index access and maintenance must be done on secondary storage. With simple indexes of the kind we have been discussing, accessing the index on a disk has the following disadvantages:

Binary searching of the index requires several seeks rather than taking place at electronic memory speeds. Binary searching of an index on secondary storage is not substantially faster than the binary searching of a sorted file.

Index rearrangement due to record addition or deletion requires shifting or sorting records on secondary storage. This is literally millions of times more expensive than the cost of these same operations when performed in electronic memory.

Although these problems are no worse than those associated with the use of any file that is sorted by key, they are severe enough to warrant the consideration of alternatives. Any time a simple index is too large to hold in memory, you should consider using

A hashed organization if access speed is a top priority; or

A tree-structured index, such as a B-tree, if you need the flexibility of both keyed access and ordered, sequential access.

These alternative file organizations are discussed at length in the chapters that follow. But, before writing off the use of simple indexes on secondary storage altogether, we should note that they provide some important advantages over the use of a data file sorted by key even if the index cannot be held in memory:

A simple index makes it possible to use a binary search to obtain keyed access to a record in a variable-length record file. The index provides the service of associating a fixed-length and therefore binary-searchable record with each variable-length data record.

If the index records are substantially smaller than the data file records, sorting and maintaining the index can be less expensive than would be sorting and maintaining the data file, simply because there is less information to move around in the index file.

If there are pinned records in the data file, the use of an index lets us rearrange the keys without moving the data records.

There is another advantage associated with the use of simple indexes, one that we have not yet discussed. It, in itself, can be reason enough to use simple indexes even if they do not fit into memory. Remember the analogy between an index and a library card catalog? The card catalog provides multiple views or arrangements of the library's collection, even though there is only one set of books arranged in a single order. Similarly, we can use multiple indexes to provide multiple views of a data file.

6.5 Indexing to Provide Access by Multiple Keys

One question that might reasonably arise at this point is, "All this indexing business is pretty interesting, but who would ever want to find a record using a key such as DG18807? What one wants is the Symphony No. 9 record by Beethoven."

Let's return to our analogy between our index and a library card catalog. Suppose we think of our primary key, the Label ID, as a kind of catalog number. Like the catalog number assigned to a book, we have taken care to make our Label ID unique. Now, in a library it is very unusual to begin by looking for a book with a particular catalog number (e.g., "I am looking for a book with a catalog number QA331T5 1959."). Instead, one generally begins by looking for a book on a particular subject, with a particular title, or by a particular author (e.g., "I am looking for a book on functions," or "I am looking for The Theory of Functions by Titchmarsh."). Given the subject, author, or title, one looks in the card catalog to find the primary key, the catalog number.

Similarly, we could build a catalog for our record collection consisting of entries for album title, composer, and artist. These fields are secondary key fields. Just as the library catalog relates an author entry (secondary key) to a catalog number (primary key), so can we build an index file that relates Composer to Label ID, as illustrated in Fig. 6.6.

Along with the similarities, there is an important difference between this kind of secondary key index and the card catalog in a library. In a library, once you have the catalog number you can usually go directly to the stacks to find the book since the books are arranged in order by catalog number. In other words, the books are sorted by primary key.

FIGURE 6.6 Secondary key index organized by composer.

Composer index (secondary key, primary key):
BEETHOVEN              ANG3795
BEETHOVEN              DG139201
BEETHOVEN              DG18807
BEETHOVEN              RCA2626
COREA                  WAR23699
DVORAK                 COL31809
PROKOFIEV              LON2312
RIMSKY-KORSAKOV        MER75016
SPRINGSTEEN            COL38358
SWEET HONEY IN THE R   FF245

The actual data records in our file, on the other hand, are entry sequenced. Consequently, after consulting the composer index to find the Label ID, you must consult one additional index, our primary key index, to find the actual byte offset of the record that has this particular Label ID. The procedure is summarized in Fig. 6.7.

Clearly it is possible to relate secondary key references (e.g., Beethoven) directly to a byte offset (211) rather than to a primary key (DG18807). However, there are excellent reasons for postponing this binding of a secondary key to a specific address for as long as possible. These reasons become clear as we discuss the way that fundamental file operations such as record deletion and updating are affected by the use of secondary indexes.

PROCEDURE search_on_secondary(KEY)
    search for KEY in the secondary index
    once the correct secondary index record is found, set LABEL_ID
        to the primary key value in the record's reference field
    call retrieve_record(LABEL_ID) to get the data record
end PROCEDURE

FIGURE 6.7 Search_on_secondary: an algorithm to retrieve a single record from Datafile through a secondary key index.

Record Addition. When a secondary index is present, adding a record to the file also means adding a record to the secondary index. The cost of doing this is very similar to the cost of adding a record to the primary index: Either records must be shifted or a vector of pointers to structures needs to be rearranged. As with primary indexes, the cost of doing this decreases greatly if the secondary index can be read into electronic memory and changed there.

Note that the key field in the secondary index file is stored in canonical form (all of the composers' names are capitalized), since this is the form that we want to use when we are consulting the secondary index. If we want to print out the name in normal, mixed upper- and lowercase form, we can pick up that form from the original data file. Also note that the secondary keys are held to a fixed length, which means that sometimes they are truncated. The definition of the canonical form should take this length restriction into account if searching the index is to work properly.

One important difference between a secondary index and a primary index is that a secondary index can contain duplicate keys. In the sample index illustrated in Fig. 6.6, there are four records with the key BEETHOVEN. Duplicate keys are, of course, grouped together. Within this group, they should be ordered according to the values of the reference fields. In this example, that means placing them in order by Label ID. The reasons for this second level of ordering become clear a little later, as we discuss retrieval based on combinations of two or more secondary keys.

Record Deletion. Deleting a record usually implies removing all references to that record in the file system. So, removing a record from the data file would mean removing not only the corresponding record in the primary index, but also all of the records in the secondary indexes that refer to this primary index record. The problem with this is that secondary indexes, like the primary index, are maintained in sorted order by key. Consequently, deleting a record would involve rearranging the remaining records to close up the space left open by deletion.

This delete-all-references approach would indeed be advisable if the secondary index referenced the data file directly. If we did not delete the secondary key references, and if the secondary keys were associated with actual byte offsets in the data file, it could be difficult to tell when these references were no longer valid. This is another instance of the pinned-record problem. The reference fields associated with the secondary keys would be pointing to byte offsets that could, after deletion and subsequent space reuse in the data file, be associated with different data records.

But we have carefully avoided referencing actual addresses in the secondary key index. After a search to find the secondary key, we do another search, this time on primary key. Since the primary index does reflect changes due to record deletion, a search for the primary key of a record that has been deleted will fail, returning a record-not-found condition. In a sense, the updated primary key index acts as a kind of final check, protecting us from trying to retrieve records that no longer exist.

Consequently, one option that is open to us when we delete a record from the data file is to modify and rearrange only the primary key index. We could safely leave intact the references to the deleted record that exist in the secondary key indexes. Searches starting from a secondary key index that lead to a deleted record are caught when we consult the primary key index.

If there are a number of secondary key indexes, the savings that result from not having to rearrange all of these indexes when a record is deleted can be substantial. This is especially important when the secondary key indexes are kept on secondary storage. It is also important in an interactive system, where the user is waiting at a terminal for the deletion operation to complete.

There is, of course, a cost associated with this short cut: Deleted records take up space in the secondary index files. With a file system that undergoes few deletions, this is not usually a problem. With a somewhat more volatile file structure, it is possible to address the problem by periodically removing from the secondary index files all records that contain references that are no longer in the primary index. If a file system is so volatile that even periodic purging is not adequate, it is probably time to consider another index structure, such as a B-tree, which allows for deletion without having to rearrange a lot of records.
Record Updating

In our discussion of record deletion, we find that the primary key index serves as a kind of protective buffer, insulating the secondary indexes from changes in the data file.

This insulation extends to record updating as well. If our secondary indexes contain references directly to byte offsets in the data file, then updates to the data file that result in changing a record's physical location in the file also require updating the secondary indexes. But, since we are confining such detailed information to the primary index, data file updates affect the secondary index only when they change either the primary or the secondary key. There are three possible situations:

Update changes the secondary key: If the secondary key is changed, then we may have to rearrange the secondary key index so it stays in sorted order. This can be a relatively expensive operation.

Update changes the primary key: This kind of change has a large impact on the primary key index, but often requires only that we update the affected reference field (Label_id in our example) in all the secondary indexes. This involves searching the secondary indexes (on the unchanged secondary keys) and rewriting the affected fixed-length field. It does not require reordering of the secondary indexes unless the corresponding secondary key occurs more than once in the index. If a secondary key does occur more than once, there may be some local reordering, since records having the same secondary key are ordered by the reference field (primary key).

Update confined to other fields: All updates that do not affect either the primary or secondary key fields do not affect the secondary key index, even if the update is substantial. Note that if there are several secondary key indexes associated with a file, updates to records often affect only a subset of the secondary indexes.
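This decision logic is easy to write down directly. The following C sketch is not from the text; the record layout and field names are assumptions made purely for illustration. It simply classifies an update into the three cases just described, returning a bitmask of the index maintenance that would be required.

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical record layout; field names are assumptions for illustration. */
    typedef struct {
        char label_id[13];     /* primary key                        */
        char composer[21];     /* secondary key used in this example */
        char title[31];        /* another field, not indexed here    */
    } RECORD;

    #define TOUCH_PRIMARY    1  /* rearrange the primary key index            */
    #define TOUCH_REFERENCES 2  /* rewrite reference fields in secondaries    */
    #define TOUCH_SECONDARY  4  /* reorder the affected secondary key index   */

    /* Return a bitmask describing which index maintenance an update needs. */
    int index_work(const RECORD *old_rec, const RECORD *new_rec)
    {
        int work = 0;
        if (strcmp(old_rec->label_id, new_rec->label_id) != 0)
            work |= TOUCH_PRIMARY | TOUCH_REFERENCES;
        if (strcmp(old_rec->composer, new_rec->composer) != 0)
            work |= TOUCH_SECONDARY;
        return work;            /* 0 means no index is affected at all */
    }

    int main(void)
    {
        RECORD before = { "ANG3795", "BEETHOVEN", "SYMPHONY NO. 9" };
        RECORD after  = before;
        strcpy(after.title, "SYMPHONY NO. 9 (REMASTERED)");   /* other field only */
        printf("work needed: %d\n", index_work(&before, &after));  /* prints 0 */
        return 0;
    }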

6.6 Retrieval Using Combinations of Secondary Keys

One of the most important applications of secondary keys involves using two or more of them in combination to retrieve special subsets of records from the data file. To provide an example of how this can be done, we will extract another secondary key index from our file of recordings. This one uses the recording's title as the key, as illustrated in Fig. 6.8.

FIGURE 6.8 Secondary key index organized by recording title.

Title index
Secondary key              Primary key
COQ DOR SUITE              MER75016
GOOD NEWS                  FF245
NEBRASKA                   COL38358
QUARTET IN C SHARP M       RCA2626
ROMEO AND JULIET           LON2312
SYMPHONY NO. 9             ANG3795
SYMPHONY NO. 9             COL31809
SYMPHONY NO. 9             DG18807
TOUCHSTONE                 WAR23699
VIOLIN CONCERTO            DG139201

Now we can respond to requests such as

Find the record with Label ID COL38358 (primary key access);

Find all the recordings of Beethoven's work (secondary key: composer); and

Find all the recordings titled "Violin Concerto" (secondary key: title).

What is more interesting, however, is that we can also respond to a request that combines retrieval on the composer index with retrieval on the title index, such as: Find all recordings of Beethoven's Symphony No. 9.

Without the use of secondary indexes, this kind of request requires a sequential search through the entire file. Given a file containing thousands, or even just hundreds, of records, this is a very expensive process. But, with the aid of secondary indexes, responding to this request is simple and quick. We begin by recognizing that this request can be rephrased as a Boolean AND operation, specifying the intersection of two subsets of the data file:

Find all data records with:
    composer = "BEETHOVEN" AND title = "SYMPHONY NO. 9"

We begin our response to this request by searching the composer index for the list of Label IDs that identify records with Beethoven as the composer. (An exercise at the end of this chapter describes a binary search procedure that can be used for this kind of retrieval.) This yields the following list of Label IDs:

    ANG3795
    DG139201
    DG18807
    RCA2626

Next we search the title index for the Label IDs associated with records that have SYMPHONY NO. 9 as the title key:

    ANG3795
    COL31809
    DG18807

Now we perform the Boolean AND, which is a match operation, combining the lists so only the members that appear in both lists are placed in the output list.

    Composers      Titles         Matched list
    ANG3795        ANG3795        ANG3795
    DG139201       COL31809       DG18807
    DG18807        DG18807
    RCA2626

We give careful attention to algorithms for performing this kind of match operation in Chapter 7. Note that this kind of matching is much easier if the lists that are being combined are in sorted order. That is the reason why, when we have more than one entry for a given secondary key, the records are ordered by the primary key reference fields.
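Because the two lists are sorted, the intersection can be produced in a single forward pass. The short C sketch below is our own illustration, not code from the text; it assumes the Label IDs have already been collected into arrays of strings, and it uses the same three-way comparison that Chapter 7 develops in detail.

    #include <stdio.h>
    #include <string.h>

    /* Print the Label IDs that appear in both sorted lists (Boolean AND). */
    void match_lists(const char *a[], int na, const char *b[], int nb)
    {
        int i = 0, j = 0;
        while (i < na && j < nb) {
            int cmp = strcmp(a[i], b[j]);
            if (cmp < 0)       i++;                 /* a[i] cannot be in b */
            else if (cmp > 0)  j++;                 /* b[j] cannot be in a */
            else { printf("%s\n", a[i]); i++; j++; } /* common member      */
        }
    }

    int main(void)
    {
        const char *composers[] = { "ANG3795", "DG139201", "DG18807", "RCA2626" };
        const char *titles[]    = { "ANG3795", "COL31809", "DG18807" };
        match_lists(composers, 4, titles, 3);   /* prints ANG3795 and DG18807 */
        return 0;
    }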

Finally, once we have the list of primary keys occurring in both lists, we can proceed to the primary key index to look up the addresses of the data records. Then we can retrieve the records:

    ANG3795    Symphony No. 9    Beethoven    Giulini
    DG18807    Symphony No. 9    Beethoven    Karajan

This is the kind of operation that makes computer-indexed file systems useful in a way that far exceeds the capabilities of manual systems. We have only one copy of each data file record, and yet, working through the secondary indexes, we have multiple views of these records: We can look at them in order by title, by composer, or by any other field that interests us.


Using the computer's ability to combine sorted lists rapidly, we can even
combine different views, retrieving intersections (Beethoven AND Symphony No. 9) or unions (Beethoven OR Prokofiev OR Symphony No. 9) of
these views. And since our data file is entry sequenced, we can do all of this
without having to sort data file records, confining our sorting to the smaller
index records which can often be held in electronic memory.

Now that we have a general idea of the design and uses of secondary indexes, we can look at ways to improve these indexes so they take less space and require less sorting.

6.7 Improving the Secondary Index Structure: Inverted Lists

The secondary index structures that we have developed so far result in two distinct difficulties:

We have to rearrange the index file every time a new record is added to the file, even if the new record is for an existing secondary key. For example, if we add another recording of Beethoven's Symphony No. 9 to our collection, both the composer and title indexes would have to be rearranged, even though both indexes already contain entries for the secondary keys (but not the Label IDs) that are being added.

If there are duplicate secondary keys, the secondary key field is repeated for each entry. This wastes space, making the files larger than necessary. Larger index files are less likely to be able to fit in electronic memory.

6.7.1 A First Attempt at a Solution

One simple response to these difficulties is to change the secondary index structure so it associates an array of references with each secondary key. For example, we might use a record structure that allows us to associate up to four Label ID reference fields with a single secondary key, as in

    BEETHOVEN    ANG3795    DG139201    DG18807    RCA2626

Figure 6.9 provides a schematic example of how such an index would look if used with our sample data file.

FIGURE 6.9 Secondary key index containing space for multiple references for each secondary key.

Revised composer index
Secondary key             Set of primary key references
BEETHOVEN                 ANG3795   DG139201   DG18807   RCA2626
COREA                     WAR23699
DVORAK                    COL31809
PROKOFIEV                 LON2312
RIMSKY-KORSAKOV           MER75016
SPRINGSTEEN               COL38358
SWEET HONEY IN THE R      FF245

The major contribution of this revised index structure is toward the solution of our first difficulty: the need to rearrange the secondary index file every time a new record is added to the data file. Looking at Fig. 6.9, we can see that the addition of another recording of a work by Prokofiev does not require the addition of another record to the index. For example, if we add the recording

    ANG 36193    Piano Concertos    Prokofiev    Francois

we need to modify only the corresponding secondary index record by inserting a second Label ID:

    PROKOFIEV    ANG36193    LON2312

Since we are not adding another record to the secondary index, there is no need to rearrange any records. All that is required is a rearrangement of the fields in the existing record for Prokofiev.
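As a rough picture of what this first-attempt record might look like in C (our own sketch, not a declaration from the text; the field sizes are arbitrary), each index entry carries the secondary key plus a small fixed array of reference fields, most of which may sit empty:

    #define KEY_LEN   21   /* canonical secondary key, e.g. composer name */
    #define REF_LEN   13   /* primary key (Label ID) reference            */
    #define MAX_REFS   4   /* fixed limit: the source of both problems    */

    /* One fixed-length record of the revised composer index (a sketch). */
    typedef struct {
        char secondary_key[KEY_LEN];
        char refs[MAX_REFS][REF_LEN];   /* unused slots left as empty strings */
    } SECONDARY_ENTRY;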

Although this new secondary index structure helps avoid the need to rearrange the secondary index file so often, it does have some problems. For one thing, it provides space for only four Label IDs to be associated with a given key. In the very likely case that more than four Label IDs will go with some key, we need a mechanism for keeping track of the extra Label IDs.

A second problem has to do with space usage. Although the structure does help avoid the waste of space due to the repetition of identical keys, this space savings comes at a potentially high cost. By extending the fixed length of each of the secondary index records to hold more reference fields, we might easily lose more space to internal fragmentation than we gained by not repeating identical keys.

Since we don't want to waste any more space than we have to, we need to ask whether we can improve on this record structure. Ideally, what we would like to do is develop a new design, a revision of our revision, that

Retains the attractive feature of not requiring reorganization of the secondary indexes for every new entry to the data file;

Allows more than four Label IDs to be associated with each secondary key; and

Does away with the waste of space due to internal fragmentation.

6.7.2 A Better Solution: Linking the List of References

Files such as our secondary indexes, in which a secondary key leads to a set of one or more primary keys, are called inverted lists. The sense in which the list is inverted should be clear if you consider that we are working our way backward from a secondary key to the primary key to the record itself.

The second word in the term inverted list also tells us something important: that we are, in fact, dealing with a list of primary key references. Our revised secondary index, which collects together a number of Label IDs for each secondary key, reflects this list aspect of the data more directly than did our initial secondary index. Another way of conceiving of this list aspect of our inverted list is illustrated in Fig. 6.10.

As Fig. 6.10 shows, an ideal situation would be to have each secondary key point to a different list of primary key references. Each of these lists could grow to be just as long as it needs to be. If we add the new Prokofiev record, the list of Prokofiev references becomes

    ANG36193    LON2312

FIGURE 6.10 Conceptual view of the primary key reference fields as a series of lists.

Secondary key index       Lists of primary key references
BEETHOVEN                 ANG3795   DG139201   DG18807   RCA2626
COREA                     WAR23699
DVORAK                    COL31809
PROKOFIEV                 LON2312

Similarly, adding two new Beethoven recordings adds just two additional elements to the list of references associated with the Beethoven key. Unlike our record structure that allocates enough space for four Label IDs for each secondary key, the lists could contain hundreds of references, if needed, while still requiring only one instance of a secondary key. On the other hand, if a list requires only one element, then no space is lost to internal fragmentation. Most important of all, we need to rearrange only the file of secondary keys if a new composer is added to the file.

How can we set up an unbounded number of different lists, each of varying length, without creating a large number of small files? The simplest way is through the use of linked lists. We could redefine our secondary index so it consists of records with two fields: a secondary key field, and a field containing the relative record number of the first corresponding primary key reference (Label ID) in the inverted list. The actual primary key references associated with each secondary key would be stored in a separate, entry-sequenced file.
Given the sample data we have been working with, this new design
would result in a secondary key file for composers and an associated Label
ID file that are organized as illustrated in Fig. 6.11. Following the links for
the list of references associated with Beethoven helps us see how the Label
ID List file is organized. We begin, of course, by searching the secondary
key index of composers for Beethoven. The record that we find points us

to relative record number (RRN) 3 in the Label ID List file. Since this is a fixed-length record file, it is easy to jump to RRN 3 and read in its Label ID (ANG3795). Associated with this Label ID is a link to a record with RRN 8. We read in the Label ID for that record, adding it to our list (ANG3795 DG139201). We continue following links and collecting Label IDs until the list looks like this:

    ANG3795    DG139201    DG18807    RCA2626

The link field in the last record read from the Label ID List file contains a value of -1. As in our earlier programs, this indicates end-of-list, so we know that we now have all the Label ID references for Beethoven records.
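A compact C sketch of the two record layouts and of the link-following loop might look like this. It is our own illustration under simplifying assumptions (both files already loaded into in-memory arrays, -1 marking end-of-list); it is not the book's implementation.

    #include <stdio.h>

    /* One entry of the Secondary Index file: key plus RRN of first reference. */
    typedef struct { char key[21];      int head_rrn; } INDEX_ENTRY;

    /* One entry of the Label ID List file: reference plus link to next RRN.   */
    typedef struct { char label_id[13]; int next_rrn; } LIST_ENTRY;

    /* Follow the linked list for one secondary key, printing each Label ID.
       A next_rrn of -1 marks the end of the list.                             */
    void print_references(const INDEX_ENTRY *entry, const LIST_ENTRY list[])
    {
        int rrn = entry->head_rrn;
        while (rrn != -1) {
            printf("%s\n", list[rrn].label_id);
            rrn = list[rrn].next_rrn;
        }
    }

    int main(void)
    {
        LIST_ENTRY  list[] = { { "ANG3795", 1 }, { "DG18807", -1 } };
        INDEX_ENTRY beethoven = { "BEETHOVEN", 0 };
        print_references(&beethoven, list);   /* ANG3795, then DG18807 */
        return 0;
    }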

To illustrate how record addition affects the Secondary Index and Label ID List files, we add the Prokofiev recording mentioned earlier:

    ANG 36193    Piano Concertos    Prokofiev    Francois

You can see (Fig. 6.11) that the Label ID for this new recording is the last one in the Label ID List file, since this file is entry sequenced. Before this record is added, there is only one Prokofiev recording. It has a Label ID of LON2312. Since we want to keep the Label ID Lists in order by ASCII character values, the new recording is inserted in the list for Prokofiev so it logically precedes the LON2312 recording.

Associating the Secondary Index file with a new file containing linked lists of references provides some advantages over any of the structures considered up to this point:

The only time we need to rearrange the Secondary Index file is when a new composer's name is added or an existing composer's name is changed (e.g., it was misspelled on input). Deleting or adding recordings for a composer who is already in the index involves changing only the Label ID List file. Deleting all the recordings for a composer could be handled by modifying the Label ID List file, while leaving the entry in the Secondary Index file in place, using a value in its reference field to indicate that the list of entries for this composer is empty.

In the event that we do need to rearrange the Secondary Index file, the task is quicker now since there are fewer records and each record is smaller.

Since there is less need for sorting, it follows that there is less of a penalty associated with keeping the Secondary Index files off on secondary storage, leaving more room in RAM for other data structures.

The Label ID List file is entry sequenced. That means that it never needs to be sorted.

Since the Label ID List file is a fixed-length record file, it would be very easy to implement a mechanism for reusing the space from deleted records, as described in Chapter 5.

FIGURE 6.11 Secondary key index referencing linked lists of primary key references (improved revision of the composer index).

Secondary Index file (one entry per composer, each holding the RRN of the first Label ID in its list):
BEETHOVEN, COREA, DVORAK, PROKOFIEV, RIMSKY-KORSAKOV, SPRINGSTEEN, SWEET HONEY IN THE R

Label ID List file (entry sequenced, RRNs 0 through 10; each entry carries a link to the RRN of the next Label ID for the same composer, or -1 at the end of a list):
LON2312, RCA2626, WAR23699, ANG3795, COL38358, DG18807, MER75016, COL31809, DG139201, FF245, ANG36193

There is also at least one potentially significant disadvantage to this kind of file organization: The Label IDs associated with a given composer are no longer guaranteed to be physically grouped together. The technical term for such "togetherness" is locality; with a linked, entry-sequenced structure such as this, it is less likely that there will be locality associated with the logical groupings of reference fields for a given secondary key. Note, for example, that our list of Label IDs for Prokofiev consists of the very last and the very first records in the file. This lack of locality means that picking up the references for a composer that has a long list of references could involve a large amount of seeking back and forth on the disk. Note that this kind of seeking would not be required for our original Secondary Index file structure.

One obvious antidote to this seeking problem is to keep the Label ID List file in memory. This could be expensive and impractical, given many secondary indexes, except for the interesting possibility of using the same Label ID List file to hold the lists for a number of Secondary Index files. Even if the file of reference lists were too large to hold in memory, it might be possible to obtain a performance improvement by holding only a part of the file in memory at a time, paging sections of the file in and out of memory as they are needed.

Several exercises at the end of the chapter explore these possibilities more thoroughly. These are very important problems, since the notion of dividing the index into pages is fundamental to the design of B-trees and other methods for handling large indexes on secondary storage.
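One of the end-of-chapter exercises works with 512-byte pages of 32 sixteen-byte records and an eight-page buffer. As a minimal sketch of that idea, and assuming a read-only use of the file (the write-back of changed pages that the exercise calls for is omitted), a least-recently-used page buffer in C might look like this. None of these names come from the text.

    #include <stdio.h>

    #define REC_SIZE      16      /* each Label ID List record is 16 bytes    */
    #define RECS_PER_PAGE 32      /* so a page is 512 bytes                   */
    #define NUM_PAGES      8      /* most recently used pages kept in memory  */

    typedef struct {
        int  valid;                               /* 0 until a page is loaded */
        long page_no;
        long last_used;                           /* for the LRU decision     */
        char data[RECS_PER_PAGE * REC_SIZE];
    } PAGE;

    static PAGE buffer[NUM_PAGES];
    static long clock_ticks = 0;

    /* Return a pointer to record `rrn` of the paged file, reading its page
       into the least recently used buffer slot if it is not already present. */
    char *fetch_record(FILE *fp, long rrn)
    {
        long page_no = rrn / RECS_PER_PAGE;
        int  i, victim = 0;

        for (i = 0; i < NUM_PAGES; i++) {
            if (buffer[i].valid && buffer[i].page_no == page_no) {
                buffer[i].last_used = ++clock_ticks;       /* hit */
                return buffer[i].data + (rrn % RECS_PER_PAGE) * REC_SIZE;
            }
            if (buffer[i].last_used < buffer[victim].last_used)
                victim = i;                        /* remember LRU candidate */
        }
        fseek(fp, page_no * RECS_PER_PAGE * REC_SIZE, SEEK_SET);
        fread(buffer[victim].data, REC_SIZE, RECS_PER_PAGE, fp);
        buffer[victim].valid     = 1;
        buffer[victim].page_no   = page_no;
        buffer[victim].last_used = ++clock_ticks;
        return buffer[victim].data + (rrn % RECS_PER_PAGE) * REC_SIZE;
    }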

6.8 Selective Indexes

Another interesting feature of secondary indexes is that they can be used to divide a file into parts, providing a selective view. For example, it is possible to build a selective index that contains only the titles of classical recordings in the record collection. If we have additional information about the recordings in the data file, such as the date the recording was released, we could build selective indexes such as "recordings released prior to 1970" and "recordings since 1970." Such selective index information could be combined into Boolean AND operations to respond to requests such as, "List all the recordings of Beethoven's Ninth Symphony released since 1970." Selective indexes are sometimes useful when the contents of a file fall naturally and logically into several broad categories.

6.9 Binding

A recurrent and very important question that emerges in the design of file systems that utilize indexes is: At what point in time is the key bound to the physical address of its associated record?

In the file system we are designing in the course of this chapter, the binding of our primary keys to an address takes place at the time the files are constructed. The secondary keys, on the other hand, are bound to an address at the time that they are actually used.

Binding at the time of file construction results in faster access. Once you have found the right index record, you have in hand the byte offset of the data record you are seeking. If we elected to bind our secondary keys to their associated records at the time of file construction, so when we find the DVORAK record in the composer index we would know immediately that the data record begins at byte 353 in the data file, secondary key retrieval would be simpler and faster. The improvement in performance is particularly noticeable if both the primary and secondary index files are used on secondary storage rather than in memory. Given the arrangement we designed, we would have to perform a binary search of the composer index and then a binary search of the primary key index before being able to jump to the data record. Binding early, at file construction time, does away entirely with the need to search on the primary key.

The disadvantage of binding directly in the file, of binding tightly, is that reorganizations of the data file must result in modifications to all bound index files. This reorganization cost can be very expensive, particularly with simple index files in which modification would often mean shifting records. By postponing binding until execution time, when the records are actually being used, we are able to develop a secondary key system that involves a minimal amount of reorganization when records are added or deleted.

Another important advantage of postponing binding until a record is actually retrieved is that this approach is safer. As we see in the system that we set up, associating the secondary keys with reference fields consisting of primary keys allows the primary key index to act as a kind of final check of whether a record is really in the file. The secondary indexes can afford to be wrong. This situation is very different if the secondary index keys are tightly bound, containing addresses. We would then be jumping directly from the secondary key into the data file; the address would need to be right.

This brings up a related safety aspect: It is always more desirable to have to make important changes in one place, rather than having to make them in many places. With a bind-at-retrieval-time scheme such as we developed, we need to remember to make a change in only one place, the primary key index, if we move a data record. With a more tightly bound system, we have to make many changes successfully to keep the system internally consistent, braving power failures, user interruptions, and so on.

When designing a new file system, it is better to deal with this question of binding intentionally and early in the design process, rather than letting the binding just happen. In general, tight, in-the-data binding is most attractive when

The data file is static or nearly so, requiring little or no adding, deleting, and updating of records; and

Rapid performance during actual retrieval is a high priority.

For example, tight binding is desirable for file organization on a mass-produced, read-only optical disk. The addresses will never change since no new records can ever be added; consequently, there is no reason not to obtain the extra performance associated with tight binding.

For file applications in which record addition, deletion, and updating do occur, however, binding at retrieval time is usually the more desirable option. Postponing binding for as long as possible usually makes these operations simpler and safer. If the file structures are carefully designed, and, in particular, if the indexes use more sophisticated organizations such as B-trees, retrieval performance is usually quite acceptable, even given the additional work required by a bind-at-retrieval system.

SUMMARY

We began this chapter with the assertion that indexing a file, as an alternative to sorting, is a way of structuring a file so records can be found by key. Unlike sorting, indexing permits us to perform binary searches for keys in variable-length record files. If the index can be held in memory, record addition, deletion, and retrieval can be done much more quickly with an indexed, entry-sequenced file than with a sorted file.

Indexes can do much more than merely improve on access time: They can provide us with new capabilities that are inconceivable with access methods based on sorted data records. The most exciting new capability involves the use of multiple secondary indexes. Just as a library card catalog allows us to regard a collection of books in author order, title order, or subject order, so index files allow us to maintain different views of the records in a data file. We find that we cannot only use secondary indexes to obtain different views of the file, but that we can also combine the associated lists of primary key references and thereby combine particular views.

In this chapter we address the problem of how to rid our secondary indexes of two liabilities:

The need to repeat duplicate secondary keys; and

The need to rearrange the secondary indexes every time a record is added to the data file.

A first solution to these problems involves associating a fixed-size vector of reference fields with each secondary key. This solution results in an overly large amount of internal fragmentation but serves to illustrate the attractiveness of handling the reference fields associated with a particular secondary key as a group, or list.

Our next iteration of solutions to our secondary index problems is more successful and much more interesting. We can treat the primary key references themselves as an entry-sequenced file, forming the necessary lists through the use of link fields associated with each primary record entry. This allows us to create a secondary index file that, in the case of the composer index, needs rearrangement only when we add new composers to the data file. The entry-sequenced file of linked reference lists never requires sorting. We call this kind of secondary index structure an inverted list.

There are also, of course, disadvantages associated with our new solution. The most serious disadvantage is that our file demonstrates less locality: Lists of associated records are less likely to be physically adjacent. A good antidote to this problem is to hold the file of linked lists in memory. We note that this is made more plausible because a single file of primary references can link the lists for a number of secondary indexes.

As indicated by the length and breadth of our consideration of secondary indexing, multiple keys, and inverted lists, these topics are among the most interesting aspects of indexed access to files. The concepts of secondary indexes and inverted lists become even more powerful later, as we develop index structures that are themselves more powerful than the simple indexes that we consider here. But, even so, we already see that for small files consisting of no more than a few thousand records, approaches to inverted lists that rely merely on simple indexes can provide a user with a great deal of capability and flexibility.

KEY TERMS

Binding. Binding takes place when a key is associated with a particular physical record in the data file. In general, binding can take place either during the preparation of the data file and indexes or during program execution. In the former case, which is called tight binding, the indexes contain explicit references to the associated physical data record. In the latter case, the connection between a key and a particular physical record is postponed until the record is actually retrieved in the course of program execution.

Entry-sequenced file. A file in which the records occur in the order that they are entered into the file.

Index. An index is a tool for finding records in a file. It consists of a key field on which the index is searched and a reference field that tells where to find the data file record associated with a particular key.

Inverted list. The term inverted list refers to indexes in which a key may be associated with a list of reference fields pointing to documents that contain the key. The secondary indexes developed toward the end of this chapter are examples of inverted lists.

Key field. The key field is the portion of an index record that contains the canonical form of the key that is being sought.

Locality. Locality exists in a file when records that will be accessed in a given temporal sequence are found in physical proximity to each other on the disk. Increased locality usually results in better performance, since records that are in the same physical area can often be brought into memory with a single read request to the disk.

Reference field. The reference field is the portion of an index record that contains information about where to find the data record containing the information listed in the associated key field of the index.

Selective index. A selective index contains keys for only a portion of the records in the data file. Such an index provides the user with a view of a specific subset of the file's records.

Simple index. All the index structures discussed in this chapter are simple indexes insofar as they are all built around the idea of an ordered, linear sequence of index records. All these simple indexes share a common weakness: Adding records to the index is expensive. As we see later, tree-structured indexes provide an alternate, more efficient solution to this problem.

EXERCISES

1. Until now, it was not possible to perform a binary search on a variable-length record file. Why does indexing make binary search possible? With a fixed-length record file it is possible to perform a binary search. Does this mean that indexing need not be used with fixed-length record files?

2. Why is title not used as a primary key in the data file described in this chapter? If it were used as a secondary key, what problems would have to be considered in deciding on a canonical form for titles?

3. What is the purpose of keeping an out-of-date-status flag in the header record of an index? In a multiprogramming environment, this flag might be found to be set by one program because another program is in the process of reorganizing the index. How should the first program respond to this situation?

4. Explain how the use of an index pins the data records in a file.

5. When a record in a data file is updated, corresponding primary and secondary key indexes may or may not have to be altered, depending on whether the file has fixed- or variable-length records, and depending on the type of change made to the data record. Make a list of the different updating situations that can occur, and explain how each affects the indexes.

6. Discuss the problem that occurs when you add the following recording to the recordings file, assuming that the composer index shown in Fig. 6.9 is used. How might you solve the problem without substantially changing the secondary key index structure?

    LON 1259    Fidelio    Beethoven    Maazel

7. What is an inverted list, and when is it useful?

8. How are the structures in Fig. 6.11 changed by the addition of the recording

    LON 1259    Fidelio    Beethoven    Maazel

9. Suppose you have the data file described in this chapter, greatly expanded, with a primary key index and secondary key indexes organized by composer, artist, and title. Suppose that an inverted list structure is used to organize the secondary key indexes. Give step-by-step descriptions of how a program might answer the following queries:
a. List all recordings of Bach or Beethoven; and
b. List all recordings by Perleman of pieces by Mozart or Joplin.

10. One possible antidote to the problem of diminished locality when using inverted lists is to use the same Label ID List file to hold the lists for several of the secondary index files. This increases the likelihood that the secondary indexes can be kept in primary memory. Draw a diagram of a single Label ID List file that can be used to hold references for both the secondary index of composers and the secondary index of titles. How would you handle the difficulties that this arrangement presents with regard to maintaining the Label ID List file?

11. Discuss the following structures as antidotes to the possible loss of locality in a secondary key index:
a. Leave space for multiple references for each secondary key (Fig. 6.9).
b. Allocate variable-length records for each secondary key value, where each record contains the secondary key value, followed by the Label IDs, followed by free space for later addition of new Label IDs. The amount of free space left could be fixed, or it could be a function of the size of the original list of Label IDs.

12. The method and timing of binding affect two important attributes of a file system: speed and flexibility. Discuss the relevance of these attributes, and the effect of binding time on them, for a hospital patient information system designed to provide information about current patients by patient name, patient ID, location, medication, doctor or doctors, and illness.

Programming and Design Exercises

13. Implement the retrieve_record( ) procedure outlined in Fig. 6.4.

14. In solving the preceding problem, you have to create a mechanism for deciding how many bytes to read from the Datafile for each record. At least four options are open to you:
a. Jump to the byte_offset, read the size field, then use this information to read the record.
b. Build an index file that contains a record size field that reflects the true size of the data record, including the size field carried in the Datafile. Use the size field carried in the index file to decide how many bytes to read.
c. Follow much the same strategy as in option (b), except use a Datafile that does not contain internal size fields.
d. Jump to the byte_offset and read a fixed, overly large number of bytes (e.g., 512 bytes). Once these bytes are read into a memory buffer, use the size field at the start of the buffer to decide how many bytes to break out of the buffer.
Evaluate each of these options, listing the advantages and disadvantages of each.

15. Implement procedures to read the index file in to the INDEX array and to write the INDEX array back to the index file.

16. When searching secondary indexes that contain multiple records for some of the keys, we do not want to find just any record for a given secondary key; we want to find the first record containing that key. Finding the first record allows us to read ahead, sequentially, extracting all of the records for the given key. Write a variation of a binary search function that returns the relative record number of the first record containing the given key. The function should return a negative value if the key cannot be found.

17. If a Label ID List file such as the one shown in Fig. 6.11 is too large to be held in memory in its entirety, it might still be possible to improve its performance by retaining a number of blocks of the file in memory. These blocks are called pages. Since the records in the Label ID List file are each 16 bytes long, a page might consist of 32 records (512 bytes). Write a function that would hold the most recently used eight pages in memory. Calls for a specific record from the Label ID List file would be routed through this function. It would check to see if the record exists in one of the pages that is already in memory. If so, the function would return the values of the record fields immediately. If not, the function would read in the page containing the desired record, either writing out or dumping the page that was used least recently. Clearly, if a page has been changed, it needs to be written out rather than dumped. When the program is over, all pages still in memory must be checked to see if they should be written out.

18. Assuming the use of a paged index as described in the preceding problem, and given that the Label ID List file is entry sequenced, is there any particular order of data entry (initial loading) that tends to give better performance than other methods? How does the use of an organization method such as that described in problem 10, which combines the linked lists from several secondary indexes into a single file, affect your answer about performance?

19. The Label ID List file is entry sequenced. Development of paging schemes is simpler for entry-sequenced files than for files that are kept in sorted order. List the additional difficulties involved in the design of a paging system for a sorted index, such as the primary key index. Accepting the possibility that there will be a number of pages that are only partially full, how will you handle the insertion of a new key when the page in which it belongs is full? Design such a paging system.

FURTHER READINGS

We have much more to say about indexing in later chapters, where we take up the subjects of tree-structured indexes and of indexed sequential file organizations. The topics developed in the current chapter, particularly those relating to secondary indexes and inverted files, are also covered by many other file and data structure texts. The few texts that we list here are of interest because they either develop certain topics in more detail or present the material from a different viewpoint.

Wiederhold (1983) provides a survey of many of the index structures we discuss, along with a number of others. His treatment is more mathematical than that provided in our text. Users interested in looking at indexed files in the context of PL/I and of large IBM mainframes will want to see Bradley (1982). A brief, readable overview of a number of different file organizations is provided in J. D. Ullman (1980).

Tremblay and Sorenson (1984) provide a comparison of inverted list structures with an alternative organization called multilist files. M. E. S. Loomis (1983) provides a similar discussion, along with some examples oriented toward COBOL users. Salton and McGill (1983) discuss inverted lists in the context of their application in information retrieval systems.

Cosequential Processing
and the Sorting of
Large Files

CHAPTER OBJECTIVES

Describe a class of frequently used processing activities known as cosequential processes.

Provide a general model for implementing all varieties of cosequential processes.

Illustrate the use of the model to solve a number of different kinds of cosequential processing problems, including problems other than simple merges and matches.

Introduce heapsort as an approach to overlapping I/O with sorting in RAM.

Show how merging provides the basis for sorting very large files.

Examine the costs of K-way merges on disk and find ways to reduce those costs.

Introduce the notion of replacement selection.

Examine some of the fundamental concerns associated with sorting large files using tapes rather than disks.

Introduce UNIX utilities for sorting, merging, and cosequential processing.

CHAPTER OUTLINE

7.1 A Model for Implementing Cosequential Processes
    7.1.1 Matching Names in Two Lists
    7.1.2 Merging Two Lists
    7.1.3 Summary of the Cosequential Processing Model
7.2 Application of the Model to a General Ledger Program
    7.2.1 The Problem
    7.2.2 Application of the Model to the Ledger Program
7.3 Extension of the Model to Include Multiway Merging
    7.3.1 A K-way Merge Algorithm
    7.3.2 A Selection Tree for Merging Large Numbers of Lists
7.4 A Second Look at Sorting in RAM
    7.4.1 Overlapping Processing and I/O: Heapsort
    7.4.2 Building the Heap while Reading in the File
    7.4.3 Sorting while Writing out to the File
7.5 Merging as a Way of Sorting Large Files on Disk
    7.5.1 How Much Time Does a Merge Sort Take?
    7.5.2 Sorting a File That Is Ten Times Larger
    7.5.3 The Cost of Increasing the File Size
    7.5.4 Hardware-based Improvements
    7.5.5 Decreasing the Number of Seeks Using Multiple-step Merges
    7.5.6 Increasing Run Lengths Using Replacement Selection
    7.5.7 Replacement Selection Plus Multistep Merging
    7.5.8 Using Two Disk Drives with Replacement Selection
    7.5.9 More Drives? More Processors?
    7.5.10 Effects of Multiprogramming
    7.5.11 A Conceptual Toolkit for External Sorting
7.6 Sorting Files on Tape
    7.6.1 The Balanced Merge
    7.6.2 The K-way Balanced Merge
    7.6.3 Multiphase Merges
    7.6.4 Tapes versus Disks for External Sorting
7.7 Sort-Merge Packages
7.8 Sorting and Cosequential Processing in UNIX
    7.8.1 Sorting and Merging in UNIX
    7.8.2 Cosequential Processing Utilities in UNIX

Cosequential operations involve the coordinated processing of two or more sequential lists to produce a single output list. Sometimes the processing results in a merging, or union, of the input lists; sometimes the goal is a matching, or intersection, of the lists; and other times the operation is a combination of matching and merging. These kinds of operations on sequential lists are the basis of a great deal of file processing.

In the first half of this chapter we develop a general model for doing cosequential operations, illustrate its use for simple matching and merging operations, and then apply it to the development of a more complex general ledger program. Next we apply the model to multiway merging, which is an essential component of external sort-merge operations. We conclude the chapter with a discussion of external sort-merge procedures, strategies, and trade-offs, paying special attention to performance considerations.

7.1 A Model for Implementing Cosequential Processes

Cosequential operations usually appear to be simple to construct; given the information that we provide in this chapter, this appearance of simplicity can be turned into reality. However, it is also true that approaches to cosequential processing are often confused, poorly organized, and incorrect. These examples of bad practice are by no means limited to student programs; the problems also arise in commercial programs and in textbooks. The difficulty with these incorrect programs is usually that they are not organized around a single, clear model for cosequential processing. Instead, they seem to deal with the various exception conditions and problems in a cosequential process in an ad hoc rather than systematic way. This section addresses such lack of organization head on. We present a single, simple model that can be the basis for the construction of any kind of cosequential process. By understanding and adhering to the design principles inherent in the model, you will be able to write cosequential procedures that are simple, short, and robust.

7.1.1 Matching Names in Two Lists

Suppose we want to output the names common to the two lists shown in Fig. 7.1. This operation is usually called a match operation, or an intersection. We assume, for the moment, that we will not allow duplicate keys within a list, and that the lists are sorted in ascending order.

We begin by reading in the initial name from each list, and we find that they match. We output this first name as a member of the match set, or intersection set. We then read in the next name from each list. This time the name in List 2 is less than the name in List 1. When we are processing these lists visually, as we are now, we remember that we are trying to match the name CARTER from List 1, and scan down List 2 until we either find it or jump beyond it. In this case, we eventually find a match for CARTER, so we output it, read in the next name from each list, and continue the process. Eventually we come to the end of one of the lists. Since we are looking for names common to both lists, we know we can stop at this point.

Although the match procedure appears to be quite simple, there are a number of matters that have to be dealt with to make it work reasonably well.
FIGURE 7.1 Sample input lists for cosequential operations.

List 1          List 2
ADAMS           ADAMS
ANDERSON        CARTER
ANDREWS         CHIN
BECH            DAVIS
BURNS           FOSTER
CARTER          GARWICK
DAVIS           JAMES
DEMPSEY         JOHNSON
GRAY            KARNS
JAMES           LAMBERT
JOHNSON         MILLER
KATZ            PETERS
PETERS          RESTON
ROSEWALD        ROSEWALD
SCHMIDT         TURNER
THAYER
WALKER
WILLIS

Initializing: We need to arrange things in such a way that the procedure gets going properly.

Synchronizing: We have to make sure that the current name from one list is never so far ahead of the current name on the other list that a match will be missed. Sometimes this means reading the next name from List 1, sometimes from List 2, sometimes from both lists.

Handling end-of-file conditions: When we get to the end of either file 1 or file 2, we need to halt the program.

Recognizing errors: When an error occurs in the data (e.g., duplicate names or names out of sequence) we want to detect it and take some action.

Finally, we would like our algorithm to be reasonably efficient, simple, and easy to alter to accommodate different kinds of data. The key to accomplishing these objectives in the model we are about to present lies in the way we deal with the second item in our list: synchronization.

At each step in the processing of the two lists, we can assume that we have two names to compare: a current name from List 1 and a current name from List 2. Let's call these two current names NAME_1 and NAME_2. We can compare the two names to determine whether NAME_1 is less than, equal to, or greater than NAME_2:
If NAME_1 is less than NAME_2, we read the next name from List 1;

If NAME_1 is greater than NAME_2, we read the next name from List 2; and

If the names are the same, we output the name and read the next names from the two lists.

turns out that this can be handled very cleanly with a single loop

one three-way conditional statement, as illustrated in the


7.2. The key feature of this algorithm is that control always
the head of the main loop after every step of the operation. This means

containing

algorithm in Fig.
returns to

that
1

no extra

gets ahead

logic

of

is

required within the loop to handle the case

List 2, or List 2 gets

ahead of List

1,

PROGRAM: match
    call initialize() procedure to:
        - open input files LIST_1 and LIST_2
        - create output file OUT_FILE
        - set MORE_NAMES_EXIST to TRUE
        - initialize sequence checking variables
    call input() to get NAME_1 from LIST_1
    call input() to get NAME_2 from LIST_2

    while (MORE_NAMES_EXIST)
        if (NAME_1 < NAME_2)
            call input() to get NAME_1 from LIST_1
        else if (NAME_1 > NAME_2)
            call input() to get NAME_2 from LIST_2
        else                        /* match -- names are the same */
            write NAME_1 to OUT_FILE
            call input() to get NAME_1 from LIST_1
            call input() to get NAME_2 from LIST_2
        endif
    endwhile
    finish_up()
end PROGRAM

FIGURE 7.2 Cosequential match procedure based on a single loop.
Since each pass through the main loop looks at the next pair of names, the fact that one list may be longer than the other does not require any special logic. Nor does the end-of-file condition; the while statement simply checks the MORE_NAMES_EXIST flag on every cycle.

The logic inside the loop is equally simple. Only three possible conditions can exist after reading a name; the if-else logic handles all of them. Since we are implementing a match process here, output occurs only when the names are the same.

Note that the main program does not concern itself with such matters as sequence checking and end-of-file detection. Since their presence in the main loop would only obscure the main synchronization logic, they have been relegated to subprocedures.

Since the end-of-file condition is detected during input, the setting of the MORE_NAMES_EXIST flag is done in the input() procedure. The input() procedure can also be used to check the condition that the lists be in strictly ascending order (no duplicate entries within a list). The algorithm in Fig. 7.3 illustrates one method of handling these tasks.

FIGURE 7.3 Input routine for match procedure.

PROCEDURE: input()       /* input routine for MATCH procedure */

input arguments:
    INP_FILE             file descriptor for input file to be used
                         (could be LIST_1 OR LIST_2)
    PREVIOUS_NAME        last name read from this list

arguments used to return values:
    NAME                 name to be returned from input procedure
    MORE_NAMES_EXIST     flag used by main loop to halt processing

read next NAME from INP_FILE

/* check for end of file, duplicate names, names out of order */
if (EOF)
    MORE_NAMES_EXIST := FALSE        /* set flag to end processing */
else if (NAME <= PREVIOUS_NAME)
    issue sequence check error
    abort processing
endif

PREVIOUS_NAME := NAME
end PROCEDURE
PROCEDURE: initialize()

arguments used to return values:
    PREV_1, PREV_2       previous name variables for the 2 lists
    LIST_1, LIST_2       file descriptors for input files to be used
    MORE_NAMES_EXIST     flag used by main loop to halt processing

/* set both the previous_name variables (one for each list) to
   a value that is guaranteed to be less than any input value */
PREV_1 := LOW_VALUE
PREV_2 := LOW_VALUE

open file for List 1 as LIST_1
open file for List 2 as LIST_2

if (both open statements succeed)
    MORE_NAMES_EXIST := TRUE

end PROCEDURE
FIGURE 7.4 Initialization procedure for cosequential processing.

This "filling out" of the input() procedure also indicates the arguments that the procedure would use. All we need now to complete the logic of the main cosequential match procedure is a description of the initialize() procedure that begins the processing. The initialize() procedure, shown in Fig. 7.4, performs three tasks:

1. It opens the input and output files.
2. It sets the MORE_NAMES_EXIST flag to TRUE.
3. It sets the previous_name variables (one for each list) to a value that is guaranteed to be less than any input value.

The effect of setting PREV_1 and PREV_2 to the special LOW_VALUE is that the input() procedure does not need to treat the reading of the first two records in any special way.

Given these program fragments, you should be able to work through the lists provided in Fig. 7.1, following the pseudocode, and demonstrate to yourself that these simple procedures can handle the various resynchronization problems that these sample lists present.

7.1.2 Merging Two Lists

The three-way test, single-loop model for cosequential processing can easily be modified to handle merging of lists as well as matching, as illustrated in Fig. 7.5.

PROGRAM: merge
    call initialize() procedure to:
        - open input files LIST_1 and LIST_2
        - create output file OUT_FILE
        - set MORE_NAMES_EXIST to TRUE
        - initialize sequence checking variables
    call input() to get NAME_1 from LIST_1
    call input() to get NAME_2 from LIST_2

    while (MORE_NAMES_EXIST)
        if (NAME_1 < NAME_2)
            write NAME_1 to OUT_FILE
            call input() to get NAME_1 from LIST_1
        else if (NAME_1 > NAME_2)
            write NAME_2 to OUT_FILE
            call input() to get NAME_2 from LIST_2
        else                        /* match -- names are the same */
            write NAME_1 to OUT_FILE
            call input() to get NAME_1 from LIST_1
            call input() to get NAME_2 from LIST_2
        endif
    endwhile
    finish_up()
end PROGRAM

FIGURE 7.5 Cosequential merge procedure based on a single loop.

Note that we now produce output for every case of the if-else construction since the merge is a union of the list contents.

An important difference between matching and merging is that with merging we must read completely through each of the lists. This necessitates a change in our input() procedure, since the version used for matching sets the MORE_NAMES_EXIST flag to FALSE as soon as we detect end-of-file for one of the lists. We need to keep this flag set to TRUE as long as there are records in either list. At the same time, we must recognize that one of the lists has been read completely, and we should avoid trying to read from it again. Both of these goals can be achieved if we simply set the NAME variable for the completed list to some value that

Cannot possibly occur as a legal input value; and

Has a higher collating sequence value than any possible legal input value. In other words, this special value would come after all legal input values in the file's ordered sequence.

We refer to this special value as HIGH_VALUE. The pseudocode in Fig. 7.6 shows how HIGH_VALUE can be used to ensure that both input files are read to completion. Note that we have to add the argument OTHER_LIST_NAME to the argument list so the function knows whether the other input list has reached its end.

FIGURE 7.6 Input routine for merge procedure.

PROCEDURE: input()       /* input routine for MERGE procedure */

input arguments:
    INP_FILE             file descriptor for input file to be used
                         (could be LIST_1 OR LIST_2)
    PREVIOUS_NAME        last name read from this list
    OTHER_LIST_NAME      most recent name read from the other list

arguments used to return values:
    NAME                 name to be returned from input procedure
    MORE_NAMES_EXIST     flag used by main loop to halt processing

read next NAME from INP_FILE

if (EOF and OTHER_LIST_NAME == HIGH_VALUE)    /* end of both lists    */
    MORE_NAMES_EXIST := FALSE
else if (EOF)                                 /* just this list ended */
    NAME := HIGH_VALUE
else if (NAME <= PREVIOUS_NAME)               /* sequence check       */
    issue sequence check error
    abort processing
endif

PREVIOUS_NAME := NAME
end PROCEDURE
Once again, you should use this logic to work, step by step, through the lists provided in Fig. 7.1 to see how the resynchronization is handled and how the use of the HIGH_VALUE forces the procedure to finish both lists before terminating. Note that the version of input() incorporating the HIGH_VALUE logic can also be used for matching procedures, producing correct results. The only disadvantage to doing so is that the matching procedure would no longer terminate as soon as one list is completely processed, but would go through the extra work of reading all the way through the unmatched entries at the end of the other list.
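For readers who want to see the synchronization loop in a real language rather than pseudocode, here is one possible C rendering of the merge, using a HIGH_VALUE sentinel exactly as just described. It is our own sketch, not the book's code: the names (get_name, the "~~~~" sentinel) are our inventions, sequence checking is omitted, and the program assumes one whitespace-delimited name per line in two already-sorted files named on the command line.

    #include <stdio.h>
    #include <string.h>

    #define NAME_LEN   40
    #define HIGH_VALUE "~~~~"   /* collates after any legal alphabetic name */

    /* Read the next name from fp, or deliver HIGH_VALUE once fp is exhausted. */
    static void get_name(FILE *fp, char *name)
    {
        if (fscanf(fp, "%39s", name) != 1)
            strcpy(name, HIGH_VALUE);
    }

    int main(int argc, char *argv[])
    {
        FILE *list1, *list2;
        char name1[NAME_LEN], name2[NAME_LEN];

        if (argc != 3 || !(list1 = fopen(argv[1], "r")) ||
                         !(list2 = fopen(argv[2], "r"))) {
            fprintf(stderr, "usage: merge list1 list2\n");
            return 1;
        }
        get_name(list1, name1);
        get_name(list2, name2);

        /* Three-way test, single-loop synchronization. */
        while (strcmp(name1, HIGH_VALUE) != 0 || strcmp(name2, HIGH_VALUE) != 0) {
            int cmp = strcmp(name1, name2);
            if (cmp < 0) {                     /* name1 comes first */
                printf("%s\n", name1);
                get_name(list1, name1);
            } else if (cmp > 0) {              /* name2 comes first */
                printf("%s\n", name2);
                get_name(list2, name2);
            } else {                           /* match: output once, advance both */
                printf("%s\n", name1);
                get_name(list1, name1);
                get_name(list2, name2);
            }
        }
        fclose(list1);
        fclose(list2);
        return 0;
    }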

With these two examples, we have covered all of the pieces of our model. Now let us summarize the model before adapting it to a more complex problem.

7.1.3 Summary of the Cosequential Processing Model

Generally speaking, the model can be applied to problems that involve the performance of set operations (union, intersection, and more complex processes) on two or more sorted input files to produce one or more output files. In this summary of the cosequential processing model, we assume that there are only two input files and one output file. It is important to understand that the model makes certain general assumptions about the nature of the data and type of problem to be solved. Here is a list of the assumptions, together with clarifying comments.

Assumption: Two or more input files are to be processed in a parallel fashion to produce one or more output files.
Comment: In some cases an output file may be the same file as one of the input files.

Assumption: Each file is sorted on one or more key fields, and all files are ordered in the same ways on the same fields.
Comment: It is not necessary that all files have the same record structures.

Assumption: In some cases, there must exist a high key value that is greater than any legitimate record key, and a low key value that is less than any legitimate record key.
Comment: The use of a high key value and a low key value is not absolutely necessary, but can help avoid the need to deal with beginning-of-file and end-of-file conditions as special cases, hence decreasing complexity.

Assumption: Records are to be processed in logical sorted order.
Comment: The physical ordering of records is irrelevant to the model, but in practice it may be very important to the way the model is implemented. Physical ordering can have a large impact on processing efficiency.

Assumption: For each file there is only one current record. This is the record whose key is accessible within the main synchronization loop.
Comment: The model does not prohibit looking ahead or looking back at records, but such operations should be restricted to subprocedures and should not be allowed to affect the structure of the main synchronization loop.

Assumption: Records can be manipulated only in internal memory.
Comment: A program cannot alter a record in place on secondary storage.
Given these assumptions, here are the essential components of the model:

1. Initialization. Previous_key values for all files are set to the low value; current records for all files are read from the first logical records in the respective files.

2. One main synchronization loop is used, and the loop continues as long as relevant records remain.

3. Within the body of the main synchronization loop is a selection based on comparison of the record keys from the respective input file records. If there are two input files, the selection takes a form such as

       if (current_file1_key < current_file2_key) then
           . . .
       else if (current_file1_key > current_file2_key) then
           . . .
       else                              /* current keys equal */
           . . .
       endif

4. Input files and output files are sequence checked by comparing the previous_key value with the current_key value when a record is read in. After a successful sequence check, previous_key is set to current_key to prepare for the next input operation on the corresponding file.

5. High values are substituted for actual key values when end-of-file occurs. The main processing loop terminates when high values have occurred for all relevant input files. The use of high values eliminates the need to add special code to deal with each end-of-file condition. (This step is not needed in a pure match procedure, since a match procedure halts when the first end-of-file condition is encountered.)

6. All possible I/O and error detection activities are to be relegated to subprocesses, so the details of these activities do not obscure the principal processing logic.
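As a concrete illustration of these components, here is a minimal C sketch of the model applied to merging two sorted lists of names. The file names, the HIGH_VALUE sentinel, and the input() helper are assumptions made for this example, not code from the text; a full program would also push output and error handling down into subprocedures, as point 6 suggests.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXLEN 64
    static const char HIGH_VALUE[] = "\x7f";   /* collates after any real name */

    /* Read the next name; substitute HIGH_VALUE at end of file; sequence check. */
    static int input(FILE *fp, char *name, char *prev, const char *other, int *more)
    {
        char line[MAXLEN];
        if (fgets(line, sizeof line, fp) == NULL) {
            if (strcmp(other, HIGH_VALUE) == 0)
                *more = 0;                      /* end of both lists            */
            strcpy(name, HIGH_VALUE);           /* just this list ended         */
            return 0;
        }
        line[strcspn(line, "\n")] = '\0';
        if (strcmp(line, prev) <= 0) {           /* sequence check               */
            fprintf(stderr, "sequence error: %s\n", line);
            exit(1);
        }
        strcpy(name, line);
        strcpy(prev, line);
        return 1;
    }

    int main(void)
    {
        FILE *f1 = fopen("list1.txt", "r"), *f2 = fopen("list2.txt", "r");
        char n1[MAXLEN], n2[MAXLEN], p1[MAXLEN] = "", p2[MAXLEN] = "";
        int more = 1;

        if (f1 == NULL || f2 == NULL) return 1;
        strcpy(n2, "");                          /* nothing read yet from list 2 */
        input(f1, n1, p1, n2, &more);            /* initialization               */
        input(f2, n2, p2, n1, &more);
        while (more) {                           /* three-way test, single loop  */
            if (strcmp(n1, n2) < 0) {
                printf("%s\n", n1);
                input(f1, n1, p1, n2, &more);
            } else if (strcmp(n1, n2) > 0) {
                printf("%s\n", n2);
                input(f2, n2, p2, n1, &more);
            } else {                             /* duplicates: output once,     */
                printf("%s\n", n1);              /* then advance both lists      */
                input(f1, n1, p1, n2, &more);
                input(f2, n2, p2, n1, &more);
            }
        }
        fclose(f1); fclose(f2);
        return 0;
    }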

This three-way test, single-loop model for creating cosequential processes is both simple and robust. You will find very few applications requiring the coordinated sequential processing of two files that cannot be handled neatly and efficiently with the model. We now look at a problem that is much more complex than a simple match or merge, but that nevertheless lends itself nicely to solution by means of the model.

7.2 Application of the Model to a General Ledger Program

7.2.1 The Problem
Suppose we are given the problem of designing a general ledger program as part of an accounting system. The system includes a journal file and a ledger file. The ledger contains the month-by-month summaries of the values associated with each of the bookkeeping accounts. A sample portion of the ledger, containing only checking and expense accounts, is illustrated in Fig. 7.7.

FIGURE 7.7 Sample ledger fragment containing checking and expense accounts.

Acct. no.   Account title                  Jan        Feb        Mar        Apr

101         Checking account #1         1032.57    2114.56    5219.23
102         Checking account #2          543.78    3094.17    1321.20

505         Advertising expense           25.00      25.00      25.00
510         Auto expenses                195.40     307.92     501.12
515         Bank charges                   0.00       5.00       5.00
520         Books and publications        27.95      27.95      87.40
525         Interest expense             103.50     255.20     380.27
530         Legal expense                 12.45      17.87      23.87
535         Miscellaneous expense         57.50     105.25     138.37
540         Office expense                21.00      27.63      57.45
545         Postage and shipping         500.00    1000.00    1500.00
550         Rent                         112.00     167.50     241.80
555         Supplies                      62.76     198.12     307.74
560         Travel and entertainment      84.89     190.60     278.48
565         Utilities


Acct.                                                               Debit/
no.    Check no.   Date       Description                          credit

101    1271        04/02/86   Auto expense                        -  78.70
510    1271        04/02/86   Tune up and minor repair               78.70
101    1272        04/02/86   Rent                                - 500.00
550    1272        04/02/86   Rent for April                        500.00
101    1273        04/04/86   Advertising                         -  87.50
505    1273        04/04/86   Newspaper ad re: new product           87.50
102    670         04/07/86   Office expense                      -  32.78
540    670         04/07/86   Printer ribbons (6)                    32.78
101    1274        04/09/86   Auto expense                        -  12.50
510    1274        04/09/86   Oil change                             12.50

FIGURE 7.8 Sample journal entries.

The journal file contains the monthly transactions that are ultimately to be posted to the ledger file. Figure 7.8 shows what these journal transactions look like. Note that the entries in the journal file are paired. This is because every check involves both subtracting an amount from the checking account balance and adding an amount to at least one expense account. The accounting program package needs procedures for creating this journal file interactively, probably outputting records to the file as checks are keyed in and then printed.

Once the journal file is complete for a given month, which means that it contains all the transactions for that month, the journal must be posted to the ledger. Posting involves associating each transaction with its account in the ledger. For example, the printed output produced for accounts 101, 102, 505, and 510 during the posting operation, given the journal entries in Fig. 7.8, might look like the output illustrated in Fig. 7.9.

How is the posting process implemented? Clearly, it uses the account number as a key to relate the journal transactions to the ledger records. One possible solution involves building an index for the ledger, so we can work through the journal transactions, using the account number in each journal entry to look up the correct ledger record. But this solution involves seeking back and forth across the ledger file as we work through the journal. Moreover, this solution does not really address the issue of creating the output list, in which all the journal entries relating to an account are collected together. Before we could print out the ledger balances and collect the journal entries for even the first account, 101, we would have to proceed all the way through the journal list. Where would we save the transactions for account 101 as we collect them during this complete pass through the journal?

101   Checking account #1
      1271  04/02/86  Auto expense                  -  78.70
      1272  04/02/86  Rent                          - 500.00
      1273  04/04/86  Advertising                   -  87.50
      1274  04/09/86  Auto expense                  -  12.50
                      Prev. bal: 5219.23     New bal: 4540.53

102   Checking account #2
      670   04/07/86  Office expense                -  32.78
                      Prev. bal: 1321.20     New bal: 1288.42

505   Advertising expense
      1273  04/04/86  Newspaper ad re: new product      87.50
                      Prev. bal:   25.00     New bal:  112.50

510   Auto expenses
      1271  04/02/86  Tune up and minor repair          78.70
      1274  04/09/86  Oil change                        12.50
                      Prev. bal:  501.12     New bal:  592.32

FIGURE 7.9 Sample ledger printout showing the effect of posting from the journal.

A much better solution is to begin by collecting all the journal transactions that relate to a given account. This involves sorting the journal transactions by account number, producing a list ordered as in Fig. 7.10.

FIGURE 7.10 List of journal transactions sorted by account number.

Acct.                                                               Debit/
no.    Check no.   Date       Description                          credit

101    1271        04/02/86   Auto expense                        -  78.70
101    1272        04/02/86   Rent                                - 500.00
101    1273        04/04/86   Advertising                         -  87.50
101    1274        04/09/86   Auto expense                        -  12.50
102    670         04/07/86   Office expense                      -  32.78
505    1273        04/04/86   Newspaper ad re: new product           87.50
510    1271        04/02/86   Tune up and minor repair               78.70
510    1274        04/09/86   Oil change                             12.50
540    670         04/07/86   Printer ribbons (6)                    32.78
550    1272        04/02/86   Rent for April                        500.00


Ledger list                          Journal list

101  Checking account #1             101  1271  Auto expense
102  Checking account #2             101  1272  Rent
505  Advertising expense             101  1273  Advertising
510  Auto expenses                   101  1274  Auto expense
                                     102  670   Office expense
                                     505  1273  Newspaper ad re: new product
                                     510  1271  Tune up and minor repair
                                     510  1274  Oil change

FIGURE 7.11 Conceptual view of cosequential matching of the ledger and journal files.

Now we can create our output list by working through both the ledger and the sorted journal cosequentially, meaning that we process the two lists sequentially and in parallel. This concept is illustrated in Fig. 7.11. As we start working through the two lists, we note that we have an initial match on account number. We know that multiple entries are possible in the journal file, but not in the ledger, so we move ahead to the next entry in the journal. The account numbers still match. We continue doing this until the account numbers no longer match. We then resynchronize the cosequential action by moving ahead in the ledger list.

This matching process seems simple, as it in fact is, as long as every account in one file also appears in another. But there will be ledger accounts for which there is no journal entry, and there can be typographical errors that create journal account numbers that do not actually exist in the ledger. Such situations can make resynchronization more complicated and can result in erroneous output or infinite loops if the programming is done in an ad hoc way. By using the cosequential processing model, we can guard against these problems. Let us now apply the model to our ledger problem.

7.2.2 Application of the Model to the Ledger Program

The ledger program must perform two tasks:

1. It needs to update the ledger file with the correct balance for each account for the current month.
2. It must produce a printed version of the ledger that not only shows the beginning and current balance for each account, but also lists all the journal transactions for the month.

We focus on the second task since it is the most difficult. Let's look again at the form of the printed output, this time extending the output to include a few more accounts, as shown in Fig. 7.12.

101   Checking account #1
      1271  04/02/86  Auto expense                  -  78.70
      1272  04/02/86  Rent                          - 500.00
      1273  04/04/86  Advertising                   -  87.50
      1274  04/09/86  Auto expense                  -  12.50
                      Prev. bal: 5219.23     New bal: 4540.53

102   Checking account #2
      670   04/07/86  Office expense                -  32.78
                      Prev. bal: 1321.20     New bal: 1288.42

505   Advertising expense
      1273  04/04/86  Newspaper ad re: new product      87.50
                      Prev. bal:   25.00     New bal:  112.50

510   Auto expenses
      1271  04/02/86  Tune up and minor repair          78.70
      1274  04/09/86  Oil change                        12.50
                      Prev. bal:  501.12     New bal:  592.32

515   Bank charges
                      Prev. bal:    5.00     New bal:    5.00

520   Books and publications
                      Prev. bal:   87.40     New bal:   87.40

FIGURE 7.12 Sample ledger printout for the first six accounts.

As you can see, the printed output from the ledger program shows the balances of all ledger accounts, whether or not there were transactions for the account. From the point of view of the ledger accounts, the process is a merge, since even unmatched ledger accounts appear in the output.

What about unmatched journal accounts? The ledger accounts and journal accounts are not equal in authority. The ledger file defines the set of legal accounts; the journal file contains entries that are to be posted to the accounts listed in the ledger. The existence of a journal account that does not match a ledger account indicates an error. From the point of view of the journal accounts, the posting process is strictly one of matching. Our procedure needs to implement a kind of combined merging/matching algorithm while simultaneously handling the chores of printing account title lines, individual transactions, and summary balances.

Another difference between the ledger posting operation and the straightforward matching and merging algorithms is that the posting procedure must accept duplicate entries for account numbers in the journal while still treating a duplicate entry in the ledger as an error.


Recall that our earlier matching and merging routines accept keys only in strict ascending order, rejecting all duplicates.

The inherent simplicity of the three-way test, single-loop model works in our favor as we make these modifications. First, let's look at the input functions that we use for the ledger and journal files, identifying the variables that we need for use in the main loop. Figure 7.13 presents pseudocode for the procedure that accepts input from the ledger. We have treated individual variables within the ledger record as return values to draw attention to these variables; in practice the procedure would probably return the entire ledger record to the calling routine so that other procedures could have access to things such as the account title as they print the ledger. We are overlooking such matters here, focusing instead on the variables that are involved in the cosequential logic.

FIGURE 7.13 Input routine for ledger file.

PROCEDURE: ledger_input

input arguments:
    L_FILE              file descriptor for ledger file
    J_ACCT              current value of journal account number

arguments used to return values:
    L_ACCT              account number of new ledger record
    L_BAL               balance for this ledger record
    MORE_RECORDS_EXIST  flag used by main loop to halt processing

static, local variable that retains its value between calls:
    PREV_L_ACCT         last acct number read from ledger file

read next record from L_FILE, assigning values to L_ACCT and L_BAL
if (EOF) and (J_ACCT == HIGH_VALUE)
    MORE_RECORDS_EXIST := FALSE        /* end of both files        */
else if (EOF)
    L_ACCT := HIGH_VALUE               /* just ledger is done      */
else if (L_ACCT <= PREV_L_ACCT)        /* sequence check           */
    issue sequence check error         /* (permit no duplicates)   */
    abort processing
endif
PREV_L_ACCT := L_ACCT
end PROCEDURE


Note that since this function is strictly for use with ledger entries, we can keep track of the previous ledger account number locally within the procedure rather than pass this value in as an argument.
Figure 7.14 outlines the logic for the procedure used to accept input from the journal file. It is similar to the ledger_input() procedure in most respects, including that it returns values for individual variables, even though a working implementation would probably return the entire journal record. Note, however, that the sequence-checking logic is different in journal_input(). In this procedure we need to accept records that have the same account number as previous records. Given these input procedures, we can handle our cosequential processing and output as illustrated in Fig. 7.15.

FIGURE 7.14 Input routine for journal file.

PROCEDURE: journal_input

input arguments:
    J_FILE              file descriptor for journal file
    L_ACCT              current value of ledger account number

arguments used to return values:
    J_ACCT              account number of new journal record
    TRANS_AMT           amount of this journal transaction
    MORE_RECORDS_EXIST  flag used by main loop to halt processing

static, local variable that retains its value between calls:
    PREV_J_ACCT         last acct number read from journal file

read next record from J_FILE, assigning values to J_ACCT and TRANS_AMT
if (EOF) and (L_ACCT == HIGH_VALUE)
    MORE_RECORDS_EXIST := FALSE        /* end of both files        */
else if (EOF)
    J_ACCT := HIGH_VALUE               /* just journal is done     */
else if (J_ACCT < PREV_J_ACCT)         /* sequence check           */
    issue sequence check error         /* (permit duplicates)      */
    abort processing
endif
PREV_J_ACCT := J_ACCT
end PROCEDURE


PROGRAM: ledger

call initialize() procedure to:
    - open input files L_FILE and J_FILE
    - set MORE_RECORDS_EXIST to TRUE

call ledger_input()
PREV_L_BAL := L_BAL                    /* set starting ledger balance      */
                                       /* for this first ledger account    */
call journal_input()

while (MORE_RECORDS_EXIST)
    if (L_ACCT < J_ACCT)               /* we have read all the journal     */
                                       /* entries for this account         */
        print PREV_L_BAL and L_BAL
        call ledger_input()
        if (L_ACCT < HIGH_VALUE)
            print account number and title for new ledger account
            PREV_L_BAL := L_BAL
        endif
    else if (L_ACCT > J_ACCT)          /* bad journal account number       */
        print error message
        call journal_input()
    else                               /* match: add journal transaction   */
                                       /* amount to ledger balance for     */
                                       /* this account                     */
        L_BAL := L_BAL + TRANS_AMT
        output the transaction to the printed ledger
        call journal_input()
    endif
endwhile
end PROGRAM

FIGURE 7.15 Cosequential procedure to process ledger and journal files to produce printed ledger output.

The reasoning behind the three-way test is as follows:

1. If the ledger account is less than the journal account, then there are no more transactions to add to this ledger account (perhaps there were none at all), so we print out the ledger account balances and read in the next ledger account. If the account exists (value < HIGH_VALUE), we print the title line for the new account and update the PREV_L_BAL variable.

2. If the journal account is less than the ledger account, then it is an unmatched journal account, perhaps due to an input error. We print an error message and continue.

3. If the account numbers match, then we have a journal transaction that is to be posted to the current ledger account. We add the transaction amount to the account balance, print the description of the transaction, and then read in the next journal entry.

Note that unlike the match case in either the matching or merging algorithms, we do not read in a new entry from both accounts. This is a reflection of our acceptance of more than one journal entry for a single ledger account.
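The same three-way branch can be sketched in C. The structures and the tiny in-memory data set below are invented for illustration; a real posting program would read the ledger and journal files through input procedures like those in Figs. 7.13 and 7.14 and would also print title lines for each account.

    #include <stdio.h>

    #define HIGH_VALUE 999999

    struct ledger  { int acct; double bal; };
    struct journal { int acct; double amt; };

    int main(void)
    {
        struct ledger  L[] = { {101, 5219.23}, {102, 1321.20}, {505, 25.00},
                               {HIGH_VALUE, 0} };
        struct journal J[] = { {101, -78.70}, {101, -500.00}, {101, -87.50},
                               {101, -12.50}, {102, -32.78}, {505, 87.50},
                               {HIGH_VALUE, 0} };
        int li = 0, ji = 0;
        double prev_bal = L[0].bal;

        while (L[li].acct != HIGH_VALUE || J[ji].acct != HIGH_VALUE) {
            if (L[li].acct < J[ji].acct) {           /* done with this account     */
                printf("%d  prev %.2f  new %.2f\n", L[li].acct, prev_bal, L[li].bal);
                li++;
                prev_bal = L[li].bal;
            } else if (L[li].acct > J[ji].acct) {    /* journal acct not in ledger */
                printf("error: no ledger account %d\n", J[ji].acct);
                ji++;
            } else {                                 /* match: post, advance the   */
                L[li].bal += J[ji].amt;              /* journal side only          */
                printf("    posting %.2f to %d\n", J[ji].amt, L[li].acct);
                ji++;
            }
        }
        return 0;
    }

Running this on the sample data reproduces the new balances shown in Fig. 7.9 for accounts 101, 102, and 505.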

The development of this ledger posting procedure from our basic cosequential processing model illustrates how the simplicity of the model contributes to its adaptability. We can also generalize the model in an entirely different direction, extending it to enable cosequential processing of more than two input files at once. To illustrate this, we now extend the model to include multiway merging.

7.3 Extension of the Model to Include Multiway Merging


The most common
than two input files
lists

to create a single,

as the order

of

more
which we want to merge K input
sequentially ordered output list. Kis often referred to

application of cosequential processes requiring


is

K-way

merge, in

K-way merge.

7.3.1 A K-way Merge Algorithm

Recall the synchronizing loop we use to handle the two-way merge of two lists of names (Fig. 7.5). This merging operation can be viewed as a process of deciding which of two input names has the minimum value, outputting that name, and then moving ahead in the list from which that name is taken. In the event of duplicate input entries, we move ahead in each list. Given a min() function that returns the name with the lowest collating sequence value, there is no reason to restrict the number of input names to two. The procedure could be extended to handle three (or more) input lists, as shown in Fig. 7.16.

Clearly, the expensive part of this procedure is the series of tests to see in which lists the name occurs and which files therefore need to be read.


while (MORE_NAMES_EXIST)
    OUT_NAME = min(NAME_1, NAME_2, NAME_3, ... NAME_K)
    write OUT_NAME to OUT_FILE
    if (NAME_1 == OUT_NAME)
        call input() to get NAME_1 from LIST_1
    if (NAME_2 == OUT_NAME)
        call input() to get NAME_2 from LIST_2
    if (NAME_3 == OUT_NAME)
        call input() to get NAME_3 from LIST_3
        . . .
    if (NAME_K == OUT_NAME)
        call input() to get NAME_K from LIST_K
endwhile

FIGURE 7.16 K-way merge loop, accounting for duplicate names.

Note that since the name can occur in several lists, every one of these K tests must be executed on every cycle through the loop. However, it is often possible to guarantee that a single name, or key, occurs in only one list. In this case, the procedure becomes much simpler and more efficient. Suppose we reference our lists through a vector of list names

    list[1], list[2], list[3], ... list[K]

and suppose we reference the names (or keys) that are being used from these lists at any given point in the cosequential process through another vector:

    name[1], name[2], name[3], ... name[K]

Then the procedure shown in Fig. 7.17 can be used, assuming once again that the input() procedure attends to the MORE_NAMES_EXIST flag.

This procedure clearly differs in many ways from our initial three-way test, single-loop procedure that merges two lists. But, even so, the single-loop parentage is still evident: There is no looping within a list. We determine which list has the key with the lowest value, output that key, move ahead one key in that list, and loop again. The procedure is as simple as it is powerful.


7.3.2 A Selection Tree for Merging Large Numbers of Lists

The K-way merge described in Fig. 7.17 works nicely if K is no larger than 8 or so. When we begin merging a larger number of lists, the set of sequential comparisons to find the key with the minimum value becomes noticeably expensive. We see later that for practical reasons it is rare to want to merge more than eight files at one time, so the use of sequential comparisons is normally a good strategy. If there is a need to merge considerably more than eight lists, we could replace the loop of comparisons with a selection tree.

Use of a selection tree is an example of the classic time versus space trade-off that we so often encounter in computer science. We reduce the time required to find the key with the lowest value by using a data structure to save information about the relative key values across cycles of the procedure's main loop. The concept underlying a selection tree can be readily communicated through a diagram such as that in Fig. 7.18. Here we have used lists in which the keys are numbers rather than names.

FIGURE 7.17 K-way merge loop, assuming no duplicate names.

/* initialize the process by reading in a name from each list */
for i := 1 to K
    call input() to get name[i] from list[i]
next

/* now start the K-way merge */
while (MORE_NAMES_EXIST)
    /* find subscript of name that has the lowest collating
       sequence value among the names available on the K lists */
    LOWEST := 1
    for i := 2 to K
        if (name[i] < name[LOWEST])
            LOWEST := i
    next

    write name[LOWEST] to OUT_FILE

    /* now replace the name that was written out */
    call input() to get name[LOWEST] from list[LOWEST]
endwhile
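For readers who prefer a compilable version, here is a C sketch of the loop in Fig. 7.17. The in-memory string arrays standing in for the K input files, and the HIGH_VALUE sentinel used to mark an exhausted list, are assumptions made for this example.

    #include <stdio.h>
    #include <string.h>

    #define K 4
    #define HIGH_VALUE "\x7f"

    static const char *list[K][4] = {            /* four tiny, sorted "files"     */
        { "adams", "carter", "dole",  HIGH_VALUE },
        { "baker", "evans",  HIGH_VALUE },
        { "cohen", "frank",  "gupta", HIGH_VALUE },
        { "drake", HIGH_VALUE },
    };

    int main(void)
    {
        const char *name[K];
        int pos[K] = {0}, i;

        for (i = 0; i < K; i++)                   /* initialize: first key of each */
            name[i] = list[i][pos[i]];

        for (;;) {
            int lowest = 0;
            for (i = 1; i < K; i++)               /* linear scan: K - 1 compares   */
                if (strcmp(name[i], name[lowest]) < 0)
                    lowest = i;
            if (strcmp(name[lowest], HIGH_VALUE) == 0)
                break;                            /* every list is exhausted       */
            printf("%s\n", name[lowest]);
            pos[lowest]++;                        /* advance only the winning list */
            name[lowest] = list[lowest][pos[lowest]];
        }
        return 0;
    }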


FIGURE 7.18 Use of a selection tree to assist in the selection of a key with minimum value in a K-way merge. The eight sorted input lists feeding the tree begin as follows:

    List 0:   7, 10, 17, ...        List 4:  12, 14, 21, ...
    List 1:   9, 19, 23, ...        List 5:   5,  6, 25, ...
    List 2:  11, 13, 32, ...        List 6:  15, 20, 30, ...
    List 3:  18, 22, 24, ...        List 7:   8, 16, 29, ...

The selection tree is a kind of tournament tree in which each higher-level node represents the "winner" (in this case the minimum key value) of the comparison between the two descendent keys. The minimum value is at the root node of the tree. If each key has an associated reference to the list from which it came, it is a simple matter to take the key at the root, read the next element from the associated list, and then run the tournament again. Since the tournament tree is a binary tree, its depth is always ⌈log2 K⌉ for a merge of K lists. The number of comparisons required to establish a new tournament winner is, of course, related to this depth, rather than being a linear function of K.
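The tournament tree itself is not reproduced here; as a stand-in with the same roughly log2 K cost per selection, the following C sketch keeps the current key from each run in a binary min-heap of (key, run) pairs. The structure and function names are invented, and the initial keys are simply the first keys of the eight lists in Fig. 7.18.

    #include <stdio.h>

    struct node { int key; int run; };            /* current key and its source run */

    static void sift_down(struct node h[], int n, int i)
    {
        for (;;) {
            int c = 2 * i + 1;                    /* left child (0-based array)     */
            if (c >= n) return;
            if (c + 1 < n && h[c + 1].key < h[c].key)
                c++;                              /* pick the smaller child         */
            if (h[i].key <= h[c].key) return;
            struct node t = h[i]; h[i] = h[c]; h[c] = t;
            i = c;
        }
    }

    /* Replace the winner at the root with the next key from the same run and
       restore order: about log2(K) comparisons instead of K - 1. */
    static void replace_root(struct node h[], int n, int next_key)
    {
        h[0].key = next_key;
        sift_down(h, n, 0);
    }

    int main(void)
    {
        /* keys currently at the front of the eight runs of Fig. 7.18 */
        struct node h[8] = { {7,0},{9,1},{11,2},{18,3},{12,4},{5,5},{15,6},{8,7} };
        int n = 8, i;
        for (i = n / 2 - 1; i >= 0; i--)          /* heapify the initial keys       */
            sift_down(h, n, i);
        printf("first winner: key %d from run %d\n", h[0].key, h[0].run);
        replace_root(h, n, 6);                    /* next key from that run is 6    */
        printf("next  winner: key %d from run %d\n", h[0].key, h[0].run);
        return 0;
    }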

7.4 A Second Look at Sorting in RAM

In Chapter 5 we considered the problem of sorting a disk file that is small enough to fit into RAM. The operation we described involves three separate steps:

1. Read the entire file from disk into RAM.
2. Sort the records using a standard sorting procedure, such as Shellsort.
3. Write the file back to disk.

The total time taken for sorting the file is the sum of the times for the three steps. We see that this procedure is much faster than sorting the file in place, on the disk, because both reading and writing are sequential.


Can we improve on the time that it takes for this RAM sort? If we assume that we are reading and writing the file as efficiently as possible, and we have chosen the best internal sorting routine available, it would seem not. Fortunately, there is one way that we might speed up an algorithm that has several parts, and that is to perform some of those parts in parallel.

Of the three operations involved in sorting a file that is small enough to fit into RAM, is there any way to perform some of them in parallel? If we have only one disk drive, clearly we cannot overlap the reading and writing operations, but how about doing either the reading or writing (or both) at the same time that we sort the file?

7.4.1 Overlapping Processing and I/O: Heapsort

Most of the time when we use an internal sort we have to wait until we have the whole file in memory before we can start sorting. Is there an internal sorting algorithm that is reasonably fast and that can begin sorting numbers immediately as they are read in, rather than waiting for the whole file to be in memory? In fact, there is, and we have already seen part of it in this chapter. It is called heapsort, and it is loosely based on the same principle as the selection tree.


Recall that the selection tree compares keys as

time

key

a
it

new key

arrives,

it is

compared

goes to the front of the

because

it

means

that

we

tree.

is,

This

is

encounters them. Each

file is

and

if

it is

the largest

very useful for our purposes

can begin sorting keys

rather than waiting until the entire

That

it

to the others,

as

they arrive in

loaded before

we

RAM,

start sorting.

sorting can occur in parallel with reading.

Unfortunately, in the case of the selection tree, each time a new largest
key is found it is output to the file. We cannot allow this to happen if we
want to sort the whole file because we cannot begin outputting records until
we know which one comes first, second, etc., and we won't know this until
we have seen all of the keys.
Heapsort solves this problem by keeping all of the keys in a structure
called a heap. A heap is a binary tree with these properties:

Each node has a single key, and that key is less than or equal to the
key at its parent node.
i\2j It is a complete binary tree, which means that all of its leaves are on
no more than two levels, and all of the keys on the lowest level are
'*~\.)

^_

in the leftmost position.

Because of properties 1 and 2, storage for the tree can be allocated


sequentially as an array in such a way that the indices of the left and
right children of node i are 2i and 2i + 1, respectively. Conversely,
the index of the parent of node j is |_j/2j.

FIGURE 7.19 A heap in both its tree form and as it would be stored in an array.

Figure 7.19 shows a heap in both its tree form and as it would be stored in an array. Note that this is only one of many possible heaps for the given set of keys. In practice, each key has an associated record that is either stored in the array with the key or pointed to by a pointer stored with the key.

Property 3 is very useful for our purposes, because it means that a heap is just an array of keys, where the positions of the keys in the array are sufficient to impose an ordering on the entire set of keys. There is no need for pointers or other dynamic data structuring overhead to create and maintain the heap. (As we pointed out earlier, there may be pointers associating each key with its corresponding record, but this has nothing to do with maintaining the heap itself.)
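In C terms, property 3 reduces navigation of the tree to index arithmetic. The macro and function names below are illustrative; the array is assumed to keep the root in position 1, as in Fig. 7.19.

    #define LEFT(i)    (2 * (i))
    #define RIGHT(i)   (2 * (i) + 1)
    #define PARENT(j)  ((j) / 2)       /* integer division gives the floor */

    /* Returns 1 if keys[1..n] (element 0 unused) satisfies the heap property. */
    int is_heap(const char keys[], int n)
    {
        int j;
        for (j = 2; j <= n; j++)
            if (keys[j] < keys[PARENT(j)])
                return 0;
        return 1;
    }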

7.4.2 Building the Heap while Reading in the File

The algorithm for heapsort has two parts. First we build the heap, and then we output the keys in sorted order. The first stage can occur at virtually the same time that we read in the data, so in terms of computer time it comes essentially free.

The basic steps in the algorithm for building the heap are shown in Fig. 7.20. Figure 7.21 contains a sample application of this algorithm.

This describes how we build the heap, but it doesn't tell how to make the input overlap with the heap-building procedure. To solve that problem, we need to look at how we perform the read operation.

FIGURE 7.20 Procedure for building a heap.

For i := 1 to RECORD_COUNT
    Read in the next record and append it to the end of the
        array; call its key K
    While K is less than the key of its parent:
        Exchange the record with key K with its parent
next
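A C rendering of the procedure in Fig. 7.20 might look like the sketch below, which inserts the keys of Fig. 7.21 and prints the final heap. The 1-based array convention of property 3 is kept by leaving element 0 unused; the names are illustrative, and records are reduced to single-character keys.

    #include <stdio.h>

    static void insert_key(char heap[], int n, char k)   /* n = new heap size */
    {
        heap[n] = k;                              /* append at the end         */
        while (n > 1 && heap[n] < heap[n / 2]) {  /* exchange upward while the */
            char tmp = heap[n];                   /* key is less than its      */
            heap[n] = heap[n / 2];                /* parent                    */
            heap[n / 2] = tmp;
            n = n / 2;
        }
    }

    int main(void)
    {
        char heap[10];                            /* element 0 unused          */
        const char *keys = "FDCGHIBEA";           /* insertion order of Fig. 7.21 */
        int n;
        for (n = 1; keys[n - 1] != '\0'; n++)
            insert_key(heap, n, keys[n - 1]);
        for (int i = 1; i < n; i++)
            putchar(heap[i]);                     /* prints ABCEHIDGF          */
        putchar('\n');
        return 0;
    }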


New key to      Heap, after insertion of the new key
be inserted     1 2 3 4 5 6 7 8 9

F               F
D               D F
C               C F D
G               C F D G
H               C F D G H
I               C F D G H I
B               B F C G H I D
E               B E C F H I D G
A               A B C E H I D G F

FIGURE 7.21 Sample application of the heap-building algorithm. The keys F, D, C,
G, H, I, B, E, and A are added to the heap in the order shown.

For starters, we are not going to do a seek every time we want a new record. Instead, we read a block of records at a time into an input buffer, and then operate on all of the records in the block before going on to the next block. In terms of RAM storage, the input buffer for each new block of keys can be part of the RAM area that is set up for the heap itself. Each time we read in a new block, we append it to the end of the heap (i.e., the input buffer "moves" as the heap gets larger). The first new record is then at the end of the heap array, as required by the algorithm (Fig. 7.20). Once that record is absorbed into the heap, the next new record is at the end of the heap array, ready to be absorbed into the heap, and so forth.

Use of an input buffer avoids doing an excessive number of seeks, but it still doesn't let input occur at the same time that we build the heap.


We just saw in Chapter 3 that the way to make processing overlap with I/O is to use more than one buffer. With multiple buffering, as we process the keys in one block from the file, we can simultaneously be reading in later blocks from the file. If we use multiple buffers, how many should we use, and where should we put them? We already answered these questions when we decided to put each new block at the end of the array. Each time we add a new block, the array gets bigger by the size of that block, in effect creating a new input buffer for each block in the file. So the number of buffers is the number of blocks in the file, and they are located in sequence in the array itself.

Figure 7.22 illustrates the technique that we have just described, where we append each new block of records to the end of the heap, thereby employing a RAM-sized set of input buffers. Now we read in new blocks as fast as we can, never having to wait for processing before reading in a new block. On the other hand, processing (heap building) cannot occur on a given block until the block to be processed is read in, so there may be some delay in processing if processing speeds are faster than reading speeds.

7.4.3 Sorting while Writing out to the File

The second and final step involves writing out the heap in sorted order. Again, it is possible to overlap I/O (in this case writing) with processing. First, let's look at the algorithm for outputting the sorted keys (Fig. 7.23). Again, there is nothing inherent in this algorithm that lets it overlap with I/O, but we can take advantage of certain features of the algorithm to make overlapping happen. First, we see that we know immediately which record will be written first in the sorted file; next, we know what will come second; and so forth. So as soon as we have identified a block of records, we can write out that block, and while we are writing out that block we can be identifying the next block, and so forth.

Furthermore, each time we identify a block to write out, we make the heap smaller by exactly the size of a block, freeing that space for a new output buffer. So just as was the case when building the heap, we can have as many output buffers as there are blocks in the file. Again, a little coordination is required between processing and output, but the conditions exist for the two to overlap almost completely.

A final point worth making about this algorithm is that all I/O that it performs is essentially sequential. All records are read in in the order in which they occur in the file to be sorted, and all records are written out in sorted order. The technique could work equally well if the file were kept on tape or disk. More importantly, since all I/O is sequential, we know that it can be done with a minimum amount of seeking.


FIGURE 7.22 Illustration of the technique described in the text for overlapping input with heap building in RAM. First read in a block into the first part of RAM. The first record is the first record in the heap. Then extend the heap to include the second record, and incorporate that record into the heap, and so forth. While the first block is being processed, read in the second block. When the first block is a heap, extend it to include the first record in the second block, incorporate that record into the heap, and go on to the next record. Continue until all blocks are read in and the heap is completed.

The figure divides the total RAM area allocated for the heap into successive input buffers: the first part of the heap is built in the first buffer while the second buffer is being filled, the second part of the heap is built in the second buffer while the third buffer is being filled, and so on.

FIGURE 7.23 Procedure for outputting the contents of a heap in sorted order.

For i := 1 to RECORD_COUNT
    Output the record in the first position in the array (this
        record has the smallest key).
    Move the key in the last position in the array (call it K)
        to the first position, and define the heap as having one
        fewer member than it previously had.
    While K is larger than both keys of its children:
        Exchange K with the smaller of its two children's keys
next
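The corresponding C sketch of Fig. 7.23 is shown below; it repeatedly hands back the smallest key from the front of the array, moves the last key to the front, shrinks the heap, and sifts that key down past its smaller child. The names and the sample heap (the result of Fig. 7.21) are illustrative.

    #include <stdio.h>

    static char remove_min(char heap[], int *n)
    {
        char smallest = heap[1];
        char k = heap[*n];                  /* last key moves to the front     */
        int i = 1, c;
        (*n)--;                             /* heap now has one fewer member   */
        while ((c = 2 * i) <= *n) {
            if (c + 1 <= *n && heap[c + 1] < heap[c])
                c++;                        /* pick the smaller child          */
            if (k <= heap[c])
                break;
            heap[i] = heap[c];              /* child moves up                  */
            i = c;
        }
        heap[i] = k;
        return smallest;
    }

    int main(void)
    {
        char heap[10] = { 0, 'A','B','C','E','H','I','D','G','F' };
        int n = 9;
        while (n > 0)
            putchar(remove_min(heap, &n));  /* prints ABCDEFGHI                */
        putchar('\n');
        return 0;
    }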

7.5 Merging as a Way of Sorting Large Files on Disk

In Chapter 5 we ran into problems when we needed to sort files that were too large to be wholly contained in RAM. The chapter offered a partial, but ultimately unsatisfactory, solution to this problem in the form of a keysort, in which we needed to hold only the keys in RAM, along with pointers to each key's corresponding record. Keysort had two shortcomings:

1. Once the keys were sorted, we then had to bear the substantial cost of seeking to each record in sorted order, reading each record in and then writing it out into the new, sorted file.

2. With keysorting, the size of the file that can be sorted is limited by the number of key/pointer pairs that can be contained in RAM. Consequently, we still cannot sort really large files.

As an example of the kind of file we cannot sort with either a RAM sort or a keysort, suppose we have a file with 800,000 records, each of which is 100 bytes long and contains a key field that is 10 bytes long. The total length of this file is about 80 megabytes. Let us further suppose that we have one megabyte of RAM available as a work area, not counting RAM used to hold the program, operating system, I/O buffers, and so forth. Clearly, we cannot sort the whole file in RAM. We cannot even sort all the keys in RAM.
The multiway merge algorithm discussed in section 7.3 provides the beginning of an attractive solution to the problem of sorting large files such as this one. Since sorting algorithms such as heapsort can work in place, using only a small amount of overhead for maintaining pointers and some temporary variables, we can create a sorted subset of our full file by reading records into RAM until the RAM work area is almost full, sorting the records in this work area, and then writing the sorted records back to disk as a sorted subfile. We call such a sorted subfile a run. Given the memory constraints and record size in our example, a run could contain approximately

    1,000,000 bytes of RAM / 100 bytes per record = 10,000 records.

Once we create the first run, we then read in a new set of records, once again filling RAM, and create another run of 10,000 records. In our example, we repeat this process until we have created 80 runs, with each run containing 10,000 sorted records.

Once we have the 80 runs in 80 separate files on disk, we can perform an 80-way merge of these runs, using the multiway merge logic outlined in section 7.3, to create a completely sorted file containing all the original records. A schematic view of this run creation and merging process is provided in Fig. 7.24.

FIGURE 7.24 Sorting through the creation of runs (sorted subfiles) and subsequent merging of runs:

    800,000 unsorted records
        --> 80 internal sorts -->
    80 runs, each containing 10,000 sorted records
        --> merge -->
    800,000 records in sorted order


This solution to our sorting problem has the following features:

- It can, in fact, sort large files, and can be extended to files of any size.
- Reading of the input file during the run creation step is sequential, and hence is much faster than input that requires seeking for every record individually (as in a keysort).
- Reading through each run during merging and writing out the sorted records is also sequential. Random accesses are required only as we switch from run to run during the merge operation.
- If a heapsort is used for the in-RAM part of the merge, as described in section 7.4, we can overlap these operations with I/O, so the in-RAM part does not add appreciably to the total time for the merge.
- Since I/O is largely sequential, tapes can be used if necessary for both input and output operations.

7.5.1 How Much Time Does a Merge Sort Take?

This general approach to the problem of sorting large files looks promising. To compare this approach to others, we now look at how much time it takes. We do this by taking our 800,000-record example file and seeing how long it takes to do a merge sort on the hypothetical disk drive whose specifications are listed in Table 3.2. (Please note that our intention here is not to derive time estimates that mean anything in any environment other than the hypothetical environment we have posited. Nor do we want to overwhelm you with numbers or provide you with magic formulas for determining how long a particular sort on a real system will really take. Rather, our goal in this section is to derive some benchmarks that we can use to compare several variations on the basic merge sort approach to sorting external files.)

We can simplify matters by making the following assumptions about the computing environment:
- Entire files are always stored in contiguous areas on disk (extents), and a single cylinder-to-cylinder seek takes no time. Hence, only one seek is required for any single sequential access.
- Extents that span more than one track are physically staggered in such a way that only one rotational delay is required per access.

We see in Fig. 7.24 that there are four times when I/O is performed. During the sort phase:

1. Reading all records into RAM for sorting and forming runs; and
2. Writing sorted runs out to disk.

During the merge phase:

3. Reading sorted runs into RAM for merging; and
4. Writing the sorted file out to disk.

Let's look at each of these in order.

Step 1: Reading Records into RAM for Sorting and Forming Runs

Since we sort the file in one-megabyte chunks, we read in one megabyte at a time from the file. In a sense, RAM is a one-megabyte input buffer that we fill up 80 times to form the 80 runs.

In computing the total time to input each run, we need to include the amount of time it takes to access each block (seek time + rotational delay), plus the amount of time it takes to transfer each block. We keep these two times separate because, as we see later in our calculations, the role that each plays can vary significantly depending on the approach used.

From Table 3.2 we see that seek and rotational delay times are 18 msec** and 8.3 msec, respectively, so total time per seek is 26.3 msec.* The transmission rate is approximately 1,229 bytes per msec. Total input time for the sort phase consists of the time required for 80 seeks, plus the time required to transfer 80 megabytes:

    Access:     80 seeks x 26.3 msec            =  2 seconds
    Transfer:   80 megabytes / 1,229 bytes/msec = 65 seconds
    Total:                                        67 seconds

Step 2: Writing Sorted Runs out to Disk

In this case, writing is just the reverse of reading: the same number of seeks and the same amount of data to transfer. So it takes another 67 seconds to write out the 80 sorted runs.

Step 3: Reading Sorted Runs into RAM for Merging

Since we have one megabyte of RAM for storing runs, we divide one megabyte into 80 parts for buffering the 80 runs. In a sense, we are reallocating our one megabyte of RAM as 80 input buffers. Each of the 80 buffers then holds 1/80th of a run (12,500 bytes), so we have to access each run 80 times to read all of it. Since there are 80 runs, to complete the merge operation (Fig. 7.25) we end up making

    80 runs x 80 seeks = 6,400 seeks.

Total seek and rotation time is then 6,400 x 26.3 msec = 168 seconds. Since 80 megabytes is still transferred, transfer time is still 65 seconds.

**Unless the computing environment has many active users pulling the read/write head to other parts of the disk, seek time is actually likely to be less than the average, since many of the blocks that make up the file are probably going to be physically adjacent to one another on the disk. Many will be on the same cylinder, requiring no seeks at all. However, for simplicity we assume the average seek time.

*For simplicity, we use the term seek even though we really mean seek and rotational delay. Hence, the time we give for a seek is the time that it takes to perform an average seek followed by an average rotational delay.

FIGURE 7.25 Effect of buffering on the number of seeks required, where each run is as large as the available work area in RAM:

    1st run  = 80 buffers' worth (80 accesses)
    2nd run  = 80 buffers' worth (80 accesses)
    . . .
    80th run = 80 buffers' worth (80 accesses)
                --> 800,000 sorted records

Step 4: Writing the Sorted File out to Disk

To compute the time for writing out the file, we need to know how big our output buffers are. Unlike steps 1 and 2, where our big RAM sorting space doubled as our I/O buffer, we are now using that RAM space for storing the data from the runs before it is actually merged. To keep matters simple, let us assume that we can allocate two 20,000-byte output buffers.† With 20,000-byte buffers, we need to make

    80,000,000 bytes / 20,000 bytes per seek = 4,000 seeks.

Total seek and rotation time is then 4,000 x 26.3 msec = 105 seconds. Transfer time is still 65 seconds.

The time estimates for the four steps are summarized in the first row in Table 7.1. The total time for this merge sort is 537 seconds, or 8 minutes, 57 seconds. The sort phase takes 134 seconds, and the merge phase takes 403 seconds.

To gain an appreciation of the improvement that this merge sort approach provides us, we need only look at how long it would take us to do one part of a nonmerging method like the keysort method described in Chapter 5.

†We use two buffers to allow double buffering; we use 20,000 bytes per buffer because that is approximately the size of a track on our hypothetical disk drive.


TABLE 7.1 Time estimates for merge sort of an 80-megabyte file, assuming use of the hypothetical disk drive described in Table 3.2. The total time for the sort phase (steps 1 and 2) is 134 seconds, and the total time for the merge phase is 403 seconds.

                  Number     Amount        Seek + Rotation   Transfer    Total
                  of         Transferred   Time              Time        Time
                  Seeks      (Megabytes)   (Seconds)         (Seconds)   (Seconds)

Sort: reading         80          80              2              65          67
Sort: writing         80          80              2              65          67
Merge: reading     6,400          80            168              65         233
Merge: writing     4,000          80            105              65         170
Totals            10,560         320            277             260         537
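The arithmetic behind Table 7.1 is simple enough to recompute. The short C program below does so from the two drive parameters quoted above (26.3 msec per average seek plus rotational delay, and roughly 1,229 bytes per msec of transfer); small rounding differences from the table are to be expected.

    #include <stdio.h>

    int main(void)
    {
        const double seek_ms = 26.3, bytes_per_ms = 1229.0;
        const double file_bytes = 80.0e6;
        const double xfer_s = file_bytes / bytes_per_ms / 1000.0;   /* ~65 s per pass */

        double sort_read   =   80.0 * seek_ms / 1000.0 + xfer_s;    /* ~67 s  */
        double sort_write  =   80.0 * seek_ms / 1000.0 + xfer_s;    /* ~67 s  */
        double merge_read  = 6400.0 * seek_ms / 1000.0 + xfer_s;    /* ~233 s */
        double merge_write = 4000.0 * seek_ms / 1000.0 + xfer_s;    /* ~170 s */

        printf("sort phase : %4.0f s\n", sort_read + sort_write);
        printf("merge phase: %4.0f s\n", merge_read + merge_write);
        printf("total      : %4.0f s\n",
               sort_read + sort_write + merge_read + merge_write);
        return 0;
    }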

The last part of the keysort algorithm (Fig. 5.17) consists of this for loop:

    /* read in records according to sorted order, and write them */
    /* out in this order                                         */
    for i := 1 to REC_COUNT
        seek in IN_FILE to record with RRN of KEYNODES[i].RRN
        read the record into BUFFER from IN_FILE
        write BUFFER contents to OUT_FILE

This for loop requires us to do a separate seek for every record in the file. That is 800,000 seeks. At 26.3 msec per seek, the total time required to perform that one operation works out to 21,040 seconds, or 5 hours, 50 minutes, 40 seconds!

Clearly, for large files the merge sort approach in general is the best option of any that we have seen. Does this mean that we have found the best technique for sorting large files? If sorting is a relatively rare event and files are not too large, the particular approach to merge sorting that we have just looked at produces acceptable results. Let's see how those results stand up as we change some of the parameters of our sorting example.

7.5.2 Sorting a File That Is Ten Times Larger

The first question that comes to mind when we ask about the general applicability of a computing technique is, What happens when we make the problem bigger? In this instance, we need to ask how this approach stands up as we scale up the size of the file.

Before we look at how a bigger file affects the performance of our merge sort, it will help to examine the kinds of I/O that are being done in the two different phases, the sort phase and the merge phase. We will see that for the purposes of finding ways to improve on our original approach, we need pay attention only to one of the two phases.

A major difference between the sort phase and the merge phase is in the amount of sequential (vs. random) access that each performs. By using heapsort to create runs during the sort phase, we guarantee that all I/O is, in a sense, sequential.† Since sequential access implies minimal seeking, we cannot algorithmically speed up I/O during the sort phase. No matter what we do with the records in the file, we have to read them and write them all at least once. Since we cannot improve on this phase by changing the way we do the sort or merge, we ignore the sort phase in the analysis that follows.

The merge phase is a different matter. In particular, the reading step of the merge phase is different. Since there is a RAM buffer for each run, and these buffers get loaded and reloaded at unpredictable times, the read step of the merge phase is to a large extent one in which random accesses are the norm. Furthermore, the number and size of the RAM buffers that we read the run data into determine the number of times we have to do random accesses. If we can somehow reconfigure these buffers in ways that reduce the number of random accesses, we can speed up I/O correspondingly. So, if we are going to look for ways to improve performance in a merge sort algorithm, our best hope is to look for ways to cut down on the number of random accesses that occur while reading runs during the merge phase.

What about the write step of the merge phase? Like the steps of the sort phase, this step is not influenced by differences in the way we organize runs. Improvements in the way we organize the merge sort do not affect this step. On the other hand, we will see later that it is helpful to include this phase when we measure the results of changes in the organization of the merge sort.

To sum up, since the merge phase is the only one in which we can improve performance by improving the method, we concentrate on it from now on. Now let's get back to the question that we started this section with: What happens when we make the problem bigger? How, for instance, is the time for the merge phase affected if our file is 8,000,000 records rather than 800,000?

†It is not sequential in the sense that in a multiuser environment there will be other users pulling the read/write head to other parts of the disk between reads and writes, possibly forcing the disk to do a seek each time it reads or writes a block.


TABLE 7.2 Time estimates for merge phase of sort of an 800-megabyte file, assuming use of the hypothetical disk drive described in Table 3.2. The total time for the merge phase is 19,186 seconds, or 5 hours, 19 minutes, 22 seconds.

                  Number     Amount        Seek + Rotation   Transfer    Total
                  of         Transferred   Time              Time        Time
                  Seeks      (Megabytes)   (Seconds)         (Seconds)   (Seconds)

Merge: reading   640,000         800           16,832            651      17,483
Merge: writing    40,000         800            1,050            651       1,703
Totals           680,000       1,600           17,882          1,302      19,186

If we increase the size of our file by a factor of 10 without increasing the RAM space, we clearly need to create more runs. Instead of 80 initial 10,000-record runs, we now have 800 runs. This means we have to do an 800-way merge in our one megabyte of RAM space. This, in turn, means that during the merge phase we must divide RAM into 800 buffers. Each of the 800 buffers holds 1/800th of a run, so we would end up making 800 seeks per run, and

    800 runs x 800 seeks/run = 640,000 seeks altogether.

The times for the merge phase are summarized in Table 7.2. Note that the total time is over 5 hours and 19 minutes, almost 50 times greater than for the 80-megabyte file. By increasing the size of our file, we have gotten ourselves back into the situation we had with keysort, where we can't do the job we need to do without doing a huge amount of seeking. In this instance, by increasing the order of the merge from 80 to 800, we made it necessary to divide our one-megabyte RAM area into 800 tiny buffers for doing I/O, and because the buffers are tiny each requires many seeks to process its corresponding run.

If we want to improve performance, clearly we need to look for ways to improve on the amount of time spent getting to the data during the merge phase. We will do this shortly, but first let us generalize what we have just observed.

7.5.3 The Cost of Increasing the File Size

Obviously, the big difference between the time it took to merge the 80-megabyte file and the 800-megabyte file was due to the difference in total seek and rotational delay times. You probably noticed that the number of
seeks for the larger file is 100 times the number of seeks for the first file, and 100 is the square of the difference in size between the two files. We can formalize this relationship as follows: In general, for a K-way merge of K runs where each run is as large as the RAM space available, the buffer size for each of the runs is

    (1/K) x size of RAM space = (1/K) x size of each run,

so K seeks are required to read in all of the records in each individual run. Since there are K runs altogether, the merge operation requires K² seeks. Hence, measured in terms of seeks, our sort merge is an O(K²) operation. Since K is directly proportional to N (if we increase the number of records from 800,000 to 8,000,000, K increases from 80 to 800), it also follows that our sort merge is an O(N²) operation, measured in terms of seeks.
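A few lines of C make the K-squared growth easy to see for the two cases worked through above; the loop bounds and the 26.3-msec figure are taken from those examples.

    #include <stdio.h>

    int main(void)
    {
        const double seek_s = 0.0263;             /* 26.3 msec per seek          */
        int k;
        for (k = 80; k <= 800; k *= 10) {
            long seeks = (long)k * k;             /* K seeks for each of K runs  */
            printf("K = %3d:  %7ld seeks,  about %6.0f s of seek time\n",
                   k, seeks, seeks * seek_s);
        }
        return 0;
    }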
This brief, formal look establishes the principle that as files grow large, we can expect the time required for our merge sort to increase rapidly. It would be very nice if we could find some ways to reduce this time.

Fortunately, there are several:

- Allocate more hardware, such as disk drives, RAM, and I/O channels;
- Perform the merge in more than one step, reducing the order of each merge and increasing the buffer size for each run;
- Algorithmically increase the lengths of the initial sorted runs; and
- Find ways to overlap I/O operations.

In the following sections we look at each of these in detail, beginning with the first: Invest in more hardware.

7.5.4 Hardware-based Improvements

We have seen that changes in our sorting algorithm can improve performance. Likewise, there are changes that we can make in our hardware that will also improve performance. In this section we look at three possible changes to a system configuration that could lead to substantial decreases in sort time:

- Increasing the amount of RAM;
- Increasing the number of disk drives; and
- Increasing the number of I/O channels.

RAM

Increasing the Amount of


It should be clear now that when we
have to divide limited buffer space into many small buffers, we increase

294

COSEQUENTIAL PROCESSING AND THE SORTING OF LARGE FILES

overwhelm all other sorting


number of seeks is
file size, given a fixed amount

seek and rotation times to the point where they


operations.

Roughly speaking,

the increase in the

proportional to the square of the increase in

of

total buffer space.


It

RAM space ought to have a


A larger RAM size means longer and

stands to reason, then, that increasing

substantial effect

on

total sorting time.

fewer initial runs during the sort phase, and it means fewer seeks per run
during the merge phase. The product of fewer runs and fewer seeks per run

means

a substantial

reduction in total seeks.

with our 8,000,000-record file, which took


about 5 hours, 20 minutes using one megabyte of RAM. Suppose we are
able to obtain 4 megabytes of
buffer space for our sort. Each of the
Let's test this conclusion

RAM

would

from 10,000 records to 40,000 records, resulting


in 200 40,000-record runs. For the merge phase, the internal buffer space
would be divided into 200 buffers, each capable of holding 1 /200th of a run,
meaning that there would be 200 X 200 = 40,000 seeks. Using the same
time estimates that we used for the previous two cases, the total time for
this merge is 56 minutes, 45 seconds, nearly a sixfold improvement.
initial

runs

increase

Number

of Dedicated Disk Drives If we could have a


no other users contending for use
of the same read/write heads, there would be no delay due to seek time after
the original runs are generated. The primary source of delay would now be
rotational delays and transfers, which would occur every time a new block

Increasing the

separate read/write head for every run and

had to be read in.


For example, if each run is on a separate, dedicated drive, our 800-way
merge calls for only 800 seeks (one seek per run), down from 640,000, and
cutting the total seek and rotation times from 11,500 seconds to 14 seconds.
Of course we can't configure 800 separate disk drives every time we want
to do a sort, but perhaps something short of this is possible. For instance,
if we had two disk drives to dedicate for the merge, we could assign one to
input and the other to output, so reading and writing could overlap
whenever they occurred simultaneously. (This approach takes some clever
buffer management, however. We discuss this later in this chapter.)
Increasing the Number of I/O Channels   If there is only one I/O channel, then no two transmissions can occur at the same time, and the total transmission time is the one we have computed. But if there is a separate I/O channel for each disk drive, I/O can overlap completely.

For example, if for our 800-way merge there are 800 channels from 800 disk drives, then transmissions can overlap completely. Practically speaking, it is unlikely that 800 channels and 800 disk drives are available, and even if they were, it is unlikely that all transmissions would overlap because all buffers would not need to be refilled at one time. Nevertheless, increasing the number of I/O channels could improve transmission time substantially.

So we see that there are ways to improve performance if we have some control over how our hardware is configured. In those environments in which external sorting occupies a large percentage of computing time, we are likely to have at least some such control. On the other hand, many times we are not able to expand a system specifically to meet sorting needs that we might have. When this is the case, we need to look for algorithmic ways to improve performance, and this is what we do now.

7.5.5 Decreasing the Number of Seeks Using Multiple-step Merges

One of the hallmarks of a solution to a file structure problem, as opposed to the solution of a mere data structure problem, is the attention given to the enormous difference in cost between accessing information on disk and accessing information in RAM. If our merging problem involved only RAM operations, the relevant measure of work, or expense, would be the number of comparisons required to complete the merge. The merge pattern that would minimize the number of comparisons for our sample problem, in which we want to merge 800 runs, would be the 800-way merge we have already considered. Looked at from a point of view that ignores the cost of seeking, this K-way merge has the following desirable characteristics:

Each record is read only once.
If a selection tree is used for the comparisons performed in the merging operation, as described in section 7.3, then the number of comparisons required for a K-way merge of N records (total) is a function of N x log K.
Since K is directly proportional to N, this is an O(N log N) operation (measured in numbers of comparisons), which is to say that it is reasonably efficient even as N grows large.

This would all be very good news were we working exclusively in RAM, but the very purpose of this merge sort procedure is to be able to sort files that are too large to fit into RAM. Given the task at hand, the costs associated with disk seeks are orders of magnitude greater than the costs of operations in RAM. Consequently, if we can sacrifice the advantages of an 800-way merge, trading them for savings in access time, we may be able to obtain a net gain in performance.
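Before turning to that trade-off, it may help to see the K-way merge itself in code. The sketch below is ours, not the text's: it merges K sorted runs held in arrays, using a simple linear scan to find the smallest current key, so it makes K comparisons per record rather than the roughly log2 K that a selection tree would need. The data values are made up for the example.

    /* Minimal K-way merge of K sorted integer arrays (linear scan for the
     * smallest current key; a selection tree would reduce the comparisons). */
    #include <stdio.h>

    #define K 3

    void kway_merge(const int *run[], const int len[], int out[])
    {
        int pos[K] = {0};
        int written = 0, total = 0;

        for (int i = 0; i < K; i++)
            total += len[i];

        while (written < total) {
            int min_run = -1;
            for (int i = 0; i < K; i++)          /* scan for the smallest key */
                if (pos[i] < len[i] &&
                    (min_run < 0 || run[i][pos[i]] < run[min_run][pos[min_run]]))
                    min_run = i;
            out[written++] = run[min_run][pos[min_run]++];
        }
    }

    int main(void)
    {
        int r0[] = {5, 12, 47}, r1[] = {7, 16, 21}, r2[] = {14, 17, 67};
        const int *runs[K] = {r0, r1, r2};
        int lens[K] = {3, 3, 3}, out[9];

        kway_merge(runs, lens, out);
        for (int i = 0; i < 9; i++)
            printf("%d ", out[i]);               /* 5 7 12 14 16 17 21 47 67 */
        printf("\n");
        return 0;
    }

Note that each record is touched exactly once on output, which is the first of the desirable properties listed above.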

We have seen that one of the keys to reducing seeks is to reduce the number of runs that we have to merge, thereby giving each run a bigger share of available buffer space. In the previous section we accomplished this by adding more memory. Multiple-step merging provides a way for us to apply the same principle without having to go out and buy more memory.

In multiple-step merging, we do not try to merge all runs at one time. Instead, we break the original set of runs into small groups and merge the runs in these groups separately. On each of these smaller merges, more buffer space is available for each run, and hence, fewer seeks are required per run. When all of the smaller merges are completed, a second pass merges the new set of merged runs.

It should be clear that this approach will lead to fewer seeks on the first pass, but now there is a second pass. Not only are a number of seeks required for reading and writing on the second pass, but extra transmission time is used in reading and writing all records in the file. Do the advantages of the two-pass approach outweigh these extra costs? Let's revisit the merge step of our 8-million record sort to find out.

Recall that we began with 800 runs of 10,000 records each. Rather than merging all 800 runs at once, we could merge them as, say, 25 sets of 32 runs each, followed by a 25-way merge of the intermediate runs. This scheme is illustrated in Fig. 7.26.

FIGURE 7.26 Two-step merge of 800 runs: 25 sets of 32 runs each, followed by a 25-way merge of the resulting intermediate runs.

When compared to our original 800-way merge, this approach has the disadvantage of requiring that we read every record twice: once to form the intermediate runs and then again to form the final sorted file. But, since each step of the merge is reading from 25 input files at a time, we are able to use larger buffers and avoid a large number of disk seeks. When we analyzed the seeking required for the 800-way merge, disregarding seeking for the output file, we calculated that the 800-way merge involved 640,000 seeks between the input files. Let's perform similar calculations for our multistep merge.

First Merge Step   For each of the 32-way merges of the initial runs, each input buffer can hold 1/32 run, so we end up making 32 x 32 = 1,024 seeks. For all 25 of the 32-way merges, we make 25 x 1,024 = 25,600 seeks. Each of the resulting runs is 320,000 records, or 32 megabytes.

Second Merge Step   For each of the 25 final runs, 1/25 of the total buffer space is allocated, so each input buffer can hold 400 records, or 1/800 run. Hence, in this step there are 800 seeks per run, so we end up making 25 x 800 = 20,000 seeks. The total number of seeks for the two steps is 25,600 + 20,000 = 45,600.

So, by accepting the cost of processing each record twice, we reduce the number of seeks for reading in from 640,000 to 45,600, and we haven't spent a penny for extra RAM.

But what about the total time for the merge? We save on access times for inputting data, but there are costs. We now have to transmit all of the records four times instead of two, so transmission time increases by 651 seconds. Also, we write the records out twice, rather than once, requiring an extra 40,000 seeks. When we add in these extra operations, the total time for the merge is 5,907 seconds, or about 1 hour, 38 minutes, compared to 5 hours, 20 minutes for the single-step merge. These results are summarized in Table 7.3.

Once more, note that the essence of what we have done is to find a way to increase the available buffer space for each run. We trade extra passes over the data for a dramatic decrease in random accesses. In this case the trade is certainly a profitable one.

If we can achieve such an improvement with a two-step merge, can we do even better with three steps? Perhaps, but it is important to note in Table 7.3 that we have reduced total seek and rotation times to the point where transmission times are about as expensive. Since a three-step merge would require yet another pass over the file, we may have reached a point of diminishing returns.
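The seek counts compared above can be verified with a few lines of C. This sketch is not from the text; it simply reproduces the input-seek arithmetic for the single-step and two-step merges of the 800-run example.

    /* Input seeks for a one-step 800-way merge versus the two-step scheme. */
    #include <stdio.h>

    int main(void)
    {
        const long runs = 800;            /* initial 10,000-record runs      */

        /* One step: 800 buffers, each holding 1/800 of a run. */
        long one_step = runs * runs;      /* 640,000                         */

        /* Two steps: 25 groups of 32 runs, then a 25-way merge of the results. */
        long first  = 25L * 32 * 32;      /* 1,024 seeks per group of 32     */
        long second = 25L * 800;          /* 800 seeks per intermediate run  */

        printf("one-step merge: %ld seeks\n", one_step);
        printf("two-step merge: %ld seeks\n", first + second);   /* 45,600  */
        return 0;
    }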

TABLE 7.3 Time estimates for two-step merge sort of 800-megabyte file, assuming use of the hypothetical disk drive described in Table 3.2. The total time is 1 hour, 38 minutes.

                       Number      Amount Transferred   Seek + Rotation    Transfer Time   Total Time
                       of Seeks    (Megabytes)          Time (Seconds)     (Seconds)       (Seconds)
1st Merge: Reading      25,600        800                    673               651            1,324
1st Merge: Writing      40,000        800                  1,052               651            1,703
2nd Merge: Reading      20,000        800                    526               651            1,177
2nd Merge: Writing      40,000        800                  1,052               651            1,703
Totals                 125,600      3,200                  3,303             2,604            5,907

We also could have chosen to distribute our initial runs differently. How would the merge perform if we did 400 two-way merges, followed by one 400-way merge, for instance? A rigorous analysis of the trade-offs between seek and rotation time and transmission time, accounting for different buffer sizes, is beyond the scope of our treatment of the subject.* Our goal is simply to establish the importance of the interacting roles of the major costs in performing merge sorts: seek and rotation time, transmission time, buffer size, and number of runs. In the next section we focus on the pivotal role of the last of these, the number of runs.

*For more rigorous and detailed analyses of these issues, consult the references cited at the end of this chapter, especially Knuth (1973b) and Salzberg (1988, 1990).

7.5.6 Increasing Run Lengths Using Replacement Selection

What would happen if we could somehow increase the size of the initial runs? Consider, for example, our earlier sort of 8,000,000 records in which each record was 100 bytes. Our initial runs were limited to approximately 10,000 records because the RAM work area was limited to one megabyte. Suppose we are somehow able to create runs of twice this length, containing 20,000 records each. Then, rather than needing to perform an 800-way merge, we need to do only a 400-way merge. The available RAM is divided into 400 buffers, each holding 1/800th of a run. (Why?) Hence, the number of seeks required per run is 800, and the total number of seeks is

800 seeks/run x 400 runs = 320,000 seeks,

half the number required for the 800-way merge of 10,000-record runs.

In general, if we can somehow increase the size of the initial runs, we decrease the amount of work required during the merge step of the sorting process. A longer initial run means fewer total runs, which means a lower-order merge, which means bigger buffers, which means fewer seeks. But how, short of buying twice as much memory for the computer, can we create initial runs that are twice as large as the number of records that we can hold in RAM? The answer, once again, involves sacrificing some efficiency in our in-RAM operations in return for decreasing the amount of work to be done on disk. In particular, the answer involves the use of an algorithm known as replacement selection.

Replacement selection is based on the idea of always selecting the key from memory that has the lowest value, outputting that key, and then replacing it with a new key from the input list. Replacement selection can be implemented as follows:

1. Read in a collection of records and sort them using heapsort. This creates a heap of sorted values. Call this heap the primary heap.
2. Instead of writing out the entire primary heap in sorted order (as we do in a normal heapsort), write out only the record whose key has the lowest value.
3. Bring in a new record and compare the value of its key with that of the key that has just been output.
   a. If the new key value is higher, insert the new record into its proper place in the primary heap along with the other records that are being selected for output. (This makes the new record part of the run that is being created, which means that the run being formed will actually be larger than the number of keys that can be held in memory at one time.)
   b. If the new record's key value is lower, place the record in a secondary heap of records with key values lower than those already written out. (It cannot be put into the primary heap, because it cannot be included in the run that is being created.)
4. Repeat step 3 as long as there are records left in the primary heap and there are records to be read in. When the primary heap is empty, make the secondary heap into the primary heap and repeat steps 2 and 3.
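The following compact C sketch follows the steps above, but for brevity it keeps the working records in a small array and tags each one with the run it belongs to, using a linear scan in place of the primary and secondary heaps; the run-splitting logic is the same. The input values are made up for the example.

    /* Replacement selection sketch: tags instead of two heaps. */
    #include <stdio.h>

    #define P 3                        /* records that fit in memory */

    void replacement_selection(const int in[], int n)
    {
        int key[P], tag[P];
        int next = 0, held = 0, run = 1;

        while (held < P && next < n) {     /* prime the working storage */
            key[held] = in[next++];
            tag[held] = 1;
            held++;
        }

        printf("run 1:");
        while (held > 0) {
            int best = -1;
            for (int i = 0; i < held; i++) /* lowest key in the current run */
                if (tag[i] == run && (best < 0 || key[i] < key[best]))
                    best = i;

            if (best < 0) {                /* current run exhausted */
                run++;
                printf("\nrun %d:", run);
                continue;
            }

            int out = key[best];
            printf(" %d", out);

            if (next < n) {                /* step 3: bring in a new record */
                key[best] = in[next++];
                tag[best] = (key[best] >= out) ? run : run + 1;  /* 3a vs. 3b */
            } else {                       /* input exhausted: drop the slot */
                key[best] = key[held - 1];
                tag[best] = tag[held - 1];
                held--;
            }
        }
        printf("\n");
    }

    int main(void)
    {
        int in[] = { 12, 35, 7, 40, 3, 18, 50, 2, 28 };
        replacement_selection(in, (int)(sizeof in / sizeof in[0]));
        return 0;
    }

With the sample input and P = 3, the sketch produces two runs, "7 12 35 40 50" and "2 3 18 28", even though only three keys are ever held in memory at once.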

To see how this works, let's begin with a simple example, using an input list of only six keys and a memory work area that can hold only three keys. As Fig. 7.27 illustrates, we begin by reading into RAM the three keys that fit there and use heapsort to sort them. We select the key with the minimum value, which happens to be 5 in this example, and output that key. We now have room in the heap for another key, so we read one from the input list. The new key, which has a value of 12, now becomes a member of the set of keys to be sorted into the output run. In fact, since it is smaller than the other keys in RAM, 12 is the next key that is output. A new key is read into its place, and the process continues. When the process is complete, it produces a sorted list of six keys while using only three memory locations.

FIGURE 7.27 Example of the principle underlying replacement selection.

In this example the entire file is created using only one heap, but what happens if the fourth key in the input list is 2 rather than 12? This key arrives in memory too late to be output into its proper position relative to the other keys: The 5 has already been written to the output list. Step 3b in the algorithm handles this case by placing such values in a second heap, to be included in the next run. Figure 7.28 illustrates how this process works. During the first run, when keys are brought in that are too small to be included in the primary heap, we mark them with parentheses, indicating that they have to be held for the second run.

It is interesting to use this example to compare the action of replacement selection to the procedure we have been using up to this point, namely that of reading keys into RAM, sorting them, and outputting a run that is the size of the RAM space. In this example our input list contains 13 keys. A series of successive RAM sorts, given only three memory locations, results in five runs. The replacement selection procedure results in only two runs. Since the disk accesses during a multiway merge can be a major expense, replacement selection's ability to create longer, and therefore fewer, runs can be an important advantage.

FIGURE 7.28 Step-by-step operation of replacement selection working to form two sorted runs. (Input: 33, 18, 24, 58, 14, 17, 67, 21, 7, 12, 47, 5, 16; keys held over for the second run are shown in parentheses.)

Two major questions emerge at this point:

1. Given P locations in memory, how long a run can we expect replacement selection to produce, on the average?
2. What are the costs of using replacement selection?

Average Run Length for Replacement Selection   The answer to the first question is that, on the average, we can expect a run length of 2P, given P memory locations. Knuth (1973b)* provides an excellent description of an intuitive argument for why this is so:

A clever way to show that 2P is indeed the expected run length was discovered by E. F. Moore, who compared the situation to a snowplow on a circular track [U.S. Patent 2983904 (1961), Cols. 3-4]. Consider the situation shown [below]; flakes of snow are falling uniformly on a circular road, and a lone snowplow is continually clearing the snow. Once the snow has been plowed off the road, it disappears from the system. Points on the road may be designated by real numbers x, 0 <= x < 1; a flake of snow falling at position x represents an input record whose key is x, and the snowplow represents the output of replacement selection. The ground speed of the snowplow is inversely proportional to the height of the snow that it encounters, and the situation is perfectly balanced so that the total amount of snow on the road at all times is exactly P. A new run is formed in the output whenever the plow passes point 0.

After this system has been in operation for awhile, it is intuitively clear that it will approach a stable situation in which the snowplow runs at constant speed (because of the circular symmetry of the track). This means that the snow is at constant height when it meets the plow, and the height drops off linearly in front of the plow as shown [below]. It follows that the volume of snow removed in one revolution (namely the run length) is twice the amount of snow present at any one time (namely P).

[Diagram from Knuth: the snowplow on the circular road, with falling snow, existing snow, and future snow marked along the total length of the road.]

*From Donald Knuth, The Art of Computer Programming, 1973, Addison-Wesley, Reading, Mass. Pages 254-55 and Figs. 64 and 65. Reprinted with permission.

So, given a random ordering of keys, we can expect replacement selection to form runs that contain about twice as many records as we can hold in memory at one time. It follows that replacement selection creates half as many runs as does a series of RAM sorts, assuming that the replacement selection and the RAM sort have access to the same amount of memory. (As we see in a moment, the replacement selection does, in fact, have to make do with less memory than does the RAM sort.)

It is actually often possible to create runs that are substantially longer than 2P. In many applications, the order of the records is not wholly random; the keys are often already partially in ascending order. In these cases replacement selection can produce runs that, on the average, exceed 2P. (Consider what would happen if the input list is already sorted.) Replacement selection becomes an especially valuable tool for such partially ordered input files.
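The 2P estimate is easy to check experimentally. The program below is only an illustration, not part of the original argument: it feeds random keys (from rand(), with an arbitrary seed) through the tagged-scan version of replacement selection sketched earlier and reports the average run length, which comes out close to 2P.

    /* Small experiment in support of the 2P estimate. */
    #include <stdio.h>
    #include <stdlib.h>

    #define P 500        /* records that fit in the working area */
    #define N 200000     /* total records "read" from the input  */

    int main(void)
    {
        int key[P], tag[P];
        long output = 0, consumed = 0;
        int run = 1;

        srand(42);
        for (int i = 0; i < P; i++) { key[i] = rand(); tag[i] = 1; }
        consumed = P;

        while (output < N) {
            int best = -1;
            for (int i = 0; i < P; i++)
                if (tag[i] == run && (best < 0 || key[i] < key[best]))
                    best = i;
            if (best < 0) { run++; continue; }   /* start the next run */

            int out = key[best];
            output++;
            if (consumed < N) {                  /* replace with a new input key */
                key[best] = rand();
                tag[best] = (key[best] >= out) ? run : run + 1;
                consumed++;
            } else {
                tag[best] = 0;                   /* no more input: retire slot */
            }
        }
        printf("%d runs, average length %.0f (2P = %d)\n",
               run, (double)N / run, 2 * P);
        return 0;
    }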

The Costs of Using Replacement Selection   Unfortunately, the no-free-lunch rule applies to replacement selection, as it does to so many other areas of file structure design. In the worked-by-hand examples we have looked at up to this point, we have been inputting records into memory one at a time. We know, in fact, that the cost of seeking out to disk for every single input record is prohibitive. Instead, we want to buffer the input, which means, in turn, that we are not able to use all of the memory for the operation of replacement selection. Some of it has to be used for input and output buffering. This cost, and the effect it has on available space for sorting, is illustrated in Fig. 7.29.

FIGURE 7.29 In-RAM sort versus replacement selection, in terms of their use of available RAM for the sorting operation. (a) In-RAM sort: all available space used for the sort (heapsort area). (b) Replacement selection: some of the available space is used for I/O buffers.

To see the effects of this need for buffering during the replacement selection step, let's return to our example in which we sort 8 million records, given a memory area that can hold 10,000 records.

For RAM sorting methods such as heapsort, which simply read records into memory until it is full, we can perform sequential reads of 10,000 records at a time, until 800 runs have been created. This means that the sort step requires 1,600 seeks: 800 for reading and 800 for writing.

For replacement selection we might use an input/output buffer that can hold, for example, 2,500 records, leaving enough space to hold 7,500 records for the actual replacement selection process. If the I/O buffer holds 2,500 records, we can perform sequential reads of 2,500 records at a time, so it takes 8,000,000/2,500 = 3,200 seeks to access all records in the file. This means that the sort step for replacement selection requires 6,400 seeks: 3,200 for reading and 3,200 for writing.

If the records occur in a random key sequence, the average run length using replacement selection will be 2 x 7,500 = 15,000 records, and there will be about 8,000,000/15,000 = 534 such runs produced. For the merge step we divide the one megabyte of RAM into 534 buffers, which hold an average of 18.73 records, so we end up making 15,000/18.73 = 801 seeks per run, and

801 seeks per run x 534 runs = 427,734 seeks altogether.

Table 7.4 compares the access times required to sort the 8 million records using both a RAM sort and replacement selection. The table includes our initial 800-way merge and two replacement selection examples. The second replacement selection example, which produces runs of 40,000 records while using only 7,500 record storage locations in memory, assumes that there is already a good deal of sequential ordering within the input records.

It is clear that, given randomly distributed input data, replacement selection can substantially reduce the number of runs formed. Even though replacement selection requires four times as many seeks to form the runs, the reduction in the amount of seeking effort required to merge the runs more than offsets the extra amount of seeking that is required to form the runs. And when the original data is assumed to possess enough order to make the runs 40,000 records long, replacement selection produces less than one third as many seeks as RAM sorting.

7.5.7 Replacement Selection Plus Multistep Merging

While these comparisons highlight the advantages of replacement selection over RAM sorting, we would probably not in reality choose the one-step merge patterns shown in Table 7.4. We have seen that two-step merges can result in much better performance than one-step merges. Table 7.5 shows how these same three sorting schemes compare when two-step merges are used.

TABLE 7.4 Access times for sorting the 8,000,000 records with one-step merges: a RAM sort (800-way merge) and replacement selection with random and with partially ordered input.

TABLE 7.5 Access times for the same three sorting schemes when two-step merges are used.

From Table 7.5 we see that the total number of seeks is dramatically less in every case than it was for the one-step merges. Clearly, the method used to form runs is not nearly as important as the use of multistep, rather than one-step, merges.

Furthermore, since the number of seeks required for the merge steps is much smaller in all cases, while the number of seeks required to form runs remains the same, the latter have a bigger effect proportionally on the final total, and the differences between the RAM-sort based method and replacement selection are diminished.

The differences between the one-step and two-step merges are exaggerated by the results in Table 7.5, because they don't take into account the amount of time spent transmitting the data. The two-step merges require that we transfer the data between RAM and disk two more times than do the one-step merges. Table 7.6 shows the results after adding transmission time to our results. The two-step merges are still better, and replacement selection still wins, but the results are less dramatic.

TABLE 7.6 The sorting schemes of Table 7.5 with transmission times added to the totals.

7.5.8 Using Two Disk Drives with Replacement Selection

Interestingly, and fortunately, replacement selection offers an opportunity to save on both transmission and seek times in ways that RAM sort methods do not. As usual, this is at a cost, but if sorting time is expensive, it could well be worth the cost.

Suppose that we have two disk drives available that we can assign the separate dedicated tasks of reading and writing during replacement selection. One drive, which contains the original file, does only input, and the other does only output. This has two very nice results: (1) It means that input and output can overlap, reducing transmission time by as much as 50%; and (2) seeking is virtually eliminated.

If we have two disks at our disposal, we should configure memory to take advantage of them. We allocate two buffers each for input and output, permitting double buffering, and allocate the rest of memory for forming the selection tree. This arrangement is illustrated in Fig. 7.30.

FIGURE 7.30 Memory organization for replacement selection: two input buffers, two output buffers, and the area used for the selection tree.

Let's see how the merge sort process might proceed to take advantage of this configuration.

First, the sort phase. We begin by reading in enough records to fill up the heap-sized part of memory, and form the heap. Next, as we move records from the heap into one of the output buffers, we replace those records with records from one of the input buffers, adjusting the tree in the usual manner. While we empty one input buffer into the tree, we can be filling the other one from the input disk. This permits processing and input to overlap. Similarly, at the same time that we are filling one of the output buffers from the tree, we can be transmitting the contents of the other to the output disk. In this way, run selection and output can overlap.

During the merge phase, the output disk becomes the input disk, and vice versa. Since the runs are all on the same disk, seeking will occur on the input disk. But output is still sequential, since it goes to a dedicated drive.

Because of the overlapping of so many parts of this procedure, it is difficult to estimate the amount of time the procedure is likely to take. But it should be clear that by substantially reducing seeking and transmission time, we are attacking those parts of the sort merge that are the most costly.
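One possible way to realize the input side of this double buffering on a POSIX system is sketched below. This is our own illustration, not the configuration described in the text: it uses the POSIX AIO calls (aio_read, aio_suspend, aio_return) so that the read of the next block is already in flight while the block just received is being consumed; output could be double-buffered symmetrically. (Some systems require linking with -lrt.)

    /* Double-buffered sequential input: overlap reading with processing. */
    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define BUFSZ (64 * 1024)

    int main(int argc, char *argv[])
    {
        if (argc != 2) { fprintf(stderr, "usage: %s file\n", argv[0]); return 1; }
        int fd = open(argv[1], O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        static char buf[2][BUFSZ];
        struct aiocb cb;
        off_t offset = 0;
        long checksum = 0;
        int cur = 0;

        memset(&cb, 0, sizeof cb);             /* issue the first read */
        cb.aio_fildes = fd;
        cb.aio_buf    = buf[cur];
        cb.aio_nbytes = BUFSZ;
        cb.aio_offset = offset;
        aio_read(&cb);

        for (;;) {
            const struct aiocb *list[1] = { &cb };
            while (aio_error(&cb) == EINPROGRESS)   /* wait for buf[cur] */
                aio_suspend(list, 1, NULL);
            ssize_t n = aio_return(&cb);
            if (n <= 0)
                break;
            offset += n;

            int next = 1 - cur;                     /* start the next read now */
            memset(&cb, 0, sizeof cb);
            cb.aio_fildes = fd;
            cb.aio_buf    = buf[next];
            cb.aio_nbytes = BUFSZ;
            cb.aio_offset = offset;
            aio_read(&cb);

            /* ... and consume buf[cur] while that read is in flight.
             * (A real sort would feed these bytes to the selection tree.) */
            for (ssize_t i = 0; i < n; i++)
                checksum += (unsigned char) buf[cur][i];

            cur = next;
        }
        printf("read %lld bytes (checksum %ld)\n", (long long) offset, checksum);
        close(fd);
        return 0;
    }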

7.5.9 More Drives? More Processors?

If two drives can improve performance, why not three, or four, or more? Isn't it true that the more drives we have to hold runs during the merge phase, the faster we can perform I/O? Up to a point this is true, but of course the number and speed of I/O processors must be sufficient to keep up with the data streaming in and out. And there will also be a point at which I/O becomes so fast that processing can't keep up with it.

But who is to say that we can use only one processor? A decade ago, it would have been far-fetched to imagine doing sorting with more than one processor, but it is very common now to be able to dedicate more than one processor to a single job. Possibilities include the following:

Mainframe computers, many of which spend a great deal of their time sorting, commonly come with two or more processors that can simultaneously work on different parts of the same problem.
Vector and array processors can be programmed to execute certain kinds of algorithms orders of magnitude faster than scalar processors.
Massively parallel machines provide thousands, even millions, of processors that can operate independently and at the same time communicate in complex ways with one another.
Very fast local area networks and communication software make it relatively easy to parcel out different parts of the same process to several different machines.

It is not appropriate, in this text, to cover in detail the implications of these newer architectures for external sorting. But just as the changes over the past decade in the availability and performance of RAM and disk storage have altered the way we look at external sorting, we can expect it to change many more times as the current generation of new architectures becomes commonplace.

7.5.10 Effects of Multiprogramming

In our discussions of external sorting on disk we are, of course, making tacit assumptions about the computing environment in which this merging is taking place. We are assuming, for example, that the merge job is running in a dedicated environment (no multiprogramming). If, in fact, the operating system is multiprogrammed, as it normally is, the total time for the I/O might be longer, as our job waits for other jobs to perform their I/O.

On the other hand, one of the reasons for multiprogramming is to allow the operating system to find ways to increase the efficiency of the overall system by overlapping processing and I/O among different jobs. So the system could be performing I/O for our job while it was doing CPU processing on others, and vice versa, diminishing any delays caused by overlap of I/O and CPU processing within our job.

Effects such as these are very hard to predict, even when you have much information about your system. Only experimentation can determine what real performance will be like on a busy, multiuser system.

7.5.11 A Conceptual Toolkit for External Sorting

We can now list many tools that can improve external sorting performance. It should be our goal to add these various tools to our conceptual toolkit for designing external sorts and to pull them out and use them whenever they are appropriate. A full listing of our new set of tools would include the following:

For in-RAM sorting, use heapsort for forming the original list of sorted elements in a run. With it and double buffering, we can overlap input and output with internal processing.
Use as much RAM as possible. It makes the runs longer and provides bigger and/or more buffers during the merge phase.
If the number of initial runs is so large that total seek and rotation time is much greater than total transmission time, use a multistep merge. It increases the amount of transmission time but can decrease the number of seeks enormously.
Consider using replacement selection for initial run formation, especially if there is a possibility that the runs will be partially ordered.
Use more than one disk drive and I/O channel so reading and writing can overlap. This is especially true if there are not other users on the system.
Keep in mind the fundamental elements of external sorting and their relative costs, and look for ways to take advantage of new architectures and systems, such as parallel processing and high-speed local area networks.

7.6 Sorting Files on Tape

There was a time when it was usually faster to perform large external sorts on tape than on disk, but this is much less the case now. Nevertheless, tape is still used in external sorting, and we would be remiss if we did not consider sort merge algorithms designed for tape.

There are a large number of approaches to sorting files on tape. After approximately 100 pages of closely reasoned discussion of different alternatives for tape sorting, Knuth (1973b) summarizes his analysis in the following way:

Theorem A. It is difficult to decide which merge pattern is best in a given situation.

Because of the complexity and number of alternative approaches and because of the way that these alternatives depend so closely on the specific characteristics of the hardware at a particular computer installation, our objective here is merely to communicate some of the fundamental issues associated with tape sorting and merging. For a more comprehensive discussion of specific alternatives we recommend Knuth's (1973b) work as a starting point.

Viewed from a general perspective, the steps involved in sorting on tape resemble those that we discussed with regard to sorting on disk:

1. Distribute the unsorted file into sorted runs; and
2. Merge the runs into a single sorted file.

Replacement selection is almost always a good choice as a method for creating the initial runs during a tape sort. You will remember that the problem with replacement selection when we are working on disk is that the amount of seeking required during run creation more than offsets the advantage of creating longer runs. This seeking problem disappears when the input is from tape. So, for a tape-to-tape sort, it is almost always advisable to take advantage of the longer runs created by replacement selection.

7.6.1 The Balanced Merge

Given that the question of how to create the initial runs has such a straightforward answer, it is clear that it is in the merging process that we encounter all of the choices and complexities implied by Knuth's tongue-in-cheek theorem. These choices begin with the question of how to distribute the initial runs on tape and extend into questions about the process of merging from this initial distribution. Let's look at some examples to show what we mean.

Suppose we have a file that, after the sort phase, has been divided into 10 runs. We look at a number of different methods for merging these runs on tape, assuming that our computer system has four tape drives. Since the initial, unsorted file is read from one of the drives, we have the choice of initially distributing the 10 runs on two or three of the other drives. We begin with a method called two-way balanced merging, which requires that the initial distribution be on two drives, and that at each step of the merge, except the last, the output be distributed on two drives. Balanced merging is the simplest tape merging algorithm that we look at; it is also, as you will see, the slowest.

The balanced merge proceeds according to the pattern illustrated in Fig. 7.31.

FIGURE 7.31 Balanced four-tape merge of 10 runs.

Step 1   T1: R1 R3 R5 R7 R9        T2: R2 R4 R6 R8 R10
Step 2   T3: R1-R2 R5-R6 R9-R10    T4: R3-R4 R7-R8
Step 3   T1: R1-R4 R9-R10          T2: R5-R8
Step 4   T3: R1-R8                 T4: R9-R10
Step 5   T1: R1-R10

This balanced merge process is expressed in an alternate, more compact form in Fig. 7.32. The numbers inside the table are the run lengths measured in terms of the number of initial runs included in each merged run. For example, in step 1 all the input runs consist of a single initial run. By step 2 the input runs each consist of a pair of initial runs. At the start of step 3, tape drive T1 contains one run consisting of four initial runs followed by a run consisting of two initial runs. This method of illustration more clearly shows the way some of the intermediate runs combine and grow into runs of lengths 2, 4, and 8, whereas the one run that is copied again and again stays at length 2 until the end. The form used in this illustration is used throughout the following discussions on tape merging.

FIGURE 7.32 Balanced four-tape merge of 10 runs expressed in a more compact table notation.

          T1           T2           T3        T4
Step 1    1 1 1 1 1    1 1 1 1 1                        Merge ten runs
Step 2                              2 2 2     2 2       Merge ten runs
Step 3    4 2          4                                Merge ten runs
Step 4                              8         2         Merge ten runs
Step 5    10

Since there is no seeking, the cost associated with balanced merging on tape is measured in terms of how much time is spent transmitting the data. In the example, we passed over all of the data four times during the merge phase. In general, given some number of initial runs, how many passes over the data will a two-way balanced merge take? That is, if we start with N runs, how many passes are required to reduce the number of runs to 1? Since each step combines two runs, the number of runs after each step is half the number for the previous step. If p is the number of passes, then we can express this relationship as follows:

(1/2)^p x N <= 1,

from which it can be shown that

p = ceiling(log2 N).

In our simple example, N = 10, so four passes over the data were required.

Recall that for our partially sorted 800-megabyte file there were 200 runs, so ceiling(log2 200) = 8 passes are required for a balanced merge. If reading and writing overlap perfectly, each pass takes about 11 minutes,* so the total time is 1 hour, 28 minutes. This time is not competitive with our disk-based merges, even when a single disk drive is used. The transmission times far outweigh the savings in seek times.

*This assumes the 6,250 bpi tape used in the examples in Chapter 3. If the transport speed is 200 inches per second, the transmission rate is 1,250 Kbytes per second, assuming no blocking. At this rate an 800-megabyte file takes 640 seconds, or 10.67 minutes to read.

7.6.2 The K-way Balanced Merge

If we want to improve on this approach, it is clear that we must find ways to reduce the number of passes over the data. A quick look at the formula tells us that we can reduce the number of passes by increasing the order of each merge. Suppose, for instance, that we have 20 tape drives, 10 for input and 10 for output, at each step. Since each step combines 10 runs, the number of runs after each step is one tenth the number for the previous step. Hence, we have

(1/10)^p x N <= 1

and

p = ceiling(log10 N).

In general, a k-way balanced merge is one in which the order of the merge at each step (except possibly the last) is k. Hence, the number of passes required for a k-way balanced merge with N initial runs is

p = ceiling(logk N).

For a 10-way balanced merge of our 800-megabyte file with 200 runs, ceiling(log10 200) = 3, so three passes are required. The best estimated time now is reduced to a more respectable 42 minutes. Of course, the cost is quite high: We must keep 20 working tape drives on hand for the merge.
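The pass-count formula is simple enough to compute directly. The following few lines of C are only an illustration of the arithmetic above, reproducing the 4-pass, 8-pass, and 3-pass figures used in the examples.

    /* Passes for a k-way balanced merge of n initial runs:
     * the smallest p with n / k^p <= 1, i.e. the ceiling of log_k(n). */
    #include <stdio.h>

    static int passes(long n, long k)
    {
        int p = 0;
        while (n > 1) {             /* each pass divides the run count by k */
            n = (n + k - 1) / k;    /* ceiling division */
            p++;
        }
        return p;
    }

    int main(void)
    {
        printf(" 2-way,  10 runs: %d passes\n", passes(10, 2));    /* 4 */
        printf(" 2-way, 200 runs: %d passes\n", passes(200, 2));   /* 8 */
        printf("10-way, 200 runs: %d passes\n", passes(200, 10));  /* 3 */
        return 0;
    }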

7.6.3 Multiphase Merges

The balanced merging algorithm has the advantage of being very simple; it is easy to write a program to perform this algorithm. Unfortunately, one reason it is simple is that it is "dumb" and cannot take advantage of opportunities to save work. Let's see how we can improve on it.

We can begin by noting that when we merge the extra run with empty runs in steps 3 and 4, we don't really accomplish anything. Figure 7.33 shows how we can dramatically reduce the amount of work that has to be done by simply not copying the extra run during step 3. Instead of merging this run with a dummy run, we simply stop tape T3 where it is. Tapes T1 and T2 now each contains a single run made up of four of the initial runs. We rewind all the tapes but T3 and then perform a three-way merge of the runs on tapes T1, T2, and T3, writing the final result on T4. Adding this intelligence to the merging procedure reduces the number of initial runs that must be read and written from 40 down to 28.

FIGURE 7.33 Modification of balanced four-tape merge that does not rewind between steps 2 and 3 to avoid copying runs.

          T1           T2           T3        T4
Step 1    1 1 1 1 1    1 1 1 1 1                        Merge ten runs
Step 2                              2 2 2     2 2       Merge eight runs
Step 3    4            4            2                   Merge ten runs
Step 4                                        10

The example in Fig. 7.33 clearly indicates that there are ways to improve on the performance of balanced merging. It is important to be able to state, in general terms, what it is about this second merging pattern that saves work:

We use a higher-order merge. In place of two two-way merges, we use one three-way merge.
We extend the merging of runs from one tape over several steps. Specifically, we merge some of the runs from T3 in step 3 and some in step 4. We could say that we merge the runs from T3 in two phases.

These ideas, the use of higher-order merge patterns and the merging of runs from a tape in phases, are the basis for two well-known approaches to merging called polyphase merging and cascade merging. In general, these merges share the following characteristics:

The initial distribution of runs is such that at least the initial merge is a (J - 1)-way merge, where J is the number of available tape drives.
The distribution of the runs across the tapes is such that the tapes often contain different numbers of runs.

Figure 7.34 illustrates how a polyphase merge can be used to merge 10 runs distributed on four tape drives. This merge pattern reduces the number of initial runs that must be read and written from 40 (for a balanced two-way merge) to 25.

FIGURE 7.34 Polyphase four-tape merge of 10 runs.

          T1           T2        T3       T4
Step 1    1 1 1 1 1    1 1 1     1 1                Merge six runs
Step 2    1 1 1        1                  3 3       Merge five runs
Step 3    1 1                    5        3         Merge four runs
Step 4    1            4         5                  Merge ten runs
Step 5                                    10

It is easy to see that this reduction is a consequence of the use of several three-way merges in place of two-way merges. It should also be clear that the ability to do these operations as three-way merges is related to the uneven nature of the initial distribution. Consider, for example, what happens if the initial distribution of runs is 4-3-3 rather than 5-3-2. We can perform three three-way merges to open up space on T3, but this also clears all the runs off of T2 and leaves only a single run on T1. Obviously, we are not able to perform another three-way merge as a second step.

Several questions arise at this point:

1. How does one choose an initial distribution that leads readily to an efficient merge pattern?
2. Are there algorithmic descriptions of the merge patterns, given an initial distribution?
3. Given N runs and J tape drives, is there some way to compute the optimal merging performance so we have a yardstick against which to compare the performance of any specific algorithm?

Precise answers to these questions are beyond the scope of this text; in particular, the answer to question 3 requires a more mathematical approach to the problem than the one we have taken here. Readers wanting more than an intuitive understanding of how to set up initial distributions should consult Knuth (1973b).

7.6.4 Tapes versus Disks for External Sorting

A decade ago 100 K of RAM was considered a substantial amount of memory to allocate to any single job, and extra disk drives were very costly. This meant that many of the disk sorting techniques to decrease seeking that we have seen were not available to us or were very limited.

Suppose, for instance, that we want to sort our 800-megabyte file, and there is only 100 K of RAM available, instead of one megabyte. The approach that we used for allocating memory for replacement selection would provide 25 K for buffering, and 75 K for our selection tree. From this we can expect 5,334 runs of 1,500 records each, versus 534 when there is a megabyte of RAM. For a one-step merge, this 10-fold increase in the number of runs results in a 100-fold increase in the number of seeks. What took three hours with one megabyte of memory now takes 300 hours, just for the seeks! No wonder tapes, which are basically sequential and require no seeking, were preferred.

But now RAM is much more readily available. Runs can be longer and fewer, and seeks are much less of a problem. Transmission time is now more important. The best way to decrease transmission time is to reduce the number of passes over the data, and we can do this by increasing the order of the merge. Since disks are random-access devices, very large order merges can be performed, even if there is only one drive. Tapes, however, are not random-access devices; we need an extra tape drive for every extra run we want to merge. Unless a large number of drives is available, we can only perform low-order merges, and that means large numbers of passes over the data. Disks are better.

7.7 Sort-Merge Packages

Many very good utility programs are available for users who need to sort large files. Often the programs have enough intelligence to choose from one of several strategies, depending on the nature of the data to be sorted and the available system configuration. They also often allow users to exert some control (if they want it) over the organization of data and strategies used. Consequently, even if you are using a commercial sort package rather than designing your own sorting procedure, it is useful to be familiar with the variety of different ways to design merge sorts. It is especially important to have a good general understanding of the most important factors and trade-offs influencing performance.

7.8 Sorting and Cosequential Processing in UNIX

UNIX has a number of utilities for performing cosequential processing. It also has sorting routines, but nothing at the level of sophistication that you find in production sort-merge packages. In the following discussion we introduce some of these utilities. For full details, consult the UNIX documentation.

7.8.1 Sorting and Merging in UNIX

Because UNIX is not an environment in which one expects to do frequent sorting of large files of the type we discuss in this chapter, sophisticated sort-merge packages are not generally available on UNIX systems. Still, the sort routines you find in UNIX are quick and flexible and quite adequate for the types of applications that are common in a UNIX environment. We can divide UNIX sorting into two categories: (1) the sort command, and (2) callable sorting routines.

The UNIX sort Command   The sort command has many different options, but the simplest one is to sort the lines in an ASCII file in ascending lexical order. (A line is any sequence of characters ending with the new-line character '\n'.) By default the sort utility takes its input file name from the command line and writes the sorted file to standard output. If the file is too large to fit in RAM, sort performs a merge sort. If more than one file is named on the input line, sort sorts and merges the files.

As a simple example, suppose we have an ASCII file called team with names of members of a basketball team, together with their classes and their scoring averages:

Jean Smith Senior 7.8
Chris Mason Junior 9.6
Pat Jones Junior 3.2
Leslie Brown Sophomore 18.2
Pat Jones Freshman 11.4

To sort the file, enter

$ sort team
Chris Mason Junior 9.6
Jean Smith Senior 7.8
Leslie Brown Sophomore 18.2
Pat Jones Freshman 11.4
Pat Jones Junior 3.2

Notice that by default sort considers an entire line as the sort key. Hence, of the two players named "Pat Jones," the freshman occurs first in the output because "Freshman" is lexically smaller than "Junior." The assumption that the key is an entire line can be overridden by sorting on specified key fields. For sort a key field is assumed to be any sequence of characters delimited by spaces or tabs. You can indicate which key fields to use for sorting by giving their positions:

+pos1 -pos2

where pos1 tells how many fields to skip before starting the key, and pos2 tells which field to end with. If pos2 is omitted, the key extends to the end of the line. Hence, entering

$ sort +1 -2 team

causes the file team to be sorted according to the last names. (There is also a form of pos1 and pos2 that allows you to specify the character within a field to start a key with.)

The following options, among others, allow you to override the default ASCII ordering used by sort:

-d   Use "dictionary" ordering: Only letters, digits, and blanks are significant in comparisons.
-f   "Fold" lowercase letters into uppercase. (This is the canonical form that we defined in Chapter 4.)
-r   "Reverse" the sense of comparison: Sort in descending ASCII order.

Notice that sort sorts lines, and within lines it compares groups of characters delimited by white space. In the language of Chapter 4, records are lines, and fields are groups of characters delimited by white space. This is consistent with the most common UNIX view of fields and records within UNIX text files.

The UNIX Library Routine qsort( )   The UNIX library routine qsort( ) is a general sorting routine. Given a table of data, qsort( ) sorts the elements in the table in place. The table could be the contents of a file, loaded into RAM, where the elements of the table are its records. In C, qsort( ) is defined as follows:

qsort(char *base, int nel, int width, int (*compar)( ))

The argument base is a pointer to the base of the data; nel is the number of elements in the table; and width is the size of each element. The last argument, compar( ), is the name of a user-supplied comparison function that qsort( ) uses to compare keys. Compar( ) must have two parameters, which are pointers to elements that are to be compared. When qsort( ) needs to compare two elements, it passes to compar( ) pointers to these elements, and compar( ) compares them, returning an integer that is less than, equal to, or greater than zero, depending on whether the first argument is considered less than, equal to, or greater than the second argument. A full explanation of how to use qsort( ) is beyond the scope of this text. Consult the UNIX documentation for details.
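A small, self-contained use of qsort( ) looks like the sketch below. The record layout and data are invented for the example, and the ANSI C prototype used here differs slightly from the older declaration shown above (the comparison function receives const void * pointers), but the calling pattern is the same.

    /* Sort an array of player records by scoring average with qsort( ). */
    #include <stdio.h>
    #include <stdlib.h>

    struct player {
        char  name[30];
        float avg;
    };

    /* compar( ) receives pointers to two table elements; it returns a value
     * less than, equal to, or greater than zero. */
    static int by_avg(const void *a, const void *b)
    {
        const struct player *p = (const struct player *) a;
        const struct player *q = (const struct player *) b;
        if (p->avg < q->avg) return -1;
        if (p->avg > q->avg) return 1;
        return 0;
    }

    int main(void)
    {
        struct player team[] = {
            { "Jean Smith",    7.8f },
            { "Chris Mason",   9.6f },
            { "Pat Jones",     3.2f },
            { "Leslie Brown", 18.2f },
        };
        size_t n = sizeof team / sizeof team[0];

        qsort(team, n, sizeof team[0], by_avg);

        for (size_t i = 0; i < n; i++)
            printf("%-15s %5.1f\n", team[i].name, team[i].avg);
        return 0;
    }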

7.8.2 Cosequential Processing Utilities in UNIX

UNIX provides a number of utilities for cosequential processing. The sort utility, when used to merge files, is one example. In this section we introduce three others: diff, cmp, and comm.

cmp   Suppose you find in your computer that you have two team files, one called team and the other called myteam. You think that the two files are the same, but you are not sure. You can use the command cmp to find out.

cmp compares two files. If they differ, it prints the byte and line number where they differ; otherwise it does nothing. If all of one file is identical to the first part of another, it reports that end-of-file was reached on the shorter file before any differences were found.

For example, suppose the files team and myteam have the following contents:

team:
Jean Smith Senior 7.8
Chris Mason Junior 9.6
Pat Jones Junior 3.2
Leslie Brown Sophomore 18.2
Pat Jones Freshman 11.4

myteam:
Jean Smith Senior 7.8
Stacy Fox Senior 1.6
Chris Mason Junior 9.6
Pat Jones Junior 5.2
Leslie Brown Sophomore 18.2
Pat Jones Freshman 11.4

cmp tells you where they differ:

$ cmp team myteam
team myteam differ: char 23, line 2

Since cmp simply compares files on a byte-by-byte basis until it finds a difference, it makes no assumptions about fields or records. It works with both text and nontext files.

diff   cmp is useful if you just want to know if two files are different, but it doesn't tell you much about how they differ. The command diff gives fuller information. diff tells what lines must be changed in two files to bring them into agreement. For example:

$ diff team myteam
1a2
> Stacy Fox Senior 1.6
3c4
< Pat Jones Junior 3.2
> Pat Jones Junior 5.2

The "1a2" indicates that after line 1 in the first file, we need to add line 2 from the second file to make them agree. This is followed by the line from the second file that would need to be added. The "3c4" indicates that we need to change line 3 in the first file to make it look like line 4 in the second file. This is followed by a listing of the two differing lines, where the leading "<" indicates that the line is from the first file, and the ">" indicates that it is from the second file.

One other indicator that could appear in diff output is "d", meaning that a line in the first file has been deleted in the second file. For example, "12d15" means that line 12 in the first file appears to have been deleted from being right after line 15 in the second file. Notice that diff, like sort, is designed to work with lines of text. It would not work well with non-ASCII text files.

comm   Whereas diff tells what is different about two files, comm compares two files, which must be ordered in ASCII collating sequence, to see what they have in common. The syntax for comm is the following:

comm [-123] file1 file2

comm produces three columns of output. Column 1 lists the lines that are in file1 only; column 2 lists lines in file2 only, and column 3 lists lines that are in both files. For example,

$ sort team > ts
$ sort myteam > ms
$ comm ts ms
                Chris Mason Junior 9.6
                Jean Smith Senior 7.8
                Leslie Brown Sophomore 18.2
                Pat Jones Freshman 11.4
Pat Jones Junior 3.2
        Pat Jones Junior 5.2
        Stacy Fox Senior 1.6

Selecting any of the flags 1, 2, or 3 allows you to print only those columns you are interested in.

The sort, diff, comm, and cmp commands (and the qsort() function) are representative of what is available in UNIX for sorting and cosequential processing. As we have said, they have many useful options that we don't cover and that you will be interested in reading about.


SUMMARY

In the first half of this chapter, we develop a model for cosequential processing and apply it to two common problems: updating a general ledger and merge sorting. In the second half of the chapter we identify the most important factors influencing performance in merge-sorting operations and suggest some strategies for achieving good performance.

The cosequential processing model can be applied to problems that involve operations such as matching and merging (and combinations of these) on two or more sorted input files. We begin the chapter by illustrating the use of the model to perform a simple match of the elements common to two lists, and a merge of two lists. The procedures we develop to perform these two operations embody all the basic elements of the model.

In its most complete form, the model depends on certain assumptions about the data in the input files. We enumerate these assumptions in our formal description of the model. Given these assumptions, we can describe the processing components of the model.

The real value of the cosequential model is that it can be adapted to more substantial problems than simple matches or merges without too much alteration. We illustrate this by using the model to design a general ledger accounting program.

All of our early sample applications of the model involve only two input files. We next adapt the model to a multiway merge to show how the model might be extended to deal with more than two input lists. The problem of finding the minimum key value during each pass through the main loop becomes more complex as the number of input files increases. Its solution involves replacing the three-way selection statement with either a multiway selection or a procedure that keeps current keys in a list structure that can be processed more conveniently. We see that the application of the model to k-way merging performs well for small values of k, but that for values of k greater than eight or so, it is more efficient to find the minimum key value by means of a selection tree.
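To make the simpler of these two options concrete, here is a minimal C sketch, not taken from the chapter's figures, of the linear scan for the lowest current key among k input lists. It assumes only that current_key[i] holds the current key (as a string) from list i.

    /* Sketch only: linear scan for the list whose current key is lowest. */
    #include <string.h>

    int min_key_index(char *current_key[], int k)
    {
        int i, lowest = 0;

        for (i = 1; i < k; i++)
            if (strcmp(current_key[i], current_key[lowest]) < 0)
                lowest = i;     /* list i has a smaller current key */

        return lowest;          /* caller outputs this key, then reads the
                                   next key from the chosen list           */
    }

Each call costs k - 1 comparisons, which is why this simple approach is adequate only for small k; a selection tree brings the cost per output key down to roughly log2 k comparisons.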

After discussing multiway merging, we shift our attention to a problem that we encountered in a previous chapter: how to sort large files. We begin with files that are small enough to fit into RAM and introduce an efficient sorting algorithm, heapsort, which makes it possible to overlap I/O with the sorting process.

The generally accepted solution when a file is too large for in-RAM sorts is some form of merge sort. A merge sort involves two steps:

1. Break the file into two or more sorted subfiles, or runs, using internal sorting methods; and
2. Merge the runs.

Ideally, we would like to keep every run in a separate file so we can perform the merge step with one pass through the runs. Unfortunately, practical considerations sometimes make it difficult to do this effectively.

The critical elements when merging many files on disk are seek and rotational delay times and transmission times. These times depend largely on two interrelated factors: the number of different runs being merged and the amount of internal buffer space available to hold parts of the runs. We can reduce seek and rotational delay times in two ways:

    By performing the merge in more than one step; and/or
    By increasing the sizes of the initial sorted runs.

In both cases, the order of each merge step can be reduced, increasing the sizes of the internal buffers and allowing more data to be processed per seek.

Looking at the first alternative, we see how performing the merge in several steps can decrease the number of seeks dramatically, though it also means that we need to read through the data more than once (increasing total data transmission time).

The second alternative is realized through use of an algorithm called replacement selection. Replacement selection, which can be implemented using the selection tree mentioned earlier, involves selecting the key from memory that has the lowest value, outputting that key, and replacing it with a new key from the input list.
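The following is a bare-bones C sketch of that cycle, offered only as an illustration; it uses a simple array scan rather than the selection tree, and the helper routines next_input(), output(), and start_new_run() are hypothetical names, not part of the chapter's programs. The LOW_VALUE and HIGH_VALUE sentinels play the same role they play elsewhere in the chapter.

    #define LOW_VALUE  (-2147483647L)
    #define HIGH_VALUE ( 2147483647L)

    extern long next_input(void);      /* next key, or HIGH_VALUE at end of input */
    extern void output(long key);      /* append a key to the current run         */
    extern void start_new_run(void);   /* close the current run, start a new one  */

    /* keys[] holds the P keys currently in memory; held_over[] is a work
       array of the same size, initially all zero.                         */
    void replacement_selection(long keys[], int held_over[], int P)
    {
        long last_output = LOW_VALUE;
        int  i, pick, remaining = P;   /* slots whose input is not exhausted */

        while (remaining > 0) {
            pick = -1;                 /* lowest key usable in the current run */
            for (i = 0; i < P; i++)
                if (!held_over[i] && keys[i] != HIGH_VALUE &&
                    (pick < 0 || keys[i] < keys[pick]))
                    pick = i;

            if (pick < 0) {            /* nothing fits: close out this run     */
                start_new_run();
                for (i = 0; i < P; i++) held_over[i] = 0;
                last_output = LOW_VALUE;
                continue;
            }

            output(keys[pick]);        /* select the lowest key and write it   */
            last_output = keys[pick];

            keys[pick] = next_input(); /* replace it with a new key from input */
            if (keys[pick] == HIGH_VALUE)
                remaining--;           /* no more input for this slot          */
            else if (keys[pick] < last_output)
                held_over[pick] = 1;   /* too small: hold it for the next run  */
        }
    }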
With randomly organized files, replacement selection can be expected to produce runs twice as long as the number of internal storage locations available for performing the algorithms. Although this represents a major step toward decreasing the number of runs needing to be merged, it carries with it an additional cost. The need for a large buffer for performing the replacement selection operation leaves relatively little space for the I/O buffer, which means that many more seeks are involved in forming the runs than are needed when the sort step uses an in-RAM sort. If we compare the total number of seeks required by the two different approaches, we find that replacement selection can actually require more seeks; it performs substantially better only when there is a great deal of order in the initial file.

Next we turn our attention to sorting files on tapes. Since file I/O with tapes does not involve seeking, the problems and solutions associated with tape sorting can differ from those associated with disk sorting, although the fundamental goal of working with fewer, longer runs remains. With tape sorting, the primary measure of performance is the number of times each record must be transmitted. (Other factors, such as tape rewind time, can also be important, but we do not consider them here.)

Since tapes do not require seeking, replacement selection is almost always a good choice for creating initial runs. Since the number of drives available to hold run files is limited, the next question is how to distribute the files on the tapes. In most cases, it is necessary to put several runs on each of several tapes, reserving one or more other tapes for the results. This generally leads to merges of several steps, with the total number of runs being decreased after each merge step. Two approaches to doing this are balanced merges and multiphase merges. In a k-way balanced merge, all input tapes contain approximately the same number of runs, there are the same number of output tapes as there are input tapes, and the input tapes are read through entirely during each step. The number of runs is decreased by a factor of k after each step.

A multiphase merge (such as a polyphase merge or a cascade merge) requires that the runs initially be distributed unevenly among all but one of the available tapes. This increases the order of the merge and as a result can decrease the number of times each record has to be read. It turns out that the initial distribution of runs among the first set of input tapes has a major effect on the number of times each record has to be read.

Next, we discuss briefly the existence of sort-merge utilities, which are available on most large systems and can be very flexible and effective. We conclude the chapter with a listing of UNIX utilities used for sorting and cosequential processing.

KEY TERMS

Balanced merge. A multistep merging technique that uses the same number of input devices as output devices. A two-way balanced merge uses two input tapes, each with approximately the same number of runs on it, and produces two output tapes, each with approximately half as many runs as the input tapes. A balanced merge is suitable for merge sorting with tapes, though it is not generally the best method (see multiphase merging).

cmp. A UNIX utility for determining whether two files are identical. Given two files, it reports the first byte where the two files differ, if they differ.

comm. A UNIX utility for determining what lines two files have in common. Given two files, it reports the lines they have in common, the lines that are in the first file and not in the second, and the lines that are in the second file and not in the first.

Cosequential operations. Operations applied to problems that involve the performance of union, intersection, and more complex set operations on two or more sorted input files to produce one or more output files built from some combination of the elements of the input files. Cosequential operations commonly occur in matching, merging, and file-updating problems.

diff. A UNIX utility for determining all the lines that differ between two files. It reports the lines that need to be added to the first file to make it like the second, the lines that need to be deleted from the second file to make it like the first, and the lines that need to be changed in the first file to make it like the second.

heapsort. A sorting algorithm especially well suited for sorting large files that fit in RAM because its execution can overlap with I/O. A variation of heapsort is used to obtain longer runs in the replacement selection algorithm.

HIGH_VALUE. A value used in the cosequential model that is greater than any possible key value. By assigning HIGH_VALUE as the current key value for files for which an end-of-file condition has been encountered, extra logic for dealing with end-of-file conditions can be simplified.

k-way merge. A merge in which k input files are merged to produce one output file.

LOW_VALUE. A value used in the cosequential model that is less than any possible key value. By assigning LOW_VALUE as the previous key value during initialization, the need for certain other special start-up code is eliminated.

Match. The process of forming a sorted output file consisting of all the elements common to two or more sorted input files.

Merge. The process of forming a sorted output file that consists of the union of the elements from two or more sorted input files.

Multiphase merge. A multistep merge in which the initial distribution of runs is such that at least the initial merge is a (J - 1)-way merge (J is the number of available tape drives), and in which the distribution of runs across the tapes is such that the merge performs efficiently at every step. (See polyphase merge.)

Multistep merge. A merge in which not all runs are merged in one step. Rather, several sets of runs are merged separately, each set producing one long run consisting of the records from all of its runs. These new, longer sets are then merged, either all together or in several sets. After each step, the number of runs is decreased and the length of the runs is increased. The output of the final step is a single run consisting of the entire file. (Be careful not to confuse our use of the term multistep merge with multiphase merge.) Although a multistep merge is theoretically more time-consuming than is a single-step merge, it can involve much less seeking when performed on a disk, and it may be the only reasonable way to perform a merge on tape if the number of tape drives is limited.

Order of a merge. The number of different files, or runs, being merged. For example, 100 is the order of a 100-way merge.

Polyphase merge. A multiphase merge in which, ideally, the merge order is maximized at every step.

qsort. A general-purpose UNIX library routine for sorting files that employs a user-defined comparison function.

Replacement selection. A method of creating initial runs based on the idea of always selecting the record from memory whose key has the lowest value, outputting that record, and then replacing it in memory with a new record from the input list. When new records are brought in whose keys are greater than those of the most recently output records, they eventually become part of the run being created. When new records have keys that are less than those of the most recently output records, they are held over for the next run. Replacement selection generally produces runs that are substantially longer than runs that can be created by in-RAM sorts, and hence can help improve performance in merge sorting. When using replacement selection with merge sorts on disk, however, one must be careful that the extra seeking required for replacement selection does not outweigh the benefits of having longer runs to merge.

Run. A sorted subset of a file resulting from the sort step of a sort merge or one of the steps of a multistep merge.

Selection tree. A binary tree in which each higher-level node represents the winner of the comparison between the two descendent keys. The minimum (or maximum) value in a selection tree is always at the root node, making the selection tree a good data structure for merging several lists. It is also a key structure in replacement selection algorithms, which can be used for producing long runs for merge sorts. (Tournament sort, an internal sort, is also based on the use of a selection tree.)

Sequence checking. Checking that records in a file are in the expected order. It is recommended that all files used in a cosequential operation be sequence checked.

sort. A UNIX utility for sorting and merging files.

Synchronization loop. The main loop in the cosequential processing model. A primary feature of the model is to do all synchronization within a single loop, rather than in multiple nested loops. A second objective is to keep the main synchronization loop as simple as possible. This is done by restricting the operations that occur within the loop to those that involve current keys, and by relegating as much special logic as possible (such as error checking and end-of-file checking) to subprocedures.

Theorem (Knuth). It is difficult to decide which merge pattern is best in a given situation.

EXERCISES

1. Write an output procedure to go with the procedures described in section 7.1 for doing cosequential matching. As a defensive measure, it is a good idea to have the output procedure do sequence checking in the same manner as the input procedure does.

2. Consider the cosequential initialization routine in Fig. 7.4. If PREV_1 and PREV_2 were not set to LOW_VALUE in this routine, how would input() have to be changed? How would this affect the adaptability of input() for use in other cosequential processing algorithms?

3. Consider the cosequential merge procedures described in section 7.1. Comment on how they handle the following situations. If they do not correctly handle a situation, indicate how they might be altered to do so.
a. List 1 empty and List 2 not empty
b. List 1 not empty and List 2 empty
c. List 1 empty and List 2 empty

4. In the ledger procedure example in section 7.2, modify the procedure so it also updates the ledger file with the new account balances for the month.

5. Use the k-way merge example as the basis for a procedure that is a k-way match.

6. Figure 7.17 shows a loop for doing a k-way merge, assuming that there are no duplicate names. If duplicate names are allowed, one could add to the procedure a facility for keeping a list of subscripts of duplicate lowest names. Alter the procedure to do this.

7. In section 7.3, two methods are presented for choosing the lowest of k keys at each step in a k-way merge: a linear search and use of a selection tree. Compare the performances of the two approaches in terms of numbers of comparisons for k = 2, 4, 8, 16, 32, and 100. Why do you think the linear approach is recommended for values of k less than 8?

8. Suppose you have 8 megabytes of RAM available for sorting the 800,000-record file described in section 7.5.1.
a. How long does it take to sort the file using the merge sort algorithm described in section 7.5.1?
b. How long does it take to sort the file using the keysort algorithm described in Chapter 5?
c. Why will keysort not work if there is one megabyte of RAM available for the sorting phase?

9. How much seek time is required to perform a one-step merge such as the one described in section 7.5 if the time for an average seek is 50 msec and the amount of available internal buffer space is 500 K? 100 K?

10. Performance in sorting is often measured in terms of the number of comparisons. Explain why the number of comparisons is not adequate for measuring performance in sorting large files.

11. In our computations involving the merge sorts, we made the simplifying assumption that only one seek and one rotational delay are required for any single sequential access. If this were not the case, a great deal more time would be required to perform I/O. For example, for the 80-megabyte file used in the example in section 7.5.1, for the input step of the sort phase ("reading all records into RAM for sorting and forming runs"), each individual run could require many accesses. Now let's assume that the extent size for our hypothetical drive is 20,000 bytes (approximately one track), and that all files are stored in track-sized blocks that must be accessed separately (one seek and one rotational delay per block).
a. How many seeks does step 1 now require?
b. How long do steps 1, 2, 3, and 4 now take?
c. How does increasing the file size by a factor of 10 now affect the total time required for the merge sort?

12. Derive two formulas for the number of seeks required to perform the merge step of a one-step k-way sort merge of a file with r records divided into k runs, where the amount of available RAM is equivalent to M records. If an internal sort is used for the sort phase, you can assume that the length of each run is M, but if replacement selection is used, you can assume that the length of each run is about 2M. Why?

13. Assume a quiet system with four separately addressable disk drives, each of which is able to hold several hundred megabytes. Assume that the 80-megabyte file described in section 7.5 is already on one of the drives. Design a sorting procedure for this sample file that uses the separate drives to minimize the amount of seeking required. Assume that the final sorted file is written off to tape and that buffering for this tape output is handled invisibly by the operating system. Is there any advantage to be gained by using replacement selection?

14. Use replacement selection to produce runs from the following files, assuming P = 4.
a. 23 29 5 17 9 55 41 3 51 33 18 24 11 47
b. 3 5 9 11 17 18 23 24 29 33 41 47 51 55
c. 55 51 47 41 33 29 24 23 18 17 11 9 5 3

15. Suppose you have a disk drive that has 10 read/write heads per surface, so 10 cylinders may be accessed at any one time without having to move the actuator arm. If you could control the physical organization of runs stored on disk, how might you be able to exploit this arrangement in performing a sort merge?

16. Assume we need to merge 14 runs on four tape drives. Develop merge patterns starting from each of these initial distributions:
a. 8-4-2
b. 7-4-3
c. 6-5-3
d. 5-5-4

17. A four-tape polyphase merge is to be performed to sort the list 24 36 13 25 16 45 29 38 23 50 22 19 43 30 11 27 47. The original list is on tape 4. Initial runs are of length 1. After initial sorting, tapes 1, 2, and 3 contain the following runs (a slash separates runs):
Tape 1: 24 / 36 / 13 / 25
Tape 2: 16 / 45 / 29 / 38 / 23 / 50
Tape 3: 22 / 19 / 43 / 30 / 11 / 27 / 47
a. Show the contents of tape 4 after one merge phase.
b. Show the contents of all four tapes after the second and fourth phases.
c. Comment on the appropriateness of the original 4-6-7 distribution for performing a polyphase merge.

18. Obtain a copy of the manual for one or more commercially available sort-merge packages. Identify the different kinds of choices available to users of the packages. Relate the options to the performance issues discussed in this chapter.

Programming Exercises

19. Implement the cosequential match procedures described in section 7.1 in C or Pascal.

20. Implement the cosequential merge procedures described in section 7.1 in C or Pascal.

21. Implement a complete program corresponding to the solution to the general ledger problem presented in section 7.2.

22. Design and implement a program to do the following:
a. Examine the contents of two sorted files M1 and M2.
b. Produce a third file COMMON containing a copy of records from the original two files that are identical.
c. Produce a fourth file DIFF that contains all records from the two files that are not identical.

FURTHER READINGS

The subject matter treated in this chapter can be divided into two separate topics: the presentation of a general model for cosequential processing, and the discussion of external merging procedures on tape and disk. Although most file processing texts discuss cosequential processing, they usually do it in the context of specific applications, rather than presenting a general model that can be adapted to a variety of applications. We found this useful and flexible model through Dr. James VanDoren, who developed this form of the model himself for presentation in the file structures course that he teaches. We are not aware of any discussion of the cosequential model elsewhere in the literature.

Quite a bit of work has been done toward developing simple and effective algorithms to do sequential file updating, which is an important instance of cosequential processing. The results deal with some of the same problems the cosequential model deals with, and some of the solutions are similar. See Levy (1982) and Dwyer (1981) for more.

Unlike cosequential processing, external sorting is a topic that is covered widely in the literature. The most complete discussion of the subject, by far, is in Knuth (1973b). Students interested in the topic of external sorting must, at some point, familiarize themselves with Knuth's definitive summary of the subject. Knuth also describes replacement selection, as evidenced by our quoting from his book in this chapter.

Salzberg (1987) provides an excellent analytical treatment of external sorting, and Salzberg (1990) describes an approach that takes advantage of replacement selection, parallelism, distributed computing, and large amounts of memory. Lorin (1975) spends several chapters on sort-merge techniques. Bradley (1982) provides a good treatment of replacement selection and multiphase merging, including some interesting comparisons of processing time on different devices. Tremblay and Sorenson (1984) and Loomis (1983) also have chapters on external sorting.

Since the sorting of large files accounts for a large percentage of data processing time, most systems have sorting utilities available. IBM's DFSORT (described in IBM, 1985) is a flexible package for handling sorting and merging applications. A VAX sort utility is described in Digital (1984).

B-Trees and Other Tree-structured File Organizations

CHAPTER OBJECTIVES

Place the development of B-trees in the historical context of the problems they were designed to solve.

Look briefly at other tree structures that might be used on secondary storage, such as paged AVL trees.

Provide an understanding of the important properties possessed by B-trees, and show how these properties are especially well suited to secondary storage applications.

Describe fundamental operations on B-trees.

Introduce the notion of page buffering and virtual B-trees.

Describe variations of the fundamental B-tree algorithms, such as those used to build B* trees and B-trees with variable-length records.

CHAPTER OUTLINE

8.1 Introduction: The Invention of the B-Tree
8.2 Statement of the Problem
8.3 Binary Search Trees as a Solution
8.4 AVL Trees
8.5 Paged Binary Trees
8.6 The Problem with the Top-down Construction of Paged Trees
8.7 B-Trees: Working up from the Bottom
8.8 Splitting and Promoting
8.9 Algorithms for B-Tree Searching and Insertion
8.10 B-Tree Nomenclature
8.11 Formal Definition of B-Tree Properties
8.12 Worst-case Search Depth
8.13 Deletion, Redistribution, and Concatenation
    8.13.1 Redistribution
8.14 Redistribution during Insertion: A Way to Improve Storage Utilization
8.15 B* Trees
8.16 Buffering of Pages: Virtual B-Trees
    8.16.1 LRU Replacement
    8.16.2 Replacement Based on Page Height
    8.16.3 Importance of Virtual B-Trees
8.17 Placement of Information Associated with the Key
8.18 Variable-length Records and Keys
C Program to Insert Keys into a B-Tree
Pascal Program to Insert Keys into a B-Tree

8.1 Introduction: The Invention of the B-Tree

Computer science is a young discipline. As evidence of this youth, consider that at the start of 1970, after astronauts had twice travelled to the moon, B-trees did not yet exist. Today, only 15 years later, it is hard to think of a major, general-purpose file system that is not built around a B-tree design.

Douglas Comer, in his excellent survey article, "The Ubiquitous B-Tree" [1979], recounts the competition among computer manufacturers and independent research groups that developed in the late 1960s. The goal was the discovery of a general method for storing and retrieving data in large file systems that would provide rapid access to the data with minimal overhead cost. Among the competitors were R. Bayer and E. McCreight, who were working for Boeing Corporation at that time. In 1972 they published an article, "Organization and Maintenance of Large Ordered Indexes," which announced B-trees to the world. By 1979, when Comer published his survey article, B-trees had already become so widely used that Comer was able to state that "the B-tree is, de facto, the standard organization for indexes in a database system."

We have reprinted the first few paragraphs of the 1972 Bayer and McCreight article* because it so concisely describes the facets of the problem that B-trees were designed to solve: how to access and maintain efficiently an index that is too large to hold in memory. You will remember that this is the same problem that is left unresolved in Chapter 6, on simple index structures. It will be clear as you read Bayer and McCreight's introduction that their work goes straight to the heart of the issues we raise back in the indexing chapter.

    In this paper we consider the problem of organizing and maintaining an
    index for a dynamically changing random access file. By an index we mean
    a collection of index elements which are pairs (x, a) of fixed size physically
    adjacent data items, namely a key x and some associated information a.
    The key x identifies a unique element in the index, the associated
    information is typically a pointer to a record or a collection of records in a
    random access file. For this paper the associated information is of no
    further interest.

    We assume that the index itself is so voluminous that only rather small
    parts of it can be kept in main store at one time. Thus the bulk of the index
    must be kept on some backup store. The class of backup stores considered
    are pseudo random access devices which have rather long access or wait
    time as opposed to a true random access device like core store and a
    rather high data rate once the transmission of physically sequential data has
    been initiated. Typical pseudo random access devices are: fixed and
    moving head disks, drums, and data cells.

    Since the data file itself changes, it must be possible not only to search
    and to retrieve keys, but also to delete and to insert keys, more accurately
    index elements, economically. The index organization described in this
    paper allows retrieval, insertion, and deletion of keys in time proportional
    to log_k I or better, where I is the size of the index, and k is a device
    dependent natural number which describes the page size such that the
    performance of the maintenance and retrieval scheme becomes near
    optimal.

*From Acta Informatica, 1:173-189, 1972, Springer Verlag, New York. Reprinted with permission.

Exercises 17, 18, and 19 at the end of Chapter 6 introduced the notion of a paged index. Bayer and McCreight's statement that they have developed a scheme with retrieval time proportional to log_k I, where k is related to the page size, is very significant. As we will see, the use of a B-tree with a page size of 64 to index a file with a million records results in being able to find the key for any record in no more than four seeks to the disk. A binary search on the same file can require as many as 20 seeks. Moreover, we are talking about getting this kind of performance from a system that requires only minimal overhead as keys are inserted and deleted.

Before looking in detail at Bayer and McCreight's solution, let's first return to a more careful look at the problem, picking up where we left off in Chapter 6. We will also look at some of the data and file structures that were routinely used to attack the problem before the invention of B-trees. Given this background, it will be easier to appreciate the contribution made by Bayer and McCreight's work.

One last matter before we begin: Why the name B-tree? Comer (1979) provides this footnote:

    The origin of "B-tree" has never been explained by [Bayer and McCreight].
    As we shall see, "balanced," "broad," or "bushy" might apply. Others
    suggest that the "B" stands for Boeing. Because of his contributions,
    however, it seems appropriate to think of B-trees as "Bayer"-trees.

8.2 Statement of the Problem

The fundamental problem with keeping an index on secondary storage is, of course, that accessing secondary storage is slow. This fundamental problem can be broken down into two more specific problems:

Binary searching requires too many seeks. Searching for a key on a disk often involves seeking to different disk tracks. Since seeks are expensive, a search that has to look in more than three or four locations before finding the key often requires more time than is desirable. If we are using a binary search, four seeks is only enough to differentiate between 15 items. An average of about 9.5 seeks is required to find a key in an index of 1,000 items using a binary search. We need to find a way to home in on a key using fewer seeks.

It can be very expensive to keep the index in sorted order so we can perform a binary search. As we saw in Chapter 6, if inserting a key involves moving a large number of the other keys in the index, index maintenance is very nearly impractical on secondary storage for indexes consisting of only a few hundred keys, much less thousands of keys. We need to find a way to make insertions and deletions that have only local effects in the index, rather than requiring massive reorganization.

These were the two critical problems that confronted Bayer and McCreight in 1970. They serve as guideposts for steering our discussion of the use of tree structures for secondary storage retrieval.

FIGURE 8.1 Sorted list of keys: AX CL DE FB FT HN JD KF NR PA RF SD TK WS YJ

8.3 Binary Search Trees as a Solution


Let's begin by addressing the second of these two problems, looking at the cost of keeping a list in sorted order so we can perform binary searches. Given the sorted list in Fig. 8.1, we can express the list as a binary search tree, as shown in Fig. 8.2.

FIGURE 8.2 Binary search tree representation of the list of keys.

Using elementary data structure techniques, it is a simple matter to create nodes that contain right and left link fields so the binary search tree can be constructed as a linked structure. Figure 8.3 illustrates a linked representation of the first two levels of the binary search tree shown in Fig. 8.2. In each node, the left and right links point to the left and right children of the node.

FIGURE 8.3 Linked representation of part of a binary search tree.

If each node is treated as a fixed-length record in which the link fields contain relative record numbers (RRNs) pointing to other nodes, then it is possible to place such a tree structure on secondary storage. Figure 8.4 illustrates the contents of the 15 records that would be required to form the binary tree depicted in Fig. 8.2.
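As a concrete illustration (this structure is not taken from the text's figures), such a fixed-length node record could be declared in C as follows, using the two-character keys of Fig. 8.1 and the value -1 to mark an empty link:

    /* Sketch only: a binary search tree node stored as a fixed-length
       record, with RRN link fields in place of memory pointers.        */
    #define NO_CHILD (-1)

    struct TREENODE {
        char  key[3];     /* two-character key plus terminating null    */
        short left;       /* RRN of left child, or NO_CHILD             */
        short right;      /* RRN of right child, or NO_CHILD            */
    };

Reading the node stored at RRN n is then just a matter of seeking to byte offset n * sizeof(struct TREENODE) in the file and reading one record.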

of the link fields in the file are empty because they


with no children. l^j^xMXl^^fjlod^jaeedio contain some
special character, suchas f, to indicat e t hat the search through the tree h as

Note

that over half

are leaf nodes,

FIGURE 8.2 Binary search tree representation of the

KF.

list

of keys.

338

B-TREES AND OTHER TREE-STRUCTURED

FILE

ORGANIZATIONS

FIGURE 8.3 Linked representation of part of a binary search

tree.

reached the leaf level and that there are no more nodes on the search

We

leave the fields blank in this figure

tomake them more

illustrating the potentially substantial cost in

incurred by this kind of linked representation of

But

to focus

important

new

have to sort the


records in the

on the

costs

a tree.

and not the advantages

file

to be able to

perform

binary search.

illustrated in Fig. 8.4 appear in

random

to

is

capability that this tree structure gives us:

file

noticeable,

terms of space utilization


miss the

We

no longer
Note that the

rather than sorted

FIGURE 8.4 Record contents


[ill

for a linked representation of

the binary tree

Key

Left

Right

child

child

Left

Key

FB

HN

JD

KF

RF

10

CL

SD

11

NR

AX

12

DE

YJ

13

WS

PA

14

TK

FT

Right

child child

in Fig.

8.2.

339

BINARY SEARCH TREES AS A SOLUTION

FIGURE 8.5 Binary search tree with LV added.

The sequence of the

order.

structure of the tree;


in the link fields.

that if

we add

records in the

new key

has no necessary relation to the

positive consequence that follows

to the

such

file,

to create a tree that

we would get with

file

the information about the logical structure

The very

appropriate leaf node


as

all

as

L V, we

need only link

provides search performance that

a binary search on a sorted

list.

The

tree

with

carried

is

from
is

LV

this is

it

to the

as

good

added

is

illustrated in Fig. 8.5.

Search performance on
balanced state.

By

a leaf

does not differ

level.

For the tree in Fig.

to complete balance,

this tree

we mean

is

still

good because

the tree

is

in a

of the shortest path to


from the height of the longest path by more than one

balanced

where

that the height

8.5, this difference


all

the paths

of one

from root

is

we

as close as

can get

to leaf are exactly the

same

length.

Consider what happens if we go on to enter the following eight keys to


the tree in the sequence in which they appear:

NP MB TM LA UF ND TS NK
Just searching

down through

the tree and adding each key at

position in the search tree results in the tree

The

tree

is

now

shown

its

correct

in Fig. 8.6.

out of balance. This is a typical result for trees built by


treeasTiey occur without rearrangement. The

placing keys ""into Hie


resulting disparity

between the length of various search paths

is

undesirable

any binary search tree, but is especially troublesome if the nodes of the
tree are being kept on secondary storage. There are now keys that require
seven, eight, or nine seeks for retrieval. A binary search on a sorted list of
these 24 keys requires only five seeks in the worst case. Although the use of
a tree lets us avoid sorting, we are paying for this convenience in terms of
extra seeks at retrieval time. For trees with hundreds of keys, in which an

in

out-of-balance search path might extend to 30, 40, or


is

too high.

more

seeks, this price

340

B-TREES AND OTHER TREE-STRUCTURED

HN

CL

/ \DE
AX

FILE

ORGANIZATIONS

PA

/ \JD
FT

/
NR
/
LV

WS

RF

/X
TK

YT

\TM

'\

NP

UF

/
MB
\ND

TS

NK
FIGURE 8.6 Binary search tree showing the effect

8.4

of

added

keys.

AVL Trees
Earlier we said that there is no necessary relationship between the order in which keys are entered and the structure of the tree. We stress the word necessary because it is clear that order of entry is, in fact, important in determining the structure of the sample tree illustrated in Fig. 8.6. The reason for this sensitivity to the order of entry is that, so far, we have just been linking the newest nodes at the leaf levels of the tree. This approach can result in some very undesirable tree organizations. Suppose, for example, that our keys consist of the letters A-G, and that we receive these keys in alphabetical order. Linking the nodes together as we receive them produces a degenerate tree that is, in fact, nothing more than a linked list, as illustrated in Fig. 8.7.

FIGURE 8.7 A degenerate tree.

The solution to this problem is somehow to reorganize the nodes of the tree as we receive new keys, maintaining a near optimal tree structure. One elegant method for handling such reorganization results in a class of trees known as AVL trees, in honor of the pair of Russian mathematicians, G. M. Adel'son-Vel'skii and E. M. Landis, who first defined them. An AVL tree is a height-balanced tree. This means that there is a limit placed on the amount of difference allowed between the heights of any two subtrees sharing a common root. In an AVL tree the maximum allowable difference is one. An AVL tree is therefore called a height-balanced 1-tree or HB(1) tree. It is a member of a more general class of height-balanced trees known as HB(k) trees, which are permitted to be k levels out of balance.

The trees illustrated in Fig. 8.8 have the AVL, or HB(1), property. Note that no two subtrees of any root differ by more than one level. The trees in Fig. 8.9 are not AVL trees. In each of these trees, the root of the subtree that is not in balance is marked with an X.

FIGURE 8.8 AVL trees.

FIGURE 8.9 Trees that are not AVL trees.

The two features that make AVL trees important are as follows:

By setting a maximum allowable difference in the height of any two subtrees, AVL trees guarantee a certain minimum level of performance in searching; and

Maintaining a tree in AVL form as new nodes are inserted involves the use of one of a set of four possible rotations. Each of the rotations is confined to a single, local area of the tree. The most complex of the rotations requires only five pointer reassignments.

AVL trees are an important class of data structure. The operations used to build and maintain AVL trees are described in Knuth (1973b), Standish (1980), and elsewhere. AVL trees are not themselves directly applicable to most file structure problems because, like all strictly binary trees, they have too many levels; they are too deep. However, in the context of our general discussion of the problem of accessing and maintaining indexes that are too large to fit in memory, AVL trees are interesting because they suggest that it is possible to define procedures that maintain height balance.

The fact that an AVL tree is height-balanced guarantees that search performance approximates that of a completely balanced tree. For example, the completely balanced form of a tree made up from the input keys

B C G E F D A

is illustrated in Fig. 8.10, and the AVL tree resulting from the same input keys, arriving in the same sequence, is illustrated in Fig. 8.11.

FIGURE 8.10 A completely balanced search tree.

FIGURE 8.11 A search tree constructed using AVL procedures.

For a completely balanced tree, the worst-case search to find a key, given N possible keys, looks at log_2 (N + 1) levels of the tree. For an AVL tree, the worst-case search could look at 1.44 log_2 (N + 2) levels. So, given 1,000,000 keys, a completely balanced tree requires seeking to 20 levels for some of the keys, but never to 21 levels. If the tree is an AVL tree, the maximum number of levels increases to only 28. This is a very interesting result, given that the AVL procedures guarantee that a single reorganization requires no more than five pointer reassignments. Empirical studies by VanDoren and Gray (1974), among others, have shown that such local reorganizations are required for approximately every other insertion into the tree and for approximately every fourth deletion. So height balancing using AVL methods guarantees that we will obtain a reasonable approximation to optimal binary tree performance at a cost that is acceptable in most applications using primary, random-access memory.

When we are using secondary storage, a procedure that requires more than five or six seeks to find a key is less than desirable; 20 or 28 seeks is unacceptable. Returning to the two problems that we identified earlier in this chapter:

Binary searching requires too many seeks; and

Keeping an index in sorted order is expensive,

we can see that height-balanced trees provide an acceptable solution to the second problem. Now we need to turn our attention to the first problem.

8.5 Paged Binary Trees


Once again we are confronting what is perhaps the most critical feature of secondary storage devices: It takes a relatively long time to seek to a specific location, but once the read head is positioned and ready, reading or writing a stream of contiguous bytes proceeds rapidly. This combination of slow seek and fast data transfer leads naturally to the notion of paging. In a paged system, you do not incur the cost of a disk seek just to get a few bytes. Instead, once you have taken the time to seek to an area of the disk, you read in an entire page from the file. This page might consist of a great many individual records. If the next bit of information you need from the disk is in the page that was just read in, you have saved the cost of a disk access.

Paging, then, is a potential solution to our searching problem. By dividing a binary tree into pages and then storing each page in a block of contiguous locations on disk, we should be able to reduce the number of seeks associated with any search. Figure 8.12 illustrates such a paged tree. In this tree we are able to locate any one of the 63 nodes in the tree with no more than two disk accesses. Note that every page holds seven nodes and can branch to eight new pages. If we extend the tree to one additional level of paging, we add 64 new pages; we can then find any one of 511 nodes in only three seeks. Adding yet another level of paging lets us find any one of 4,095 nodes in only four seeks. A binary search of a list of 4,095 items can take as many as 12 seeks.

FIGURE 8.12 Paged binary tree.

Clearly, breaking the tree into pages has the potential to result in faster searching on secondary storage, providing us with much faster retrieval than any other form of keyed access that we have considered up to this point. Moreover, our use of a page size of seven in Fig. 8.12 is dictated more by the constraints of the printed page than by anything having to do with secondary storage devices. A more typical example of a page size might be 8 kilobytes capable of holding 511 key/reference field pairs. Given this page size, and assuming that each page contains a completely balanced, full tree, and that the pages themselves are organized as a completely balanced, full tree, it is then possible to find any one of 134,217,727 keys with only three seeks. That is the kind of performance we are looking for. Note that, while the number of seeks required for a worst-case search of a completely full, balanced binary tree is

    log_2 (N + 1),

where N is the number of keys in the tree, the number of seeks required for the paged versions of a completely full, balanced tree is

    log_(k+1) (N + 1),

where N is, once again, the number of keys. The new variable, k, is the number of keys held in a single page. The second formula is actually a generalization of the first, since the number of keys in a page of a purely binary tree is 1.

It is the logarithmic effect of the page size that makes the impact of paging so dramatic:

    log_2 (134,217,727 + 1) = 27 seeks
    log_(511+1) (134,217,727 + 1) = 3 seeks.
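The little C sketch below (not from the text) evaluates the same formula by counting levels directly, which avoids floating-point rounding; it reproduces the two numbers above.

    #include <stdio.h>

    /* Worst-case seeks for a completely full, balanced, paged tree:
       the smallest d such that (keys_per_page + 1)^d - 1 >= n_keys,
       i.e., the ceiling of log base (k+1) of (n_keys + 1).            */
    long worst_case_seeks(long n_keys, long keys_per_page)
    {
        long fanout = keys_per_page + 1;   /* each page branches k+1 ways */
        long reach  = 1;                   /* fanout^seeks so far          */
        long seeks  = 0;

        while (reach - 1 < n_keys) {       /* need one more level of pages */
            reach *= fanout;
            seeks++;
        }
        return seeks;
    }

    int main(void)
    {
        printf("%ld\n", worst_case_seeks(134217727L, 1L));    /* prints 27 */
        printf("%ld\n", worst_case_seeks(134217727L, 511L));  /* prints  3 */
        return 0;
    }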

The use of large pages does not come free. Every access to a page requires the transmission of a large amount of data, most of which is not used. This extra transmission time is well worth the cost, however, because it saves so many seeks, which are far more time-consuming than the extra transmissions. A much more serious problem, which we look at next, has to do with keeping the paged tree organized.

8.6 The Problem with the Top-down Construction of Paged Trees

Breaking a tree into pages is a strategy that is well suited to the physical characteristics of secondary storage devices such as disks. The problem, once we decide to implement a paged tree, is how to build it. If we have the entire set of keys in hand before the tree is built, the solution to the problem is relatively straightforward: We can sort the list of keys and build the tree from this sorted list. Most importantly, if we plan to start building the tree from the root, we know that the middle key in the sorted list of keys should be the root key within the root page of the tree. In short, we know where to begin and are assured that this beginning point will divide the set of keys in a balanced manner.

Unfortunately, the problem is much more complicated if we are receiving keys in random order and inserting them as soon as we receive them. Assume that we must build a paged tree as we receive the following sequence of single-letter keys:

C S D T A M P I B W N G U R K E H O L J Y Q Z F X V

We will build a paged binary tree that contains a maximum of three keys per page. As we insert the keys, we rotate them within a page as necessary to keep each page as balanced as possible. The resulting tree is illustrated in Fig. 8.13.

FIGURE 8.13 Paged tree constructed from keys arriving in random input sequence.

Evaluated in terms of the depth of the tree (measured in pages), this tree does not turn out too badly. (Consider, for example, what happens if the keys arrive in alphabetical order.)

Even though this tree is not dramatically misshapen, it clearly illustrates the difficulties inherent in building a paged binary tree from the top down. When you start from the root, the initial keys must, of necessity, go into the root. In this example at least two of these keys, C and D, are not keys that we want there. They are adjacent in sequence and tend toward the beginning of the total set of keys. Consequently, they force the tree out of balance.

Once the wrong keys are placed in the root of the tree (or in the root of any subtree further down the tree), what can you do about it? Unfortunately, there is no easy answer to this. We cannot simply rotate entire pages of the tree in the same way that we would rotate individual keys in an unpaged tree. If we rotate the tree so the initial root page moves down to the left, moving the C and D keys into a better position, then the S key is out of place. So we must break up the pages. This opens up a whole world of possibilities and difficulties. Breaking up the pages implies rearranging them to create new pages that are both internally balanced and well arranged relative to other pages. Try creating a page rearrangement algorithm for the simple, three-keys-per-page tree from Fig. 8.13. You will find it very difficult to create an algorithm that has only local effects, rearranging just a few pages. The tendency is for rearrangements and adjustments to spread out through a large part of the tree. This situation grows even more complex with larger page sizes.

So, although we have determined that the idea of collecting keys into pages is a very good one from the standpoint of reducing seeks to the disk, we have not yet found a way to collect the right keys. We are still confronting at least two unresolved questions:

How do we ensure that the keys in the root page turn out to be good separator keys, dividing up the set of other keys more or less evenly?

How do we avoid grouping keys, such as C, D, and S in our example, that should not share a page?

There is, in addition, a third question that we have not yet had to confront because of the small page size of our sample tree:

How can we guarantee that each of the pages contains at least some minimum number of keys? If we are working with a larger page size, such as 8,191 keys per page, we want to avoid situations in which a large number of pages each contains only a few dozen keys.

Bayer and McCreight's 1972 B-tree article provides a solution directed precisely toward these questions.

8.7 B-Trees: Working up from the Bottom

A number of the elegant, powerful ideas used in computer science have grown out of looking at a problem from a different viewpoint. B-trees are an example of this viewpoint-shift phenomenon.

The key insight required to make the leap from the kinds of trees we have been considering to a new solution, B-trees, is that we can choose to build trees upward from the bottom instead of downward from the top. So far, we have assumed the necessity of starting construction from the root as a given. Then, as we found that we had the wrong keys in the root, we tried to find ways to repair the problem with rearrangement algorithms. Bayer and McCreight recognized that the decision to work down from the root was, of itself, the problem. Rather than finding ways to undo a bad situation, they decided to avoid the difficulty altogether. With B-trees, you allow the root to emerge, rather than set it up and then find ways to change it.

8.8 Splitting and Promoting

In a B-tree, a page, or node, consists of an ordered sequence of keys and a set of pointers. There is no explicit tree within a node, as with the paged trees shown previously; there is just an ordered list of keys and some pointers. The number of pointers always exceeds the number of keys by one. The maximum number of pointers that can be stored in a node is called the order of the B-tree. For example, suppose we have an order-eight B-tree. Each page can hold at most seven keys and eight pointers. Our initial leaf of the tree might have a structure like that illustrated in Fig. 8.14 after the insertion of the letters

B C G E F D A

FIGURE 8.14 Initial leaf of a B-tree with a page size of seven.

The starred (*) fields are the pointer fields. In this leaf, as in any other leaf node, the value of all the pointers is set to indicate end-of-list. By definition, a leaf node has no children in the tree; consequently, the pointers do not lead to other pages in the tree. We assume that the pointers in the leaf pages usually contain an invalid pointer value, such as -1. Note, incidentally, that this leaf is also our root.

In a real-life application there also usually is some other information stored with the key, such as a reference to a record containing data that are associated with the key. Consequently, additional pointer fields in each page might actually lead to some associated data records that are stored elsewhere. But, paraphrasing Bayer and McCreight, for our present purposes, "the associated information is of no further interest."

Building the first page is easy enough. As we insert new keys, we use a single disk access to read the page into memory and, working in memory, insert the key into its place in the page. Since we are working in electronic memory, this insertion is relatively inexpensive compared to the cost of additional disk accesses.

But what happens as additional keys come in? Suppose we want to add the J key to the B-tree. When we try to insert the J key we find that our leaf is full. We then split the leaf into two leaves, distributing the keys as evenly as we can between the old leaf node and the new one, as shown in Fig. 8.15. Since we now have two leaves, we need to create a higher level in the tree to enable us to choose between the leaves when searching. In short, we need to create a new root. We do this by promoting a key that separates the leaves. In this case, we promote the E from the first position in the second leaf, as illustrated in Fig. 8.16.

FIGURE 8.15 Splitting the leaf to accommodate the new J key.

FIGURE 8.16 Promotion of the E key into a root node.

In this example we describe the splitting and the promotion operations in two steps to make the procedure as clear as possible; in practice, splitting and promotion are handled in a single operation.

Let's see how a larger B-tree grows given the key sequence that produces the paged binary tree illustrated in Fig. 8.13. The sequence is

C S D T A M P I B W N G U R K E H O L J Y Q Z F X V

We use an order-four B-tree (four pointer fields and three key fields per page), since this corresponds to the page size of the paged binary tree. Using such a small page size has the additional advantage of causing pages to split more frequently, providing us with more examples of splitting and promotion. We omit explicit indication of the pointer fields so we can fit a larger tree on the printed page.

Figure 8.17 illustrates the growth of the tree up to the point at which the root node is about to split. Figure 8.18 shows the tree after the splitting of the root node. The figure also shows how the tree continues to grow as the remaining keys in the sequence are added. We number each of the tree's pages (upper left corner of each node) so you can distinguish the newly added pages from the ones already in the tree.

FIGURE 8.17 Growth of a B-tree, part I. The tree grows to a point at which splitting of the root is imminent. The steps shown are: insertion of C, S, and D into the initial page; insertion of T, which forces the split and the promotion of S; A added without incident; insertion of M, which forces another split and the promotion of D; P, I, B, and W inserted into existing pages; and insertion of N, which causes another split, followed by the promotion of N, after which G, U, and R are added to existing pages.

FIGURE 8.18 Growth of a B-tree, part II. The root splits to add a new level; remaining keys are inserted. The steps shown are: insertion of K, which causes a split at the leaf level, followed by the promotion of K, which in turn causes the root to split, with N promoted to become the new root, after which E is added to a leaf; insertion of H, which causes a leaf to split, with H promoted and O, L, and J added; and insertion of Y and Q, which force two more leaf splits and promotions, after which the remaining letters are added.

Note that the tree is always perfectly balanced with regard to height; the path from the root to any leaf is the same length as the path from the root to any other leaf. Also note that the keys that are promoted upward into the tree are necessarily the kind of keys we want in a root: keys that are good separators. By working up from the leaf level, splitting and promoting as pages fill up, we overcome the problems that plague our earlier paged binary tree efforts.

8.9 Algorithms for B-Tree Searching and Insertion

Now that we have had a brief look at how B-trees work on paper, let's outline the structures and algorithms required to make them work in a computer. Most of the code that follows is pseudocode. C and Pascal implementations of the algorithms can be found at the end of this chapter.

Page Structure   We begin by defining one possible form for the page used by a B-tree. As you see later in this chapter and in the following chapter, there are many different ways to construct the page of a B-tree. We start with a simple one in which each key is a single character. If the maximum number of keys and children allowed on a page is MAXKEYS and MAXCHILDREN, respectively, then the following structures expressed in C and Pascal describe a page called PAGE.

In C:

    struct BTPAGE {
        short  KEYCOUNT;             /* number of keys stored in PAGE */
        char   KEY[MAXKEYS];         /* the actual keys               */
        short  CHILD[MAXKEYS+1];     /* RRNs of children              */
    } PAGE;

In Pascal:

    TYPE
        BTPAGE = RECORD
            KEYCOUNT : integer;
            KEY      : array [1..MAXKEYS] of char;
            CHILD    : array [1..MAXCHILDREN] of integer
        END;
    VAR
        PAGE : BTPAGE;

Given this page structure, the file containing the B-tree consists of a set of fixed-length records. Each record contains one page of the tree. Since the keys in the tree are single letters, this structure uses an array of characters to hold the keys. More typically, the key array is a vector of strings rather than just a vector of characters. The variable PAGE.KEYCOUNT is useful when the algorithms must determine whether a page is full or not. The PAGE.CHILD[] array contains the RRNs of PAGE's children, if there are any. When there is no descendent, the corresponding element of PAGE.CHILD[] is set to a nonaddress value, which we call NIL. Figure 8.19 shows two pages in a B-tree of order four.

FIGURE 8.19 A B-tree of order four. (a) An internal node and some leaf
nodes. (b) Nodes 2 and 3, as we might envision them in the structure PAGE:
the KEYCOUNT field, the KEY array, and the CHILD array, with the unused
CHILD entries set to NIL.
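Because each page occupies one fixed-length record, moving between an RRN and a page is just a matter of seeking to RRN times the record length and transferring that many bytes. The following is a minimal sketch of such helpers, assuming the C structure above and a B-tree file that has already been opened in binary read/write mode; the names btread(), btwrite(), and btfp are ours, used only for illustration, and are not the programs listed at the end of the chapter.

    #include <stdio.h>

    FILE *btfp;                        /* B-tree file, opened elsewhere */

    /* read the page stored at relative record number rrn into *page */
    int btread(short rrn, struct BTPAGE *page)
    {
        long addr = (long) rrn * sizeof(struct BTPAGE);
        fseek(btfp, addr, SEEK_SET);
        return fread(page, sizeof(struct BTPAGE), 1, btfp) == 1;
    }

    /* write *page to the record at relative record number rrn */
    int btwrite(short rrn, struct BTPAGE *page)
    {
        long addr = (long) rrn * sizeof(struct BTPAGE);
        fseek(btfp, addr, SEEK_SET);
        return fwrite(page, sizeof(struct BTPAGE), 1, btfp) == 1;
    }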

Searching

The first of the B-tree algorithms we examine is a tree-searching
procedure. Searching is a good place to begin because it is relatively simple,
yet it still illustrates the characteristic aspects of most B-tree algorithms:

They work in two stages, operating alternately on entire pages and then
within pages; and
They are recursive.

The searching procedure calls itself recursively, seeking to a page and
then searching through the page, looking for the key at successively lower
levels of the tree until it either finds the key or finds that it cannot descend
further, having reached beyond the leaf level. Figure 8.20 contains a
description of the searching procedure in pseudocode.

FUNCTION: search (RRN, KEY, FOUND_RRN, FOUND_POS)

    if RRN == NIL then        /* stopping condition for the recursion */
        return NOT FOUND
    else
        read page RRN into PAGE
        look through PAGE for KEY, setting POS equal to the
            position where KEY occurs or should occur.
        if KEY was found then
            FOUND_RRN = RRN   /* current RRN contains the key */
            FOUND_POS = POS
            return FOUND
        else                  /* follow CHILD reference to next level down */
            return (search (PAGE.CHILD[POS], KEY, FOUND_RRN, FOUND_POS))
        endif
    endif

end FUNCTION

FIGURE 8.20 Function search (RRN, KEY, FOUND_RRN, FOUND_POS) searches
recursively through the B-tree to find KEY. Each invocation searches the page
referenced by RRN. The arguments FOUND_RRN and FOUND_POS identify the page
and position of the key, if it is found. If search() finds the key, it returns FOUND.
If it goes beyond the leaf level without finding the key, it returns NOT FOUND.
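The pseudocode translates almost line for line into C. What follows is a minimal sketch of such a translation, assuming the BTPAGE structure defined earlier, the btread() helper sketched above, and NIL defined as -1; it is intended only to show the shape of the recursion, not to reproduce the programs listed at the end of the chapter.

    #define NIL      (-1)
    #define FOUND      1
    #define NOTFOUND   0

    int btsearch(short rrn, char key, short *found_rrn, short *found_pos)
    {
        struct BTPAGE page;
        short pos;

        if (rrn == NIL)                 /* stopping condition for the recursion */
            return NOTFOUND;

        btread(rrn, &page);             /* read page RRN into PAGE */

        /* scan the page for KEY, stopping at the position where KEY
           occurs or should occur (zero-origin indexing)              */
        for (pos = 0; pos < page.KEYCOUNT && page.KEY[pos] < key; pos++)
            ;

        if (pos < page.KEYCOUNT && page.KEY[pos] == key) {
            *found_rrn = rrn;           /* current RRN contains the key */
            *found_pos = pos;
            return FOUND;
        }

        /* follow the CHILD reference down to the next level */
        return btsearch(page.CHILD[pos], key, found_rrn, found_pos);
    }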

Let's work through the function by hand, searching for the key K in the
tree illustrated in Fig. 8.21. We begin by calling the function with the RRN
argument equal to the RRN of the root (2). This RRN is not NIL, so the
function reads the root into PAGE, then searches for K among the elements
of PAGE.KEY[]. The K is not found. Since K should go between D and N,
POS identifies position 1 in the root as the position of the pointer to
where the search should proceed.† So search() calls itself, this time using the
RRN stored in PAGE.CHILD[1]. This RRN is 3.

On the next call, search() reads the page containing the keys G, I, and M.
Once again the function searches for K among the keys in PAGE.KEY[].
Again, it is not found. This time PAGE.CHILD[2] indicates where the search
should proceed. Search() calls itself again, this time using the RRN stored in
PAGE.CHILD[2]. Since this call is from a leaf node, PAGE.CHILD[2] is NIL,
so the call to search() fails immediately. The value NOT FOUND is passed back
through the various levels of return statements until the program that
originally calls search() receives the information that the key is not found.

† We will use zero-origin indexing in these examples, so the leftmost key in a page is
PAGE.KEY[0], and the RRN of the leftmost child is PAGE.CHILD[0].

FIGURE 8.21 B-tree used for the search example.

Now let's use search() to look for M, which is in the tree. It follows the
same downward path that it did for K, but this time it finds the M in
position 2 of page 3. It stores the values 3 and 2 in FOUND_RRN and
FOUND_POS, respectively, indicating that M can be found in position 2 of
page 3, and returns the value FOUND.

Insertion, Splitting, and Promotion    There are two important observations
we can make about the insertion, splitting, and promotion process:

It begins with a search that proceeds all the way down to the leaf level; and
After finding the insertion location at the leaf level, the work of insertion,
splitting, and promotion proceeds upward from the bottom.

Consequently, we can conceive of our recursive procedure as having three
phases:

1. A search-page step that, as in the search() function, takes place before
   the recursive call;
2. The recursive call itself, which moves the operation down through
   the tree as it searches for either the key or the place to insert it; and
3. Insertion, splitting, and promotion logic that are executed after the
   recursive call, the action taking place on the upward return path
   following the recursive descent.

We need an example of an insertion so we can watch the insertion
procedure work through these phases. Let's insert the $ character into the
tree shown in the top half of Fig. 8.22, which contains all of the letters of
the alphabet. Since the ASCII character sequence places the $ character
ahead of the character A, the insertion is into the page with an RRN of 0.
This page and its parent are both already full, so the insertion causes
splitting and promotion that result in the tree shown in the bottom half of
Fig. 8.22.

FIGURE 8.22 The effect of adding $ to the tree constructed in Fig. 8.18.
(The figure shows the tree before and after the insertion of $.)

Now let's see how the insert() function performs this splitting and
promotion. Since the function operates recursively, it is important to
understand how the function arguments are used on successive calls. The
insert() function that we are about to describe uses four arguments:

CURRENT_RRN      The RRN of the B-tree page that is currently in use. As
                 the function recursively descends and ascends the tree,
                 all the RRNs on the search and insertion path are used.

KEY              The key that is to be inserted.

PROMO_KEY        Argument used only to carry back the return value. If
                 the insertion results in a split and the promotion of a
                 key, PROMO_KEY contains the promoted key on the
                 ascent back up the tree.

PROMO_R_CHILD    This is another return value argument. If there is a split,
                 higher levels of the calling sequence must not only insert
                 the promoted key value, but also the RRN of the new page
                 created in the split. When PROMO_KEY is inserted,
                 PROMO_R_CHILD is the right child pointer that is inserted
                 with it.

In addition to the values returned via the arguments PROMO_KEY and
PROMO_R_CHILD, insert() returns the value PROMOTION if it makes a
promotion, NO PROMOTION if the insertion is done and nothing is promoted,
and ERROR if the insertion cannot be made.

Figure 8.23 illustrates the way the values of these arguments change as
the insert() function is called and calls itself to perform the insertion of the
$ character. The figure makes a number of important points:

During the search step part of the insertion, only CURRENT_RRN
changes as the function calls itself, descending the tree. This search
path of successive calls includes every page of the tree that can be
affected by an insertion and by splitting and promotion on the
return path.

The search step ends when CURRENT_RRN is NIL. There are no
further levels to search.

As each recursive call returns, we execute the insertion and splitting
logic at that level. If the lower-level function returns the value
PROMOTION, then we have a key to insert at this level. Otherwise, we
have no work to do and can just return. For example, we are able to
insert at the highest (root) level of the tree without splitting, and we
therefore return NO PROMOTION from this level. That means that the
PROMO_KEY and PROMO_R_CHILD from this level have no meaning.

Given this introduction to the insert() function's operation, we are ready
to look at an algorithm for the function, shown in Fig. 8.24. We have already
described insert()'s arguments. There are several important local variables as
well:

PAGE        The page that insert() is currently examining.

NEWPAGE     New page created if a split occurs.

POS         The position in PAGE where the key occurs (if it is present)
            or would occur (if inserted).

P_B_RRN     The relative record number promoted from below up to this
            level. If a split occurs at the next lower level, P_B_RRN
            contains the relative record number of the new page created
            during the split. P_B_RRN is the right child that is inserted
            with P_B_KEY into PAGE.

P_B_KEY     The key promoted from below up to this level. This key,
            along with P_B_RRN, is inserted into PAGE.
FIGURE 8.23 Pattern of recursive calls to insert $ into the B-tree, as illustrated
in Fig. 8.22. (Each level of the figure shows the search step, the recursive call,
and the insertion and splitting logic, along with the KEY and CURRENT_RRN
arguments on the descent and the return value, PROMO_KEY, and
PROMO_R_CHILD on the ascent. The deepest call, where CURRENT_RRN is NIL,
returns PROMOTION with PROMO_KEY = $ and PROMO_R_CHILD = NIL; the
promotions of B and H follow on the way back up, with PROMO_R_CHILD
values of 11 and 12 for the newly created pages.)

FUNCTION: insert (CURRENT_RRN, KEY, PROMO_R_CHILD, PROMO_KEY)

    if CURRENT_RRN == NIL then      /* past bottom of tree */
        PROMO_KEY := KEY
        PROMO_R_CHILD := NIL
        return PROMOTION            /* promote original key and NIL */
    else
        read page at CURRENT_RRN into PAGE
        search for KEY in PAGE.
        let POS := the position where KEY occurs or should occur.
        if KEY found then
            issue error message indicating duplicate key
            return ERROR
        RETURN_VALUE := insert (PAGE.CHILD[POS], KEY, P_B_RRN, P_B_KEY)
        if RETURN_VALUE == NO PROMOTION or ERROR then
            return RETURN_VALUE
        elseif there is space in PAGE for P_B_KEY then
            insert P_B_KEY and P_B_RRN (promoted from below) in PAGE
            return NO PROMOTION
        else
            split (P_B_KEY, P_B_RRN, PAGE, PROMO_KEY, PROMO_R_CHILD, NEWPAGE)
            write PAGE to file at CURRENT_RRN
            write NEWPAGE to file at rrn PROMO_R_CHILD
            return PROMOTION        /* promoting PROMO_KEY and PROMO_R_CHILD */
        endif
    endif

end FUNCTION

FIGURE 8.24 Function insert (CURRENT_RRN, KEY, PROMO_R_CHILD, PROMO_KEY)
inserts a KEY in a B-tree. The insertion attempt starts at the page with relative record
number CURRENT_RRN. If this page is not a leaf page, the function calls itself recursively
until it finds KEY in a page or reaches a leaf. If it finds KEY, it issues an error message and
quits, returning ERROR. If there is space for KEY in PAGE, KEY is inserted. Otherwise,
PAGE is split. A split assigns the value of the middle key to PROMO_KEY and the relative
record number of the newly created page to PROMO_R_CHILD so insertion can continue on
the recursive ascent back up the tree. If a promotion does occur, insert() indicates this by
returning PROMOTION. Otherwise, it returns NO PROMOTION.

PROCEDURE: split (I_KEY, I_RRN, PAGE, PROMO_KEY, PROMO_R_CHILD, NEWPAGE)

    copy all keys and pointers from PAGE into a working page that
        can hold one extra key and child.
    insert I_KEY and I_RRN into their proper places in the working page.
    allocate and initialize a new page in the B-tree file to hold NEWPAGE.
    set PROMO_KEY to value of middle key, which will be promoted after
        the split.
    set PROMO_R_CHILD to RRN of NEWPAGE.
    copy keys and child pointers preceding PROMO_KEY from the working
        page to PAGE.
    copy keys and child pointers following PROMO_KEY from the working
        page to NEWPAGE.

end PROCEDURE

FIGURE 8.25 Split (I_KEY, I_RRN, PAGE, PROMO_KEY, PROMO_R_CHILD, NEWPAGE), a
procedure that inserts I_KEY and I_RRN, causing overflow, creates a new page called
NEWPAGE, distributes the keys between the original PAGE and NEWPAGE, and determines
which key and RRN to promote. The promoted key and RRN are returned via the arguments
PROMO_KEY and PROMO_R_CHILD.

When coded in a real language, insert() uses a number of support
functions. The most obvious one is split(), which creates a new page,
distributes the keys between the original page and the new page, and
determines which key and RRN to promote. Figure 8.25 contains a
description of a simple split() procedure, which is also encoded in C and
Pascal at the end of this chapter.

You should pay careful attention to how split() moves all of the data.
Note that only the key is promoted from the working page; the CHILD RRNs
are transferred back to PAGE and NEWPAGE. The RRN that is promoted
is the RRN of NEWPAGE, since NEWPAGE is the right descendent from
the promoted key. Figure 8.26 illustrates the movement of data among
PAGE, NEWPAGE, the working page, and the function arguments.

The version of split() described here is less efficient than might
sometimes be desirable, since it moves more data than it needs to. In
Exercise 17 you are asked to implement a more efficient version of split().

The Top Level

We need a routine to tie together our insert() and split() procedures
and to do some things that are not done by the lower-level routines. Our
driver routine must be able to do the following:

Open or create the B-tree file, and identify or create the root page.
Read keys to be stored in the B-tree, and call insert() to put the keys
in the tree.
Create a new root node when insert() splits the current root page.

The driver routine shown in Fig. 8.27 carries out these top-level tasks. It is
assumed that the RRN of the root node is stored in the B-tree file itself.

FIGURE 8.26 The movement of data in split(). (The contents of PAGE are
copied to the working page; I_KEY (B) and I_RRN (11) are inserted into the
working page; the contents of the working page are then divided between
PAGE and NEWPAGE, except for the middle key (H), which is promoted along
with the RRN (12) of NEWPAGE.)

MAIN PROCEDURE: driver

    if the B-tree file exists, then
        open B-tree file
    else
        create a B-tree file and place the first key in the root
    get RRN of root page from file and store it in ROOT
    get a key and store it in KEY
    while keys exist
        if (insert (ROOT, KEY, PROMO_R_CHILD, PROMO_KEY) == PROMOTION) then
            create a new root page with key := PROMO_KEY, left
                child := ROOT, and right child := PROMO_R_CHILD
            set ROOT to RRN of new root page
        get next key and store it in KEY
    endwhile
    write RRN stored in ROOT back to B-tree file
    close B-tree file

end MAIN PROCEDURE

FIGURE 8.27 Driver for building a B-tree.

The driver first checks to see if the B-tree file exists. If the file does exist,
driver opens it and gets the RRN of the root node. If it does not exist, driver
must create the file and build an original root page. Since a root must contain
at least one key, this involves getting the first key to be inserted in the tree
and placing it in the root. Next, driver reads in the keys to be inserted, one at
a time, and calls insert() to insert the keys into the B-tree file. If insert() splits
the root node, it promotes a key and right child in PROMO_KEY and
PROMO_R_CHILD, and driver uses these to create a new root.

8.10 B-Tree Nomenclature
Before moving on to discuss B-tree performance and variations on the basic
B-tree algorithms, we need to formalize our B-tree terminology. Providing
careful definitions of terms such as order and leaf enables us to state precisely
the properties that must be present for a data structure to qualify as a B-tree.
This definition of B-tree properties, in turn, informs our discussion of
matters such as the procedure for deleting keys from a B-tree.

Unfortunately, the literature on B-trees is not uniform in its use of
terms relating to B-trees. Reading that literature and keeping up with new
developments therefore require some flexibility and some background: The
reader needs to be aware of the different usages of some of the fundamental
terms.

For example, Bayer and McCreight (1972), Comer (1979), and a few
others refer to the order of a B-tree as the minimum number of keys that can
be in a page of a tree. So, our initial sample B-tree (Fig. 8.16), which can
hold a maximum of seven keys per page, has an order of three, using Bayer
and McCreight's terminology. The problem with this definition of order is
that it becomes clumsy when you try to account for pages that hold a
maximum number of keys that is odd. For example, consider the following
question: Within the Bayer and McCreight framework, is the page of an
order three B-tree full when it contains six keys or when it contains seven
keys?

Knuth (1973b) and others have addressed the odd/even confusion by
defining the order of a B-tree to be the maximum number of descendents that
a page can have. This is the definition of order that we use in this text. Note
that this definition differs from Bayer and McCreight's in two ways: It
references a maximum, not a minimum, and it counts descendents rather than
keys.

Use of Knuth's definition must be coupled with the fact that the
number of keys in a B-tree page is always one less than the number of
descendents from the page. Consequently, a B-tree of order 8 has a
maximum of seven keys per page. In general, given a B-tree of order m, the
maximum number of keys per page is m - 1.

When you split the page of a B-tree, the descendents are divided as
evenly as possible between the new page and the old page. Consequently,
every page except the root and the leaves has at least ⌈m/2⌉ descendents.
Expressed in terms of a ceiling function, we can say that the minimum
number of descendents is ⌈m/2⌉. It follows that the minimum number of
keys per page is ⌈m/2⌉ - 1. So our initial sample B-tree has an order of
eight, which means that it can hold no more than seven keys per page and
that all of the pages except the root contain at least three keys.
The other term that is used differently by different authors is leaf. Bayer
and McCreight refer to the lowest level of keys in a B-tree as the leaf level.
This is consistent with the nomenclature we have used in this text. Other
authors, including Knuth, consider the leaves of a B-tree to be one level
below the lowest level of keys. In other words, they consider the leaves to
be the actual data records that might be pointed to by the lowest level of
keys in the tree. We do not use this definition, sticking instead with the
notion of leaf as the lowest level of keys in the B-tree.

8.11 Formal Definition of B-Tree Properties

Given these definitions of order and leaf, we can formulate a precise
statement of the properties of a B-tree of order m:

1. Every page has a maximum of m descendents.
2. Every page, except for the root and the leaves, has at least ⌈m/2⌉
   descendents.
3. The root has at least two descendents (unless it is a leaf).
4. All the leaves appear on the same level.
5. A nonleaf page with k descendents contains k - 1 keys.
6. A leaf page contains at least ⌈m/2⌉ - 1 keys and no more than
   m - 1 keys.

8.12 Worst-case Search Depth
It is important to have a quantitative understanding of the relationship
between the page size of a B-tree, the number of keys to be stored in the
tree, and the number of levels that the tree can extend. For example, you
might know that you need to store 1,000,000 keys and that, given the
nature of your storage hardware and the size of your keys, it is reasonable
to consider using a B-tree of order 512 (maximum of 511 keys per page).
Given these two facts, you need to be able to answer the question, "In the
worst case, what will be the maximum number of disk accesses required to
locate a key in the tree?" This is the same as asking how deep the tree
will be.

We can answer this question by beginning with the observation that the
number of descendents from any level of a B-tree is one greater than the
number of keys contained at that level and all the levels above it. Figure 8.28
illustrates this relation for the tree we constructed earlier in this chapter.
This tree contains 27 keys (all the letters of the alphabet and $). If you count
the number of potential descendents trailing from the leaf level, you see that
there are 28 of them.

Next we need to observe that we can use the formal definition of B-tree
properties to calculate the minimum number of descendents that can extend
from any level of a B-tree of some given order. This is of interest because
we are interested in the worst-case depth of the tree. The worst case occurs
when every page of the tree has only the minimum number of descendents.
In such a case the keys are spread over a tree with maximal height and
minimal breadth.

FIGURE 8.28 A B-tree with (N + 1) descendents from the leaf level.

For a B-tree of order m, the minimum number of descendents from the
root page is two, so the second level of the tree contains only two pages.
Each of these pages, in turn, has at least ⌈m/2⌉ descendents. The third
level, then, contains

    2 × ⌈m/2⌉

pages. Since each of these pages, once again, has a minimum of ⌈m/2⌉
descendents, the general pattern of the relation between depth and the
minimum number of descendents takes the following form:

    Level          Minimum number of descendents
    1 (root)       2
    2              2 × ⌈m/2⌉
    3              2 × ⌈m/2⌉ × ⌈m/2⌉  or  2 × ⌈m/2⌉^2
    4              2 × ⌈m/2⌉^3
    ...
    d              2 × ⌈m/2⌉^(d-1)

So, in general, for any level d of a B-tree, the minimum number of
descendents extending from that level is

    2 × ⌈m/2⌉^(d-1).

We know that a tree with N keys has N + 1 descendents from its leaf
level. Let's call the depth of the tree at the leaf level d. Since the number of
descendents from any level of the tree cannot be less than the number of
descendents from a worst-case tree of that depth, we can express the
relationship between the N + 1 descendents and the minimum number of
descendents as

    N + 1 ≥ 2 × ⌈m/2⌉^(d-1).

Solving for d, we arrive at the following expression:

    d ≤ 1 + log⌈m/2⌉ ((N + 1)/2).

This expression gives us an upper bound for the depth of a B-tree with
N keys. Let's find the upper bound for the hypothetical tree that we describe
at the start of this section: a tree of order 512 that contains 1,000,000 keys.
Substituting these specific numbers into the expression, we find that

    d ≤ 1 + log₂₅₆ 500,000.5,

or

    d ≤ 3.37.

So we can say that, given 1,000,000 keys, a B-tree of order 512 has a depth
of no more than three levels.
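As a quick check of this arithmetic, the bound can also be computed directly. The small C function below is our own sketch of that calculation, using the standard math library; it is not one of the programs listed at the end of the chapter.

    #include <math.h>

    /* Upper bound on the depth of a B-tree of order m holding nkeys keys:
           d <= 1 + log((nkeys + 1) / 2) / log(ceil(m / 2))                 */
    double worst_case_depth(long nkeys, int m)
    {
        double base = ceil(m / 2.0);       /* the log base is ceil(m/2) */
        return 1.0 + log((nkeys + 1) / 2.0) / log(base);
    }

For m = 512 and nkeys = 1,000,000 this evaluates to about 3.37, matching the hand calculation above.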

8.13 Deletion, Redistribution, and Concatenation

Indexing 1,000,000 keys in no more than three levels of a tree is precisely
the kind of performance we are looking for. As we have just seen, this
performance is predicated on the B-tree properties we describe earlier; in
particular, the ability to guarantee that B-trees are broad and shallow rather
than narrow and deep is coupled to the rules that state the following:

Every page except for the root and the leaves has at least ⌈m/2⌉
descendents;
A nonleaf page with k descendents contains k - 1 keys; and
A leaf page contains at least ⌈m/2⌉ - 1 keys and no more than m - 1
keys.

We have already seen that the process of page splitting guarantees that
these properties are maintained when new keys are inserted into the tree.
We need to develop some kind of equally reliable guarantee that these
properties are maintained when keys are deleted from the tree.

Working through some simple deletion situations by hand helps us
demonstrate that the deletion of a key can result in several different
situations. Figure 8.29 illustrates each of these situations and the associated
response in the course of several deletions from an order six B-tree.

The simplest situation is illustrated in case 1. Deleting the key J does not
cause the contents of page 5 to drop below the minimum number of keys.
Consequently, deletion involves nothing more than removing the key from
the page and rearranging the keys within the page to close up the space.

Deleting the M (case 2) is more complicated. If we simply remove the
M from the root, it becomes very difficult to reorganize the tree to maintain
its B-tree structure. Since this problem can occur whenever we delete a key
from a nonleaf page, we always delete keys only from leaf pages. If a key
to be deleted is not in a leaf, there is an easy way to get it into a leaf: We
swap it with its immediate successor, which is guaranteed to be in a leaf,
then delete it immediately from the leaf. In our example, we can swap the M
with the N in page 6, then delete the M from page 6. This simple
operation does not put the N out of order, since all keys in the subtree of
which N is a part must be greater than N. (Can you see why this is the case?)

In case 3 we delete R from page 7. If we simply remove R and do
nothing more, the page that it is in has only one key. The minimum
number of keys for the leaf page of an order six tree is ⌈6/2⌉ - 1 = 2.
Therefore, we have to take some kind of action to correct this underflow
condition. Since the neighboring page 8 (called a sibling since it has the same
parent) has more than the minimum number of keys, the corrective action
consists of redistributing the keys between the pages. Redistribution must
also result in a change in the key that is in the parent page so it continues to
act as a separator between the lower-level pages. In the example, we move
the U and V into page 7, and move W into the separator position in page 2.

The deletion of A in case 4 results in a situation that cannot be resolved
by redistribution. Addressing the underflow in page 3 by moving keys
from page 4 only transfers the underflow condition. There are not enough
keys to share between two pages. The solution to this is concatenation,
combining the two pages and the key from the parent page to make a single
full page.

Concatenation is essentially the reverse of splitting. Like splitting, it can
propagate upward through the B-tree. Just as splitting promotes a key,
concatenation must involve demotion of keys, and this can in turn cause
underflow in the parent page. This is just what happens in our example.
Our concatenation of pages 3 and 4 pulls the key D from the parent page
down to the leaf level, leading to case 5: The loss of the D from the parent
page causes it, in turn, to underflow. Once again, redistribution does not
solve the problem, so concatenation must be used.

FIGURE 8.29 Six situations that can occur during deletions. Case 1: No action.
Delete J from page 5; since page 5 has more than the minimum number of keys,
J can be removed without reorganization. Case 2: Swap with immediate successor.
Swap M (page 0) with N (page 6), and then delete M from page 6. Case 3:
Redistribution. Delete R; underflow occurs; redistribute keys among pages 2, 7,
and 8 to restore balance between leaves, moving U and V into page 7 and
promoting a new separator into page 2. Case 4: Concatenation. Delete A;
underflow occurs, but it cannot be addressed by redistribution; concatenate the
keys from pages 3 and 4, plus the D from page 1, into one page. Case 5:
Underflow propagates upward. Now page 1 has underflow; again, we cannot
redistribute, so we concatenate. Case 6: Height of tree decreased. Since the root
contains only one key, it is absorbed into the new root.

Note that the propagation of the underflow condition does not
necessarily imply the propagation of concatenation. If page 2 (Q and W) had
contained another key, then redistribution, not concatenation, would be
used to resolve the underflow condition at the second level of the tree.

Case 6 shows what happens when concatenation propagates all the way
to the root. The concatenation of pages 1 and 2 absorbs the only key in the
root page, decreasing the height of the tree by one level.

The steps involved in deleting keys from a B-tree can be summarized as
follows:

1. If the key to be deleted is not in a leaf, swap it with its immediate
   successor, which is in a leaf.
2. Delete the key.
3. If the leaf now contains at least the minimum number of keys, no
   further action is required.
4. If the leaf now contains one too few keys, look at the left and right
   siblings.
   a. If a sibling has more than the minimum number of keys, redistribute.
   b. If neither sibling has more than the minimum, concatenate the
      two leaves and the median key from the parent into one leaf.
5. If leaves are concatenated, apply steps 3-6 to the parent.
6. If the last key from the root is removed, then the height of the tree
   decreases.
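The branch among these responses is driven entirely by key counts. The fragment below is our own C sketch of that decision for a tree of order m, where the minimum is ⌈m/2⌉ - 1 keys per page; it is meant only to show the choice among the three responses, not a complete deletion routine.

    #define NO_ACTION    0
    #define REDISTRIBUTE 1
    #define CONCATENATE  2

    /* After removing a key from a leaf, decide how to restore the
       B-tree properties, given the leaf's key count, the key count of
       its best sibling, and the order m of the tree.                   */
    int underflow_response(int keycount, int sibling_keycount, int m)
    {
        int minkeys = (m + 1) / 2 - 1;      /* ceil(m/2) - 1 */

        if (keycount >= minkeys)
            return NO_ACTION;               /* no underflow                 */
        if (sibling_keycount > minkeys)
            return REDISTRIBUTE;            /* sibling can spare a key      */
        return CONCATENATE;                 /* combine pages and parent key */
    }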

8.13.1 Redistribution

Unlike concatenation, which is a kind of reverse split, redistribution is a
new idea. Our insertion algorithm does not involve operations analogous to
redistribution.

Redistribution differs from both splitting and concatenation in that it
does not propagate. It is guaranteed to have strictly local effects. Note that
the term sibling implies that the pages have the same parent page. If there are
two nodes at the leaf level that are logically adjacent but do not have the
same parent (for example, IJK and NOP in the tree at the top of Fig. 8.29),
these nodes are not siblings. Redistribution algorithms are generally written
so they do not consider moving keys between nodes that are not siblings,
even when they are logically adjacent. Can you see the reasoning behind
this restriction?

Another difference between redistribution on the one hand and
concatenation and splitting on the other is that there is no necessary, fixed
prescription for how the keys should be rearranged. A single deletion in a
properly formed B-tree cannot cause an underflow of more than one key.
Therefore, redistribution can restore the B-tree properties by moving only
one key from a sibling into the page that has underflowed, even if the
distribution of the keys between the pages is very uneven. Suppose, for
example, that we are managing a B-tree of order 101. The minimum
number of keys that can be in a page is 50, the maximum is 100. Suppose
we have one page that contains the minimum and a sibling that contains the
maximum. If a key is deleted from the page containing 50 keys, an
underflow condition occurs. We can correct the condition through
redistribution by moving one key, 50 keys, or any number of keys that falls
between 1 and 50. The usual strategy is to divide the keys as evenly as
possible between the pages. In this instance that means moving 25 keys.
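For the usual divide-as-evenly-as-possible strategy, the number of keys to move is easy to compute. The one-line C sketch below is ours, added only for illustration.

    /* Keys to move from a sibling holding `big` keys into an underflowed
       page holding `small` keys so the two pages end up as evenly
       balanced as possible.                                              */
    int keys_to_move(int big, int small)
    {
        return (big - small) / 2;
    }

With big = 100 and small = 49 (the page that held 50 keys, after the deletion), this yields 25, the figure used in the example above.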

8.14 Redistribution during Insertion: A Way to Improve Storage Utilization

As you may recall, B-tree insertion does not require an operation analogous
to redistribution; splitting is able to account for all instances of overflow.
This does not mean, however, that it is not desirable to use redistribution
during insertion as an option, particularly since a set of B-tree maintenance
algorithms must already include a redistribution procedure to support
deletion. Given that a redistribution procedure is already present, what
advantage might we gain by using it as an alternative to node splitting?

Redistribution during insertion is a way of avoiding, or at least
postponing, the creation of new pages. Rather than splitting a full page and
creating two approximately half-full pages, redistribution lets us place some
of the overflowing keys into another page. The use of redistribution in place
of splitting should therefore tend to make a B-tree more efficient in terms
of its utilization of space.

It is possible to quantify this efficiency of space utilization by viewing
the amount of space used to store information as a percentage of the total
amount of space required to hold the B-tree. After a node splits, each of the
two resulting pages is about half full. So, in the worst case, space utilization
in a B-tree using two-way splitting is around 50%. Of course, the actual
degree of space utilization is better than this worst-case figure. Yao (1978)
has shown that, for large trees of relatively large order, space utilization
approaches a theoretical average of about 69% if insertion is handled
through two-way splitting.

The idea of using redistribution as an alternative to splitting when
possible, splitting a page only when both of its siblings are full, is
introduced in Bayer and McCreight's original paper (1972). The paper
includes some experimental results that show that two-way splitting results
in a space utilization of 67% for a tree of order 121 after 5,000 random
insertions. When the experiment was repeated, using redistribution when
possible, space utilization increased to over 86%. Subsequent empirical
testing by Davis (1974) (B-tree of order 49) and Crotzer (1975) (B-tree of
order 303) also resulted in space utilization exceeding 85% when redistribution
was used. These findings and others suggest that any serious
application of B-trees to even moderately large files should implement
insertion procedures that handle overflow through redistribution when
possible.

8.15 B* Trees

In his review and amplification of work on B-trees in 1973, Knuth (1973b)
extends the notion of redistribution during insertion to include new rules
for splitting. He calls the resulting variation on the fundamental B-tree form
a B* tree.

Consider a system in which we are postponing splitting through
redistribution, as outlined in the preceding section. If we are considering
any page other than the root, we know that when it finally is time to split,
the page has at least one sibling that is also full. This opens up the possibility
of a two-to-three split rather than the usual one-to-two or two-way split.
Figure 8.30 illustrates such a split.

The important aspect of this two-to-three split is that it results in pages
that are each about two-thirds full rather than just half full. This makes it
possible to define a new kind of B-tree, called a B* tree, which has the
following properties:

1. Every page has a maximum of m descendents.
2. Every page except for the root and the leaves has at least (2m - 1)/3
   descendents.
3. The root has at least two descendents (unless it is a leaf).
4. All the leaves appear on the same level.
5. A nonleaf page with k descendents contains k - 1 keys.
6. A leaf page contains at least ⌊(2m - 1)/3⌋ keys and no more than
   m - 1 keys.

The critical changes between this set of properties and the set we define
for a conventional B-tree are in rules 2 and 6: a B* tree has pages that
contain a minimum of ⌊(2m - 1)/3⌋ keys. This new property, of course,
affects procedures for deletion and redistribution.

FIGURE 8.30 A two-to-three split, showing the original tree and the result
after the insertion of the key B.

To implement B* tree procedures, one must also deal with the question
of splitting the root, which, by definition, never has a sibling. If there is no
sibling, no two-to-three split is possible. Knuth suggests allowing the root
to grow to a size larger than the other pages so, when it does split, it can
produce two pages that are each about two-thirds full. This suggestion has
the advantage of ensuring that all pages below the root level adhere to B*
tree characteristics. However, it has the disadvantage of requiring that the
procedures be able to handle a page that is larger than all the others. Another
solution is to handle the splitting of the root as a conventional one-to-two
split. This second solution avoids any special page-handling logic. On the
other hand, it complicates deletion, redistribution, and other procedures
that must be sensitive to the minimum number of keys allowed in a page.
Such procedures would have to be able to recognize that pages descending
from the root might legally be only half full.
8.16 Buffering of Pages: Virtual B-Trees

We have seen that, given some additional refinements, the B-tree can be a
very efficient, flexible storage structure that maintains its balanced properties
after repeated deletions and insertions and that provides access to any
key with just a few disk accesses. However, focusing on just the structural
aspects, as we have so far, can cause us inadvertently to overlook ways of
using this structure to full advantage. For example, the fact that a B-tree has
a depth of three levels does not at all mean that we need to do three disk
accesses to retrieve keys from pages at the leaf level. We can do much better
than that.

Obtaining better performance from B-trees involves looking in a more
precise way at our original problem. We needed to find a way to make
efficient use of indexes that are too large to be held entirely in RAM. Up to
this point we have approached this problem in an all-or-nothing way: An
index has been either held entirely in RAM, organized as a list or binary
tree, or has been accessed entirely on secondary store, using a B-tree
structure. But, stating that we cannot hold ALL of an index in RAM does
not imply that we cannot hold some of it there.

For example, assume we have an index that contains a megabyte of
records and that we cannot reasonably use more than 256 K of RAM for
index storage at any given time. Given a page size of 4 K, holding around
64 keys per page, our B-tree can be contained in three levels. We can reach
any one of our keys in no more than three disk accesses. That is certainly
acceptable, but why should we settle for this kind of performance? Why not
try to find a way to bring the average number of disk accesses per search
down to one disk access or less?

Thinking of the problem strictly in terms of physical storage structures,
retrieval averaging one disk access or less sounds impossible. But,
remember, our objective was to find a way to manage our megabyte of
index within 256 K of RAM, not within the 4 K required to hold a single
page of our tree.

We know that every search through the tree requires access to the root
page. Rather than accessing the root page again and again at the start of
every search, we could read the root page into RAM and just keep it there.
This strategy increases our RAM requirement from 4 K to 8 K, since we
need 4 K for the root and 4 K for whatever other page we read in, but this
is still much less than the 256 K that are available. This very simple strategy
reduces our worst-case search to two disk accesses, and the average search
to under two accesses (keys in the root require no disk access; keys at the
first level require one access).

This simple, keep-the-root strategy suggests an important, more
general approach: Rather than just holding the root page in RAM, we can
create a page buffer to hold some number of B-tree pages, perhaps 5, 10, or
more. As we read pages in from the disk in response to user requests, we fill
up the buffer. Then, when a page is requested, we access it from RAM if we
can, thereby avoiding a disk access. If the page is not in RAM, then we read
it into the buffer from secondary storage, replacing one of the pages that
was previously there. A B-tree that uses a RAM buffer in this way is
sometimes referred to as a virtual B-tree.

8.16.1 LRU Replacement

Clearly, such a buffering scheme works only if we are more likely to
request a page that is in the buffer than one that is not. The process of
accessing the disk to bring in a page that is not already in the buffer is called
a page fault. There are two causes of page faults:

1. We have never used the page.
2. It was once in the buffer but has since been replaced with a new page.

The first cause of page faults is unavoidable: If we have not yet read in
and used a page, there is no way it can already be in the buffer. But the
second cause is one we can try to minimize through buffer management.
The critical management decision arises when we need to read a new page
into a buffer that is already full: Which page do we decide to replace?
One common approach is to replace the page that was least recently
used; this is called LRU replacement. Note that this is different from
replacing the page that was read into the buffer least recently. Since the root
page is always read in first, simply replacing the oldest page results in
replacing the root, which is an undesirable outcome. Instead, the LRU
method keeps track of the actual requests for pages. Since the root is
requested on every search, it seldom, if ever, is selected for replacement.
The page to be replaced is the one that has gone the longest time without a
request for use.

Some research by Webster (1980) shows the effect of increasing the
number of pages that can be held in the buffer area under an LRU
replacement strategy. Table 8.1 summarizes a small but representative
portion of Webster's results. It lists the average number of disk accesses per
search given different numbers of page buffers. These results are obtained
using a simple LRU replacement strategy without accounting for page
height.

TABLE 8.1 Effect of using more buffers with a simple LRU replacement strategy
(Number of keys = 2,400; total pages = 140; tree height = 3 levels)

    Buffer count                     1       5       10      20
    Average accesses per search      3.00    1.71    1.42    0.97

Webster's study was conducted using B+ trees rather than simple
B-trees. In the next chapter, where we look closely at B+ trees, you see that
the nature of B+ trees accounts for the fact that, given one buffer, the
average search length is 3.00. With B+ trees, all searches must go all the way
to the leaf level every time. The fact that Webster used B+ trees, however,
does not detract from the usefulness of his results as an illustration of the
positive impact of page buffering. Keeping less than 15% of the tree in
RAM (20 pages out of the total 140) reduces the average number of accesses
per search to less than one. Note that the results are even more dramatic with
a simple B-tree, since not all searches have to proceed to the leaf level.

The decision to use LRU replacement is based on the assumption that
we are more likely to need a page that we have used recently than we are to
need a page that we have never used or one that we used some time ago. If
this assumption is not valid, then there is absolutely no reason to preferentially
retain pages that were used recently. The term for this kind of
assumption is temporal locality. We are assuming that there is a kind of
clustering of the use of certain pages over time. The hierarchical nature of a
B-tree makes this kind of assumption reasonable. For example, during
redistribution after overflow or underflow, we access a page and then access
its sibling. Because B-trees are hierarchical, accessing a set of sibling pages
involves repeated access to the parent page in rapid succession. This is an
instance of temporal locality; it is easy to see how it is related to the tree's
hierarchy.
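To make the idea concrete, here is a rough C sketch of a small LRU page buffer built on the BTPAGE structure and the btread() helper sketched earlier in the chapter. The structure, the names, and the linear scan are ours, chosen only for illustration; the sketch handles reading only, and a real buffer would also have to write modified pages back to disk before replacing them. The slots are assumed to be initialized with rrn = NIL and lastused = 0 before the first call.

    #define BUFSLOTS 10

    struct bufslot {
        short         rrn;       /* which page this slot holds; NIL if empty */
        long          lastused;  /* logical clock value of most recent use   */
        struct BTPAGE page;
    };

    struct bufslot buffer[BUFSLOTS];
    long           clockval = 0;

    /* Return a pointer to the requested page, reading from disk only on a
       page fault; the least-recently-used slot is the one replaced.        */
    struct BTPAGE *getpage(short rrn)
    {
        int i, victim = 0;

        for (i = 0; i < BUFSLOTS; i++)
            if (buffer[i].rrn == rrn) {        /* hit: no disk access */
                buffer[i].lastused = ++clockval;
                return &buffer[i].page;
            }

        for (i = 1; i < BUFSLOTS; i++)         /* miss: find LRU victim */
            if (buffer[i].lastused < buffer[victim].lastused)
                victim = i;

        btread(rrn, &buffer[victim].page);     /* page fault: read from disk */
        buffer[victim].rrn      = rrn;
        buffer[victim].lastused = ++clockval;
        return &buffer[victim].page;
    }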

8.16.2 Replacement Based on Page Height

There is another, more direct way to use the hierarchical nature of the
B-tree to guide decisions about page replacement in the buffers. Our
simple, keep-the-root strategy exemplifies this alternative: Always retain
the pages that occur at the highest levels of the tree. Given a larger amount
of buffer space, it might be possible to retain not only the root, but also all
of the pages at the second level of a tree.

Let's explore this notion by returning to a previous example in which
we have access to 256 K of RAM and a 1-megabyte index. Since our page
size is 4 K, we could build a buffer area that holds 64 pages within the RAM
area. Assume that our 1 megabyte worth of index requires around 1.2
megabytes of storage on disk (storage utilization = 83%). Given the 4 K
page size, this 1.2 megabytes requires slightly more than 300 pages. We
assume that, on the average, each of our pages has around 30 descendents.
It follows that our three-level tree has, of course, a single page at the root
level, followed by 9 or 10 pages at the second level, with all the remaining
pages at the leaf level.

Using a page replacement strategy that always retains the higher-level
pages, it is clear that our 64-page buffer eventually contains the root page
and all the pages at the second level. The approximately 50 remaining
buffer slots are used to hold leaf-level pages. Decisions about which of these
pages to replace can be handled through an LRU strategy. For many
searches, all of the pages required are already in the buffer; the search
requires no disk accesses. It is easy to see how, given a sizable buffer, it is
possible to bring the average number of disk accesses per search down to a
number that is less than one.

Webster's research (1980) also investigates the effect of taking page
height into account, giving preference to pages that are higher in the tree
when it comes time to decide which pages to keep in the buffers.
Augmenting the LRU strategy with a weighting factor that accounts for
page height reduces the average number of accesses, given a 10-page buffer,
from 1.42 accesses per search down to 1.12 accesses per search.

8.16.3 Importance of Virtual B-Trees

It is difficult to overemphasize the importance of including a page buffering
scheme in any implementation of a B-tree index structure. Because the
B-tree structure itself is so interesting and powerful, it is easy to fall into the
trap of thinking that the B-tree organization is itself a sufficient solution to
the problem of accessing large indexes that must be maintained on
secondary storage. As we have emphasized, to fall into that trap is to lose
sight of the original problem: to find a way to reduce the amount of memory
required to handle large indexes. We did not, however, need to reduce the
amount of memory to the amount required for a single index page. It is
usually possible to find enough memory to hold a number of pages. Doing
so can dramatically increase system performance.

8.17 Placement of Information Associated with the Key

Early in this chapter we focused on the B-tree index itself, setting aside any
consideration of the actual information associated with the keys. We
paraphrased Bayer and McCreight and stated that "the associated information
is of no further interest."

But, of course, in any actual application the associated information is,
in fact, the true object of interest. Rarely do we ever want to index keys just
to be able to find the keys themselves. It is usually the information
associated with the key that we really want to find. So, before closing our
discussion of B-tree indexes, it is important to turn to the question of where
and how to store the information indexed by the keys in the tree.

Fundamentally, we have two choices. We can

Store the information in the B-tree along with the key; or
Place the information in a separate, distinct file; within the index we
couple the key with a relative record number or byte address pointer
that references the location of the information in that separate file.

The advantage that the first approach has over the second is that once the
key is found, no more disk accesses are required. The information is right
there with the key. However, if the amount of information associated with
each key is relatively large, then storing the information with the key
reduces the number of keys that can be placed in a page of the B-tree. As the
number of keys per page is reduced, the order of the tree is reduced, and the
tree tends to become taller since there are fewer descendents from each
page. So, the advantage of the second method is that, given associated
information that has a long length relative to the length of a key, placing the
associated information elsewhere allows us to build a higher-order and
therefore possibly shallower tree.

For example, assume we need to index 1,000 keys and associated
information records. Suppose that the length required to store a key and its
associated information is 128 bytes. Furthermore, suppose that if we store
the associated information elsewhere, we can store just the key and a
pointer to the associated information in only 16 bytes. Given a B-tree page
that has 512 bytes available for keys and associated information, the two
fundamental storage alternatives translate into the following orders of
B-trees:

    Information stored with key: four keys per page, an order five tree
    Pointer stored with key:     32 keys per page, an order 33 tree

Using the formula developed earlier for finding the worst-case depth of
B-trees:

    d(info w/key)     ≤ 1 + log₃ 500.5  ≈ 6.66
    d(info elsewhere) ≤ 1 + log₁₇ 500.5 ≈ 3.19

So, if we store the information with the keys, the tree has a worst-case
depth of six levels. If we store the information elsewhere, we end up
reducing the height of the worst-case tree to three. Even though the
additional indirection associated with the second method costs us one
additional disk access, the second method still reduces the total number of
disk accesses required to find a record in the worst case.
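If the worst_case_depth() sketch from Section 8.12 is available, the comparison can be checked in a couple of lines; the values below are the same ones worked out by hand above.

    double with_key  = worst_case_depth(1000L, 5);   /* info stored with the key: order 5     */
    double elsewhere = worst_case_depth(1000L, 33);  /* pointer stored with the key: order 33 */

    /* with_key  is about 6.66, so six levels in the worst case;
       elsewhere is about 3.19, so three levels plus one extra access
       to reach the record in the separate data file                   */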

In general, then, the decision about where to store the associated
information should be guided by some calculations that compare the depths
of the trees that result. The critical factor that influences these calculations
is the ratio of overall record length to the length of just a key and pointer.
If you can put many key/pointer pairs in the area required for a single, full
key/record pair, it is probably advisable to remove the associated information
from the B-tree and put it in a separate file.

8.18 Variable-length Records and Keys

In many applications the information associated with a key varies in length.
Secondary indexes referencing inverted lists are an excellent example of
this. One way to handle this variability is to place the associated
information in a separate, variable-length record file; the B-tree would
contain a reference to the information in this other file. Another approach
is to allow a variable number of keys and records in a B-tree page.

Up to this point we have regarded B-trees as being of some order m.
Each page has a fixed maximum and minimum number of keys that it can
legally hold. The notion of a variable-length record, and, therefore, a
variable number of keys per page, is a significant departure from the point
of view we have developed so far. A B-tree with a variable number of keys
per page clearly has no single, fixed order.

The variability in length can also extend to the keys themselves as well
as to entire records. For example, in a file in which people's names are the
keys, we might choose to use only as much space as required for a name,
rather than allocate a fixed-size field for each key.

As we saw in earlier chapters, implementing a structure with
variable-length fields can allow us to put many more names in a given
amount of space, since it does away with internal fragmentation. If we can
put more keys in a page, then we have a larger number of descendents from
a page and, very probably, a tree with fewer levels.

Accommodating this variability in length means using a different kind
of page structure. We look at page structures appropriate for use with
variable-length keys in detail in the next chapter, where we discuss B+
trees. We also need a different criterion for deciding when a page is full and
when it is in an underflow condition. Rather than use a maximum and
minimum number of keys per page, we need to use a maximum and
minimum number of bytes.

Once the fundamental mechanisms for handling variable-length keys or
records are in place, interesting new possibilities emerge. For example, we
might consider the notion of biasing the key promotion mechanism so the

shortest variable-length keys (or key/record pairs) are promoted upward in
preference to longer keys. The idea is that we want to have pages with the
largest numbers of descendents up high in the tree, rather than at the leaf
level. Branching out as broadly as possible as high as possible in the tree
tends to reduce the overall height of the tree. McCreight (1977) explores
this notion in the article "Pagination of B* Trees with Variable-Length
Records."

The principal point we want to make with these examples of variations
on B-tree structures is that this chapter introduces only the most basic forms
of this very useful, flexible file structure. Actual implementations of B-trees
do not slavishly follow the textbook form of B-trees. Instead, they use
many of the other organizational techniques we study in this book, such as
variable-length record structures, in combination with the fundamental
B-tree organization to make new, special-purpose file structures uniquely
suited to the problems at hand.

SUMMARY

We begin this chapter by picking up the problem we left unsolved at the end
of Chapter 6: Simple, linear indexes work well if they are held in electronic
RAM memory, but are expensive to maintain and search if they are so big
that they must be held on secondary storage. The expense of using
secondary storage is most evident in two areas:

Sorting of the index; and
Searching, since even binary searching requires more than just two
or three disk accesses.

We first address the question of structuring an index so it can be kept
in order without sorting. We use tree structures to do this, discovering that
we need a balanced tree to ensure that the tree does not become overly deep
after repeated random insertions. We see that AVL trees provide a way of
balancing a binary tree with only a small amount of overhead.

Next we turn to the problem of reducing the number of disk accesses
required to search a tree. The solution to this problem involves dividing the
tree into pages, so a substantial portion of the tree can be retrieved with a
single disk access. Paged indexes let us search through very large numbers
of keys with only a few disk accesses.

Unfortunately, we find that it is difficult to combine the idea of paging
of tree structures with the balancing of these trees by AVL methods. The
most obvious evidence of this difficulty is associated with the problem of
selecting the members of the root page of a tree or subtree when the tree is
built in the conventional top-down manner. This sets the stage for

SUMMARY

work on

introducing Bayer and McCreight's

B-trees,

paging and balancing dilemma by starting from the leaf

which solves the


level, promoting

keys upward as the tree grows.

Our discussion of B-trees begins with examples of searching, insertion, splitting, and promotion to show how B-trees grow while maintaining balance in a paged structure. Next we formalize our description of B-trees. This formal definition permits us to develop a formula for estimating worst-case B-tree depth. The formal description also motivates our work on developing deletion procedures that maintain the B-tree properties when keys are removed from a tree.
Once the fundamental structure and procedures for B-trees are in place, we begin refining and improving on these ideas. The first set of improvements involves increasing the storage utilization within B-trees. Of course, increasing storage utilization can also result in a decrease in the height of the tree, and therefore in improvements in performance. We find that by sometimes redistributing keys during insertion, rather than splitting pages, we can improve storage utilization in B-trees so it averages around 85%. Carrying our search for increased storage efficiency even farther, we find that we can combine redistribution during insertion with a different kind of splitting to ensure that the pages are about two-thirds full, rather than only one-half full, after the split. Trees using this combination of redistribution and two-to-three splitting are called B* trees.
Next we turn to the matter of buffering pages, creating a virtual B-tree. We note that the use of memory is not an all-or-nothing choice: Indexes that are too large to fit entirely into RAM do not have to be accessed entirely from secondary storage. If we hold pages that are likely to be reused in RAM, then we can save the expense of reading these pages in from the disk again. We develop two methods of guessing which pages are to be reused. One method uses the height of the page in the tree to decide which pages to keep: Keeping the root has the highest priority, the root's descendents have the next priority, and so on. The second method for selecting pages to keep in RAM is based on recentness of use: We always replace the least-recently-used (LRU) page, retaining the pages used most recently. We see that it is possible to combine these methods, and that doing so can result in the ability to find keys while using an average of less than one disk access per search.
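
As a rough illustration of the least-recently-used strategy, the following C sketch keeps a small array of buffered pages and replaces the one with the oldest access stamp. The buffer size, the page size, and the caller-supplied fetch_page routine are placeholders for illustration only; they are not part of the programs developed in this chapter.

#define BUFPAGES 10                /* hypothetical buffer size                  */

typedef struct {
    short rrn;                     /* which B-tree page is held in this slot    */
    long  last_used;               /* logical clock value of the last access    */
    char  data[512];               /* page image; the size is arbitrary here    */
} BUFSLOT;

static BUFSLOT buf[BUFPAGES];      /* last_used == 0 marks an empty slot        */
static long    clock_tick = 0;

/* Return a pointer to the buffered copy of page rrn, reading it from disk
 * (through fetch_page) only when it is not already resident.  On a miss,
 * the least-recently-used slot is the one overwritten.
 */
char *get_page(short rrn, void (*fetch_page)(short, char *))
{
    int i, victim = 0;

    clock_tick++;
    for (i = 0; i < BUFPAGES; i++) {
        if (buf[i].last_used != 0 && buf[i].rrn == rrn) {
            buf[i].last_used = clock_tick;   /* hit: refresh the stamp          */
            return buf[i].data;
        }
        if (buf[i].last_used < buf[victim].last_used)
            victim = i;                      /* remember the oldest slot so far */
    }
    fetch_page(rrn, buf[victim].data);       /* miss: replace the LRU slot      */
    buf[victim].rrn       = rrn;
    buf[victim].last_used = clock_tick;
    return buf[victim].data;
}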

We then turn to the question of where to place the information associated with a key in the B-tree index. Storing it with the key is attractive because, in that case, finding the key is the same as finding the information; no additional disk accesses are required. However, if the associated information takes up a lot of space, it can reduce the order of the tree, thereby increasing the tree's height. In such cases it is often advantageous to store the associated information in a separate file.


We close the chapter with a brief look at the use of variable-length records within the pages of a B-tree, noting that significant savings in space and consequent reduction in the height of the tree can result from the use of variable-length records. The modification of the basic textbook B-tree definition to include the use of variable-length records is just one example of the many variations on B-trees that are used in real-world implementations.

KEY TERMS

AVL tree. A height-balanced (HB(1)) binary tree in which insertions and deletions can be performed with minimal accesses to local nodes. AVL trees are interesting because they keep branches from getting overly long after many random insertions.
B-tree of order m. A multiway search tree with these properties:
1. Every node has a maximum of m descendents.
2. Every node except the root and the leaves has at least ⌈m/2⌉ descendents.
3. The root has at least two descendents (unless it is a leaf).
4. All of the leaves appear on the same level.
5. A nonleaf page with k descendents contains k - 1 keys.
6. A leaf page contains at least ⌈m/2⌉ - 1 keys and no more than m - 1 keys.

B-trees are built upward from the leaf level, so creation of new pages always starts at the leaf level. The power of B-trees lies in the facts that they are balanced (no overly long branches); they are shallow (requiring few seeks); they accommodate random deletions and insertions at a relatively low cost while remaining in balance; and they guarantee at least 50% storage utilization.

B* tree. A special B-tree in which each node is at least two-thirds full. B* trees generally provide better storage utilization than do B-trees.
Concatenation. When a B-tree node underflows (becomes less than 50% full), it sometimes becomes necessary to combine the node with an adjacent node, thus decreasing the total number of nodes in the tree. Since concatenation involves a change in the number of nodes in the tree, its effects can require reorganization at many levels of the tree.

Height-balanced tree. A tree structure with a special property: For each node there is a limit to the amount of difference that is allowed among the heights of any of the node's subtrees. An HB(k) tree allows subtrees to be k levels out of balance. (See AVL tree.)
Leaf of a B-tree. A page at the lowest level in a B-tree. All leaves in a
B-tree occur at the same level.
Order of a B-tree. The maximum number of descendents that a node in the B-tree can have.

Paged index. An index that is divided into blocks, or pages, each of which can hold many keys. The use of paged indexes allows us to search through very large numbers of keys with only a few disk accesses.

Promotion of a key. The movement of a key from one node into a higher-level node (creating the higher-level node, if necessary) when the original node becomes overfull and must be split.
Redistribution. When a B-tree node underflows (becomes less than 50% full), it may be possible to move keys into the node from an adjacent node with the same parent. This helps ensure that the 50%-full property is maintained. When keys are redistributed, it becomes

necessary to alter the contents of the parent as well. Redistribution,


as opposed to concatenation, does not involve creation or deletion of
nodes; its effects are entirely local. Redistribution can also often be
used as an alternative to splitting.
Splitting. Creation of two nodes out of one because the original node
becomes overfull. Splitting results in the need to promote a key to a
higher-level node to provide an index separating the two new nodes.
Virtual B-tree. A B-tree index in which several pages are kept in RAM in anticipation of the possibility that one or more of them will be needed by a later access. Many different strategies can be applied to replacing pages in RAM when virtual B-trees are used, including the least-recently-used strategy and height-weighted strategies.

EXERCISES
1. Balanced binary trees can be effective index structures for RAM-based indexing, but they have several drawbacks when they become so large that part or all of them must be kept on secondary storage. The following questions should help bring these drawbacks into focus, and thus reinforce the need for an alternative structure such as the B-tree.
a. There are two major problems with using binary search to search a simple sorted index on secondary storage: The number of disk accesses is larger than we would like; and the time it takes to keep the index sorted is substantial. Which of the problems does a binary search tree alleviate?

b. Why is it important to keep search trees balanced?
c. In what way is an AVL tree better than a simple binary search tree?
d. Suppose you have a file with 1,000,000 keys stored on disk in a completely full, balanced binary search tree. If the tree is not paged, what is the maximum number of accesses required to find a key? If the tree is paged in the manner illustrated in Fig. 8.12, but with each page able to hold 15 keys and to branch to 16 new pages, what is the maximum number of accesses required to find a key? If the page size is increased to hold 511 keys with branches to 512 nodes, how does the maximum number of accesses change?
e. Consider the problem of balancing the three-key-per-page tree in Fig. 8.13 by rearranging the pages. Why is it difficult to create a tree-balancing algorithm that has only local effects?


size increases to a

more

become difficult to guarantee that each of the pages


some minimum number of keys?

g.

Although B-trees

downward from

search trees for external searching, binary trees are

3.

Why

Describe the necessary parts of


differ

from an

is

a leaf

still

commonly

this so?

node of a B-tree.

How

does

a leaf

internal node?

Since leaf nodes never have children,

pointer fields in a leaf node

need for pointer

the top.

are generally considered superior to binary

used for internal searching.


2.

contains at least

Explain the following statement: B-trees are built upward from

f.

the bottom, whereas binary trees are built

node

When the page


why does it

512 keys),

likely size (such as

it

might be possible

to use the

to point to data records. This could eliminate the

fields to data records in the internal nodes.

Why? What

are

the implications of doing this in terms of storage utilization and retrieval

time?
4. Show the B-trees of order four that result from loading the following sets of keys in order.
a. C G J X
b. C G J X N S U O A E B H I
c. C G J X N S U O A E B H I F
d. C G J X N S U O A E B H I F K L Q R T V U W Z

5. Figure 8.23 shows the pattern of recursive calls involved in inserting a $ into the B-tree in Fig. 8.22. Suppose that subsequent to this insertion, the character [ is inserted after the Z. (The ASCII code for [ is greater than the ASCII code for Z.) Draw a figure similar to Fig. 8.23 which shows the pattern of recursive calls required to perform this insertion.


6. Given a B-tree of order 256:
a. What is the maximum number of descendents from a page?
b. What is the minimum number of descendents from a page (excluding the root and leaves)?
c. What is the minimum number of descendents from the root?
d. What is the minimum number of descendents from a leaf?
e. How many keys are there on a nonleaf page with 200 descendents?
f. What is the maximum depth of the tree if it contains 100,000 keys?

7. Using a method similar to that used to derive the formula for worst-case depth, derive a formula for best case, or minimum depth, for an order m B-tree with N keys. What is the minimum depth of the tree described in the preceding question?

8. Suppose you have a B-tree index for an unsorted data file containing N records, where each key has stored with it the RRN of the corresponding record. The depth of the B-tree is d. What are the maximum and minimum numbers of disk accesses required to
a. Retrieve a record;
b. Add a record;
c. Delete a record; and
d. Retrieve all records from the file in sorted order.
Assume that page buffering is not used. In each case, indicate how you arrived at your answer.

9. Show the trees that result after each of the keys A, B, Q, and R is deleted from the following B-tree of order five.

[Figure: a B-tree of order five; the recoverable pages include a root containing D and H and leaf pages such as A B C, F, K L, and N O.]

10. A common belief about B-trees is that a B-tree cannot grow deeper unless it is 100% full. Discuss this.

11. Suppose you want to delete a key from a node in a B-tree. You look at the right sibling and find that redistribution does not work; concatenation would be necessary. You look to the left and see that redistribution is an option here. Do you choose to concatenate or redistribute?

12. What is the difference between a B* tree and a B-tree? What improvement does a B* tree offer over a B-tree, and what complications does it introduce? How does the minimum depth of an order m B* tree compare with that of an order m B-tree?

13. What is a virtual B-tree? How can it be possible to average fewer than one access per key when retrieving keys from a three-level virtual B-tree? Write a pseudocode description for an LRU replacement scheme for a 10-page buffer used in implementing a virtual B-tree.

14. Discuss the trade-offs between storing the information indexed by the keys in the B-tree with the key and storing the information in a separate file.

15. We noted that, given variable-length keys, it is possible to optimize a tree by building in a bias toward promoting shorter keys. With fixed-order trees we promote the middle key. In a variable-order, variable-length key tree, what is the meaning of "middle key"? What are the trade-offs associated with building in a bias toward shorter keys in this selection of a key for promotion? Outline an implementation for this selection and promotion process.

Programming Exercises
16. Implement the programs created at the end of this chapter and add a recursive procedure that performs a parenthesized symmetric traversal of the B-tree created by the program. As an example, here is the result of a parenthesized traversal of the tree shown in Fig. 8.18:
(((A,B,C)D(E,F,G)H(I,J)K(L,M))N((O,P)Q(R)S(T,U,V)W(X,Y,Z)))

17. The split() routine in the B-tree programs is not very efficient. Rewrite it to make it more efficient.

18. Write a program that searches for a key in a B-tree.

19. Write an interactive program that allows a user to find, insert, and delete keys from a B-tree.

20. Write a B-tree program that uses keys that are strings, rather than single characters.

21. Write a program that builds a B-tree index for a data file in which records contain more information than just a key.

FURTHER READINGS
Currently available textbooks on file and data structures contain surprisingly brief discussions on B-trees. These discussions do not, in general, add substantially to the
information presented in this chapter and the following chapter. Consequently,
readers interested in more information about B-trees must turn to the articles that
have appeared in journals over the past 15 years.
The article that introduced B-trees to the world is Bayer and McCreight's
"Organization and Maintenance of Large Ordered Indexes" (1972). It describes the
theoretical properties of B-trees and includes empirical results concerning, among
other things, the effect of using redistribution in addition to splitting during
insertion. Readers should be aware that the notation and terminology used in this
article differ from that used in this text in a number of important respects.
Comer's (1979) survey article, "The Ubiquitous B-tree," provides an excellent
overview of some important variations on the basic B-tree form. Knuth's (1973b)
discussion of B-trees, although brief, is an important resource, in part because many
of the variant forms such as B* trees were first collected together in Knuth's
discussion. McCreight (1977) looks specifically at operations on trees that use
variable-length records and that are therefore of variable order. Although this article
speaks specifically about B* trees, the consideration of variable-length records can
be applied to many other B-tree forms. In "Time and Space Optimality on B-trees,"
Rosenberg and Snyder (1981) analyze the effects of initializing B-trees with the
minimum number of nodes. In "Analysis of Design Alternatives for Virtual
Memory Indexes," Murayama and Smith (1977) look at three factors that affect the
cost of retrieval: choice of search strategy, whether or not pages in the index are
structured, and whether or not keys are compressed. Zoellick (1986) discusses the
use of B-tree- like structures on optical discs.
Since B-trees in various forms have become a standard file organization for
databases, a good deal of interesting material on applications of B-trees can be found
in the database literature. Ullman (1986), Held and Stonebraker (1978), and Snyder
(1978) discuss the use of B-trees in database systems generally. Ullman (1986) covers
the problem of dealing with applications in which several programs have access to
the same database concurrently and identifies literature concerned with concurrent

access to B-trees.

Uses of B-trees for secondary key access are covered in many of the previously
cited references. There is also a growing literature on multidimensional dynamic
indexes, including a B-tree- like structure called a k-d B-tree. K-d B-trees are


described in papers by Ouskel and Scheuermann (1981) and Robinson (1981). Other approaches to secondary indexing include the use of tries and grid files. Tries are covered in many texts on files and data structures, including Knuth (1973b) and Loomis (1983). Grid files are covered thoroughly in Nievergelt et al. (1984).
An interesting early paper on the use of dynamic tree structures for processing files is "The Use of Tree Structures for Processing Files," by Sussenguth (1963). Wagner (1973) and Keehn and Lacy (1974) examine the index design considerations that led to the development of VSAM. VSAM uses an index structure very similar to a B-tree, but appears to have been developed independently of Bayer and McCreight's work. Readers interested in learning more about AVL trees will find a good, approachable discussion of the algorithms associated with these trees in Standish (1980). Knuth (1973b) takes a more rigorous, mathematical look at AVL tree operations and properties.

C Programs to Insert Keys into a B-Tree

The C program that follows implements the insert program described in the text. The only difference between this program and the one in the text is that this program builds a B-tree of order five, whereas the one in the text builds a B-tree of order four. Input characters are taken from standard I/O, with q indicating end of data. The program requires the use of functions from several files:

driver.c    Contains the main program, which parallels the driver program described in the text very closely.
insert.c    Contains insert(), the recursive function that finds the proper place for a key, inserts it, and supervises splitting and promotions.
btio.c      Contains all support functions that directly perform I/O. The header files fileio.h and stdio.h must be available for inclusion in btio.c.
btutil.c    Contains the rest of the support functions, including the function split() described in the text.

All the programs include the header file called bt.h.

/* bt.h... header file for btree programs */

#define MAXKEYS   4
#define MINKEYS   (MAXKEYS/2)
#define NIL       (-1)
#define NOKEY     '@'
#define NO        0
#define YES       1

typedef struct {
    short keycount;             /* number of keys in page            */
    char  key[MAXKEYS];         /* the actual keys                   */
    short child[MAXKEYS+1];     /* ptrs to rrns of descendants       */
} BTPAGE;

#define PAGESIZE  sizeof(BTPAGE)

extern short root;              /* rrn of root page                  */
extern int   btfd;              /* file descriptor of btree file     */
extern int   infd;              /* file descriptor of input file     */

/* prototypes */
btclose();
btopen();
btread(short rrn, BTPAGE *page_ptr);
btwrite(short rrn, BTPAGE *page_ptr);
create_root(char key, short left, short right);
short create_tree();
short getpage();
short getroot();
insert(short rrn, char key, short *promo_r_child, char *promo_key);
ins_in_page(char key, short r_child, BTPAGE *p_page);
pageinit(BTPAGE *p_page);
putroot(short root);
search_node(char key, BTPAGE *p_page, short *pos);
split(char key, short r_child, BTPAGE *p_oldpage, char *promo_key,
      short *promo_r_child, BTPAGE *p_newpage);

Driver.c

/* driver.c...
   Driver for btree tests:
      Opens or creates b-tree file.
      Gets next key and calls insert to insert key in tree.
      If necessary, creates a new root.
*/

#include <stdio.h>
#include "bt.h"

main()
{
    int   promoted;      /* boolean: tells if a promotion from below */
    short root,          /* rrn of root page                         */
          promo_rrn;     /* rrn promoted from below                  */
    char  promo_key,     /* key promoted from below                  */
          key;           /* next key to insert in tree               */

    if (btopen())                 /* try to open btree.dat and get root */
        root = getroot();
    else                          /* if btree.dat not there, create it  */
        root = create_tree();

    while ((key = getchar()) != 'q') {
        promoted = insert(root, key, &promo_rrn, &promo_key);
        if (promoted)
            root = create_root(promo_key, root, promo_rrn);
    }
    btclose();
}

Insert.c

/* insert.c...
   Contains insert() function to insert a key into a btree.
   Calls itself recursively until bottom of tree is reached.
   Then inserts key in node.  If node is out of room, calls
   split() to split the node and promotes middle key and rrn
   of new node.
*/

#include "bt.h"

/* insert()...
   Arguments:
      rrn:            rrn of page to make insertion in
      key:            key to be inserted here or lower
      promo_r_child:  child promoted up from here to next level
      promo_key:      key promoted up from here to next level
*/
insert(short rrn, char key, short *promo_r_child, char *promo_key)
{
    BTPAGE page,              /* current page                         */
           newpage;           /* new page created if split occurs     */
    int    found, promoted;   /* boolean values                       */
    short  pos,
           p_b_rrn;           /* rrn promoted from below              */
    char   p_b_key;           /* key promoted from below              */

    if (rrn == NIL) {             /* past bottom of tree... "promote"  */
        *promo_key = key;         /* original key so that it will be   */
        *promo_r_child = NIL;     /* inserted at leaf level            */
        return (YES);
    }
    btread(rrn, &page);
    found = search_node(key, &page, &pos);
    if (found) {
        printf("Error: attempt to insert duplicate key: %c \n\007", key);
        return (0);
    }
    promoted = insert(page.child[pos], key, &p_b_rrn, &p_b_key);
    if (!promoted)
        return (NO);                            /* no promotion            */
    if (page.keycount < MAXKEYS) {
        ins_in_page(p_b_key, p_b_rrn, &page);   /* OK to insert key and    */
        btwrite(rrn, &page);                    /* pointer in this page.   */
        return (NO);                            /* no promotion            */
    }
    else {
        split(p_b_key, p_b_rrn, &page, promo_key, promo_r_child, &newpage);
        btwrite(rrn, &page);
        btwrite(*promo_r_child, &newpage);
        return (YES);                           /* promotion               */
    }
}

Btio.c

/* btio.c...
   Contains btree functions that directly involve file i/o:
      btopen()      -- open file "btree.dat" to hold the btree.
      btclose()     -- close "btree.dat"
      getroot()     -- get rrn of root node from first two bytes of btree.dat
      putroot()     -- put rrn of root node in first two bytes of btree.dat
      create_tree() -- create "btree.dat" and root node
      getpage()     -- get next available block in "btree.dat" for a new page
      btread()      -- read page number rrn from "btree.dat"
      btwrite()     -- write page number rrn to "btree.dat"
*/

#include "stdio.h"
#include "bt.h"
#include "fileio.h"

int btfd;                        /* global file descriptor for "btree.dat" */

btopen()
{
    btfd = open("btree.dat", O_RDWR);
    return (btfd > 0);
}

btclose()
{
    close(btfd);
}

short getroot()
{
    short root;
    long  lseek();

    lseek(btfd, 0L, 0);
    if (read(btfd, &root, 2) == 0) {
        printf("Error: Unable to get root. \007\n");
        exit(1);
    }
    return (root);
}

putroot(short root)
{
    lseek(btfd, 0L, 0);
    write(btfd, &root, 2);
}

short create_tree()
{
    char key;

    btfd = creat("btree.dat", PMODE);
    close(btfd);                 /* Have to close and reopen to insure */
    btopen();                    /* read/write access on many systems. */
    key = getchar();             /* Get first key.                     */
    return (create_root(key, NIL, NIL));
}

short getpage()
{
    long lseek(), addr;

    addr = lseek(btfd, 0L, 2) - 2L;
    return ((short) (addr / PAGESIZE));
}

btread(short rrn, BTPAGE *page_ptr)
{
    long lseek(), addr;

    addr = (long) rrn * (long) PAGESIZE + 2L;
    lseek(btfd, addr, 0);
    return (read(btfd, page_ptr, PAGESIZE));
}

btwrite(short rrn, BTPAGE *page_ptr)
{
    long lseek(), addr;

    addr = (long) rrn * (long) PAGESIZE + 2L;
    lseek(btfd, addr, 0);
    return (write(btfd, page_ptr, PAGESIZE));
}


Btutil.c

/* btutil.c...
   Contains utility functions for btree program:
      create_root() -- get and initialize root node and insert one key
      pageinit()    -- put NOKEY in all "key" slots and NIL in "child" slots
      search_node() -- return YES if key in node, else NO.  In either case,
                       put key's correct position in pos.
      ins_in_page() -- insert key and right child in page
      split()       -- split node by creating new node and moving half of
                       keys to new node.  Promote middle key and rrn of
                       new node.
*/

#include "bt.h"

create_root(char key, short left, short right)
{
    BTPAGE page;
    short  rrn;

    rrn = getpage();
    pageinit(&page);
    page.key[0]   = key;
    page.child[0] = left;
    page.child[1] = right;
    page.keycount = 1;
    btwrite(rrn, &page);
    putroot(rrn);
    return (rrn);
}

pageinit(BTPAGE *p_page)         /* p_page: pointer to a page */
{
    int j;

    for (j = 0; j < MAXKEYS; j++) {
        p_page->key[j]   = NOKEY;
        p_page->child[j] = NIL;
    }
    p_page->child[MAXKEYS] = NIL;
}

search_node(char key, BTPAGE *p_page, short *pos)
/* pos: position where key is or should be inserted */
{
    int i;

    for (i = 0; i < p_page->keycount && key > p_page->key[i]; i++)
        ;
    *pos = i;
    if (*pos < p_page->keycount && key == p_page->key[*pos])
        return (YES);            /* key is in page       */
    else
        return (NO);             /* key is not in page   */
}

ins_in_page(char key, short r_child, BTPAGE *p_page)
{
    int i;

    for (i = p_page->keycount; i > 0 && key < p_page->key[i-1]; i--) {
        p_page->key[i]     = p_page->key[i-1];
        p_page->child[i+1] = p_page->child[i];
    }
    p_page->keycount++;
    p_page->key[i]     = key;
    p_page->child[i+1] = r_child;
}

/* split()...
   Arguments:
      key:            key to be inserted
      r_child:        child rrn to be inserted
      p_oldpage:      pointer to old page structure
      promo_key:      key to be promoted up from here
      promo_r_child:  rrn to be promoted up from here
      p_newpage:      pointer to new page structure
*/
split(char key, short r_child, BTPAGE *p_oldpage, char *promo_key,
      short *promo_r_child, BTPAGE *p_newpage)
{
    int   i;
    short mid;                      /* tells where split is to occur            */
    char  workkeys[MAXKEYS+1];      /* temporarily holds keys, before split     */
    short workch[MAXKEYS+2];        /* temporarily holds children, before split */

    for (i = 0; i < MAXKEYS; i++) {        /* move keys and children from  */
        workkeys[i] = p_oldpage->key[i];   /* old page into work arrays    */
        workch[i]   = p_oldpage->child[i];
    }
    workch[i] = p_oldpage->child[i];

    for (i = MAXKEYS; i > 0 && key < workkeys[i-1]; i--) {  /* insert new key */
        workkeys[i] = workkeys[i-1];
        workch[i+1] = workch[i];
    }
    workkeys[i] = key;
    workch[i+1] = r_child;

    *promo_r_child = getpage();     /* create new page for split,       */
    pageinit(p_newpage);            /* and promote rrn of new page      */

    for (i = 0; i < MINKEYS; i++) {                  /* move first half of keys and  */
        p_oldpage->key[i]   = workkeys[i];           /* children to old page, second */
        p_oldpage->child[i] = workch[i];             /* half to new page             */
        p_newpage->key[i]   = workkeys[i+1+MINKEYS];
        p_newpage->child[i] = workch[i+1+MINKEYS];
        p_oldpage->key[i+MINKEYS]     = NOKEY;       /* mark second half of old      */
        p_oldpage->child[i+1+MINKEYS] = NIL;         /* page as empty                */
    }
    p_oldpage->child[MINKEYS] = workch[MINKEYS];
    p_newpage->child[MINKEYS] = workch[i+1+MINKEYS];
    p_newpage->keycount = MAXKEYS - MINKEYS;
    p_oldpage->keycount = MINKEYS;
    *promo_key = workkeys[MINKEYS];                  /* promote middle key           */
}
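
It may help to see how these routines lay the file out. The first two bytes of btree.dat hold the RRN of the root page, and page p begins at byte offset 2 + p * PAGESIZE, as btread() and btwrite() compute. For example, assuming no structure padding (an assumption, since padding is compiler dependent), PAGESIZE with MAXKEYS set to 4 is 2 + 4 + 10 = 16 bytes, so page 3 would start at byte offset 2 + 3 * 16 = 50.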


Pascal Programs to Insert Keys into a B-Tree

The Pascal program that follows implements the insert program described in the text. The only difference between this program and the one in the text is that this program builds a B-tree of order five, whereas the one in the text builds a B-tree of order four. Input characters are taken from standard I/O, with q indicating end of data.
The main program includes three nonstandard compiler directives:

{$B-}
{$I btutil.prc}
{$I insert.prc}

The $B- directive instructs the Turbo Pascal compiler to handle keyboard input as a standard Pascal file. The $I directives instruct the compiler to include the files btutil.prc and insert.prc in the main program. These two files contain functions needed by the main program. So the B-tree program requires the use of functions from three files:

driver.pas    Contains the main program, which closely parallels the driver program described in the text.
insert.prc    Contains insert(), the recursive function that finds the proper place for a key, inserts it, and supervises splitting and promotions.
btutil.prc    Contains all other support functions, including the function split() described in the text.

Driver.pas

PROGRAM btree (INPUT, OUTPUT);
{  Driver for B-tree tests:
      Opens or creates btree file.
      Gets next key and calls insert to insert key in tree.
      If necessary, creates a new root.                                  }

{$B-}

CONST
   MAXKEYS  = 4;          {maximum number of keys in a page}
   MAXCHLD  = 5;          {maximum number of children in page}
   MAXWKEYS = 5;          {maximum number of keys in working space}
   MAXWCHLD = 6;          {maximum number of children in working space}
   NOKEY    = '@';        {symbol to indicate no key}
   NO       = FALSE;
   YES      = TRUE;
   NULL     = -1;

TYPE
   BTPAGE = RECORD
      keycount : integer;                        {number of keys in page      }
      key      : array [1..MAXKEYS] of char;     {the actual keys             }
      child    : array [1..MAXCHLD] of integer;  {ptrs to RRNs of descendents }
   END;

VAR
   promoted  : boolean;          {tells if a promotion from below   }
   root,                         {RRN of root                       }
   promo_rrn : integer;          {RRN promoted from below           }
   promo_key,                    {key promoted from below           }
   key       : char;             {next key to insert in tree        }
   btfd      : file of BTPAGE;   {global file descriptor for        }
                                 {"btree.dat"                       }
   MINKEYS   : integer;          {min. number of keys in a page     }
   PAGESIZE  : integer;          {size of a page                    }

{$I btutil.prc}
{$I insert.prc}

BEGIN {main}
   MINKEYS  := MAXKEYS DIV 2;
   PAGESIZE := sizeof(BTPAGE);
   if btopen then                {try to open btree.dat and get root}
      root := getroot
   else                          {if btree.dat not there, create it }
      root := create_tree;
   read(key);
   WHILE (key <> 'q') DO
   BEGIN
      promoted := insert(root, key, promo_rrn, promo_key);
      if promoted then
         root := create_root(promo_key, root, promo_rrn);
      read(key)
   END;
   btclose
END.

Insert.prc

FUNCTION insert (rrn: integer; key: char; VAR promo_r_child: integer;
                 VAR promo_key: char): boolean;
{  Function to insert a key into a B-tree:
      Calls itself recursively until the bottom of the tree is reached.
      Then inserts the key in the node.
      If node is out of room, then it calls split() to split the node and
      promotes the middle key and RRN of new node.                         }
VAR
   page,
   newpage  : BTPAGE;       {current page / new page created if split occurs}
   found,                   {tells if key is already in B-tree  }
   promoted : boolean;      {tells if key is promoted           }
   pos      : integer;      {position that key is to go in      }
   p_b_rrn  : integer;      {RRN promoted from below            }
   p_b_key  : char;         {key promoted from below            }
BEGIN
   if (rrn = NULL) then               {past bottom of tree... "promote"  }
   BEGIN                              {original key so that it will be   }
      promo_key := key;               {inserted at leaf level            }
      promo_r_child := NULL;
      insert := YES
   END
   else
   BEGIN
      btread(rrn, page);
      found := search_node(key, page, pos);
      if (found) then
      BEGIN
         writeln('Error: attempt to insert duplicate key: ', key);
         insert := NO
      END
      else
      BEGIN
         promoted := insert(page.child[pos], key, p_b_rrn, p_b_key);
         if (NOT promoted) then
            insert := NO                              {no promotion}
         else
         BEGIN
            if (page.keycount < MAXKEYS) then
            BEGIN
               ins_in_page(p_b_key, p_b_rrn, page);   {OK to insert key    }
               btwrite(rrn, page);                    {and pointer in this }
               insert := NO                           {page. no promotion  }
            END
            else
            BEGIN
               split(p_b_key, p_b_rrn, page, promo_key,
                     promo_r_child, newpage);
               btwrite(rrn, page);
               btwrite(promo_r_child, newpage);
               insert := YES                          {promotion}
            END
         END
      END
   END
END;

Btutil.prc

FUNCTION btopen : BOOLEAN;
{Function to open "btree.dat" if it already exists.  Otherwise it returns false}
VAR
   response : char;
BEGIN
   assign(btfd, 'btree.dat');
   write('Does btree.dat already exist? (respond Y or N): ');
   readln(response);
   writeln;
   if (response = 'Y') OR (response = 'y') then
   BEGIN
      reset(btfd);
      btopen := TRUE
   END
   else
      btopen := FALSE
END;

PROCEDURE btclose;
{Procedure to close "btree.dat"}
BEGIN
   close(btfd)
END;

FUNCTION getroot : integer;
{Function to get the RRN of the root node from first record of btree.dat}
VAR
   root : BTPAGE;
BEGIN
   seek(btfd, 0);
   if (not EOF) then
   BEGIN
      read(btfd, root);
      getroot := root.keycount
   END
   else
      writeln('Error: Unable to get root.')
END;

FUNCTION getpage : integer;
{Function that gets the next available block in "btree.dat" for a new page}
BEGIN
   getpage := filesize(btfd)
END;

PROCEDURE pageinit (VAR p_page : BTPAGE);
{puts NOKEY in all "key" slots and NULL in "child" slots}
VAR
   j : integer;
BEGIN
   for j := 1 to MAXKEYS DO
   BEGIN
      p_page.key[j]   := NOKEY;
      p_page.child[j] := NULL
   END;
   p_page.child[MAXKEYS+1] := NULL
END;

PROCEDURE putroot (root: integer);
{Puts RRN of root node in the keycount of the first record of btree.dat}
VAR
   rootrrn : BTPAGE;
BEGIN
   seek(btfd, 0);
   rootrrn.keycount := root;
   pageinit(rootrrn);
   write(btfd, rootrrn)
END;

PROCEDURE btread (rrn : integer; VAR page_ptr : BTPAGE);
{reads page number RRN from btree.dat}
BEGIN
   seek(btfd, rrn);
   read(btfd, page_ptr)
END;

PROCEDURE btwrite (rrn : integer; page_ptr : BTPAGE);
{writes page number RRN to btree.dat}
BEGIN
   seek(btfd, rrn);
   write(btfd, page_ptr)
END;

FUNCTION create_root (key: char; left, right: integer): integer;
{get and initialize root node and insert one key}
VAR
   page : BTPAGE;
   rrn  : integer;
BEGIN
   rrn := getpage;
   pageinit(page);
   page.key[1]   := key;
   page.child[1] := left;
   page.child[2] := right;
   page.keycount := 1;
   btwrite(rrn, page);
   putroot(rrn);
   create_root := rrn
END;

FUNCTION create_tree : integer;
{creates "btree.dat" and the root node}
VAR
   rootrrn : integer;
BEGIN
   rewrite(btfd);
   read(key);                        {Get first key.}
   rootrrn := getpage;
   putroot(rootrrn);
   create_tree := create_root(key, NULL, NULL)
END;

FUNCTION search_node (key: char; VAR p_page: BTPAGE; VAR pos: integer): boolean;
{returns YES if key in node, else NO.  In either case, put key's correct
 position in pos}
VAR
   i : integer;
BEGIN
   i := 1;
   while ((i <= p_page.keycount) AND (key > p_page.key[i])) DO
      i := i + 1;
   pos := i;
   if ((pos <= p_page.keycount) AND (key = p_page.key[pos])) then
      search_node := YES
   else
      search_node := NO
END;

PROCEDURE ins_in_page (key: char; r_child: integer; VAR p_page: BTPAGE);
{insert key and right child in page}
VAR
   i : integer;
BEGIN
   i := p_page.keycount;
   while ((i >= 1) AND (key < p_page.key[i])) DO
   BEGIN
      p_page.key[i+1]   := p_page.key[i];
      p_page.child[i+2] := p_page.child[i+1];
      i := i - 1
   END;
   p_page.keycount   := p_page.keycount + 1;
   p_page.key[i+1]   := key;
   p_page.child[i+2] := r_child
END;

PROCEDURE split (key: char; r_child: integer; VAR p_oldpage: BTPAGE;
                 VAR promo_key: char; VAR promo_r_child: integer;
                 VAR p_newpage: BTPAGE);
{split node by creating new node and moving half of keys to new node.
 Promote middle key and RRN of new node.}
VAR
   i        : integer;
   workkeys : array [1..MAXWKEYS] of char;     {temporarily holds keys,     }
                                               {  before split              }
   workch   : array [1..MAXWCHLD] of integer;  {temporarily holds children, }
                                               {  before split              }
BEGIN
   for i := 1 to MAXKEYS DO                {move keys and children from }
   BEGIN                                   {old page into work arrays   }
      workkeys[i] := p_oldpage.key[i];
      workch[i]   := p_oldpage.child[i]
   END;
   workch[MAXKEYS+1] := p_oldpage.child[MAXKEYS+1];

   i := MAXKEYS;
   while ((i >= 1) AND (key < workkeys[i])) DO    {insert new key}
   BEGIN
      workkeys[i+1] := workkeys[i];
      workch[i+2]   := workch[i+1];
      i := i - 1
   END;
   workkeys[i+1] := key;
   workch[i+2]   := r_child;

   promo_r_child := getpage;              {create new page for split   }
   pageinit(p_newpage);                   {and promote RRN of new page }

   for i := 1 TO MINKEYS DO               {move first half of keys and }
   BEGIN                                  {children to old page,       }
      p_oldpage.key[i]   := workkeys[i];  {second half to new page.    }
      p_oldpage.child[i] := workch[i];
      p_newpage.key[i]   := workkeys[i+1+MINKEYS];
      p_newpage.child[i] := workch[i+1+MINKEYS];
      p_oldpage.key[i+MINKEYS]     := NOKEY;      {mark second half of old }
      p_oldpage.child[i+1+MINKEYS] := NULL        {page as empty           }
   END;
   p_oldpage.child[MINKEYS+1] := workch[MINKEYS+1];
   if odd(MAXKEYS) then
   begin
      p_newpage.key[MINKEYS+1]   := workkeys[MAXWKEYS];
      p_newpage.child[MINKEYS+2] := workch[MAXWCHLD];
      p_newpage.child[MINKEYS+1] := workch[MAXWCHLD-1]
   end
   else
      p_newpage.child[MINKEYS+1] := workch[MAXWCHLD];
   p_newpage.keycount := MAXKEYS - MINKEYS;
   p_oldpage.keycount := MINKEYS;
   promo_key := workkeys[MINKEYS+1]       {promote middle key}
END;

9   The B+ Tree Family and Indexed Sequential File Access

CHAPTER OBJECTIVES
   Introduce indexed sequential files.
   Describe operations on a sequence set of blocks that maintains records in order by key.
   Show how an index set can be built on top of the sequence set to produce an indexed sequential file structure.
   Introduce the use of a B-tree to maintain the index set, thereby introducing B+ trees and simple prefix B+ trees.
   Illustrate how the B-tree index set in a simple prefix B+ tree can be of variable order, holding a variable number of separators.
   Compare the strengths and weaknesses of B+ trees, simple prefix B+ trees, and B-trees.

CHAPTER OUTLINE
9.1  Indexed Sequential Access
9.2  Maintaining a Sequence Set
     9.2.1  The Use of Blocks
     9.2.2  Choice of Block Size
9.3  Adding a Simple Index to the Sequence Set
9.4  The Content of the Index: Separators Instead of Keys
9.5  The Simple Prefix B+ Tree
9.6  Simple Prefix B+ Tree Maintenance
     9.6.1  Changes Localized to Single Blocks in the Sequence Set
     9.6.2  Changes Involving Multiple Blocks in the Sequence Set
9.7  Index Set Block Size
9.8  Internal Structure of Index Set Blocks: A Variable-order B-Tree
9.9  Loading a Simple Prefix B+ Tree
9.10 B+ Trees
9.11 B-Trees, B+ Trees, and Simple Prefix B+ Trees in Perspective

9.1 Indexed Sequential Access

Indexed sequential file structures provide a choice between two alternative views of a file:
   Indexed: The file can be seen as a set of records that is indexed by key; or
   Sequential: The file can be accessed sequentially (physically contiguous records, no seeking), returning records in order by key.

The idea of having a single organizational method that provides both of these views is a new one. Up to this point we have had to choose between them. As a somewhat extreme, though instructive, example of the potential divergence of these two choices, suppose that we have developed a file structure consisting of a set of entry-sequenced records indexed by a separate B-tree. This structure can provide excellent indexed access to any individual record by key, even as records are added and deleted. Now let's suppose that we also want to use this file as part of a cosequential merge. In cosequential processing we want to retrieve all the records in order by key. Since the actual records in this file system are entry sequenced, rather than physically sorted by key, the only way to retrieve them in order by key is through the index. For a file of N records, following the pointers from the index into the entry sequenced set requires N essentially random seeks into the record file. This is a much less efficient process than the sequential reading of physically adjacent records, so much so that it is unacceptable for any situation in which cosequential processing is a frequent occurrence.
On the other hand, our discussions of indexing show us that a file consisting of a set of records sorted by key, though ideal for cosequential processing, is an unacceptable structure when we want to access, insert, and delete records by key in random order.
What if an application involves both interactive random access and cosequential batch processing? There are many examples of such dual-mode applications. Student record systems at universities, for example, require keyed access to individual records while also requiring a large amount of batch processing, as when grades are posted or when fees are paid during registration. Similarly, credit card systems require both batch processing of charge slips and interactive checks of account status. Indexed sequential access methods were developed in response to these kinds of needs.

9.2 Maintaining a Sequence Set

We set aside, for the moment, the indexed part of indexed sequential access, focusing on the problem of keeping a set of records in physical order by key as records are added and deleted. We refer to this ordered set of records as a sequence set. We will assume that once we have a good way of maintaining a sequence set, we will find some way to index it as well.

9.2.1 The Use of Blocks

We can immediately rule out the idea of sorting and resorting the entire sequence set as records are added and deleted, since we know that sorting an entire file is an expensive process. We need instead to find a way to restrict the effects of an insertion or deletion to just part of the sequence set. One of the best ways to localize the effects of the changes involves a tool we first encountered in chapters 3 and 4: We can collect the records into blocks.
When we block records, the block becomes the basic unit of input and output. We read and write entire blocks at once. Consequently, the size of the buffers we use in a program is such that they can hold an entire block. After reading in a block, all the records in a block are in RAM, where we can work on them or rearrange them much more rapidly.
An example helps illustrate how the use of blocks can help us keep a sequence set in order. Suppose we have records that are keyed on last name and collected together so there are four records in a block. We also include link fields in each block that point to the preceding block and the following block. We need these fields because, as you will see, consecutive blocks are not necessarily physically adjacent.
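
A sequence set block of this kind might be declared as follows in C. The record type, the four-record block size, and the field names are illustrative only; they are not the declarations used later in this chapter.

#define RECS_PER_BLOCK 4          /* four records per block, as in the example */

typedef struct {                  /* one record, keyed on last name            */
    char last_name[20];
    char other_fields[44];
} RECORD;

typedef struct {
    short  count;                 /* how many records are currently in use     */
    long   prev_block;            /* RRN of the preceding block in key order   */
    long   next_block;            /* RRN of the following block in key order   */
    RECORD recs[RECS_PER_BLOCK];  /* records stored in ascending key order     */
} SEQSET_BLOCK;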

As with B-trees, the insertion of new records into a block can cause the
block to overflow. The overflow condition can be handled by a blocksplitting process that is analogous to, but not the same as, the blocksplitting process used in a B-tree. For example, Fig. 9.1(a) shows what our
blocked sequence set looks like before any insertions or deletions take place.
We show only the forward links. In Fig. 9.1(b) we have inserted a new
record with the key CARTER. This insertion causes block 2 to split. The

second half of what was originally block 2 is found in block 4 after the split.
Note that this block-splitting process operates differently than the splitting we encountered in B-trees. In a B-tree a split results in the promotion of a record. Here things are simpler: We just divide the records between two blocks and rearrange the links so we can still move through the file in order by key, block after block.
Deletion of records can cause a block to be less than half full and therefore to underflow. Once again, this problem and its solutions are analogous to what we encounter when working with B-trees. Underflow in a B-tree can lead to either of two solutions:
   If a neighboring node is also half full, we can concatenate the two nodes, freeing one up for reuse.
   If the neighboring nodes are more than half full, we can redistribute records between the nodes to make the distribution more nearly even.

Underflow within a block of our sequence set can be handled through the same kinds of processes. As with insertion, the process for the sequence set is simpler than the process for B-trees since the sequence set is not a tree and there are therefore no keys and records in a parent node. In Fig. 9.1(c) we show the effects of deleting the record for DAVIS. Block 4 underflows and is then concatenated with its successor in logical sequence, which is block 3. The concatenation process frees up block 3 for reuse. We do not show an example in which underflow leads to redistribution, rather than concatenation, since it is easy to see how the redistribution process works. Records are simply moved between logically adjacent blocks.

FIGURE 9.1 Block splitting and concatenation due to insertions and deletions in the sequence set. (a) Initial blocked sequence set. (b) Sequence set after insertion of CARTER record: block 2 splits, and the contents are divided between blocks 2 and 4. (c) Sequence set after deletion of DAVIS record: block 4 is less than half full, so it is concatenated with block 3.

Given the separation of records into blocks, along with these fundamental block-splitting, concatenation, and redistribution operations, we can keep a sequence set in order by key without ever having to sort the entire set of records. As always, nothing comes free; consequently, there are costs associated with this avoidance of sorting:
   Once insertions are made, our file takes up more space than an unblocked file of sorted records because of internal fragmentation

within a block. However, we can apply the same kinds of strategies


used to increase space utilization in a B-tree (e.g., the use of redistribution in place of splitting during insertion, two-to-three splitting,

and so on). Once again, the implementation of any of these strategies


must account for the fact that the sequence set is not a tree and that
there is therefore no promotion of records.
   The order of the records is not necessarily physically sequential throughout the file. The maximum guaranteed extent of physical sequentiality is within a block.

This last point leads us to the important question of selecting a block size.

9.2.2 Choice of Block Size

As we work with our sequence set, a block is the basic unit for our I/O operations. When we read data from the disk, we never read less than a block; when we write data, we always write at least one block. A block is also, as we have said, the maximum guaranteed extent of physical sequentiality. It follows that we should think in terms of large blocks, with each block holding many records. So the question of block size becomes one of identifying the limits on block size: Why not make the block size so big we can fit the entire file in a single block?
One answer to this is the same as the reason why we cannot always use a RAM sort on a file: We usually do not have enough RAM available. So our first consideration regarding an upper bound for block size is as follows:

Consideration 1: The block size should be such that we can hold several blocks in RAM at once. For example, in performing a block split or concatenation, we want to be able to hold at least two blocks in RAM at a time. If we are implementing two-to-three splitting to conserve disk space, we need to hold at least three blocks in RAM at a time.

Although we are presently focusing on the ability to access our sequence set sequentially, we eventually want to consider the problem of randomly accessing a single record from our sequence set. We have to read in an entire block to get at any one record within that block. We can therefore state a second consideration:

Consideration 2: Reading in or writing out a block should not take very long. Even if we had an unlimited amount of RAM, we would want to place an upper limit on the block size so we would not end up reading in the entire file just to get at a single record.

This second consideration is a little imprecise: How long is very long? We can refine this consideration by factoring in some of our knowledge of the performance characteristics of disk drives:

Consideration 2 (redefined): The block size should be such that we can access a block without having to bear the cost of a disk seek within the block read or block write operation.
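
To make the arithmetic behind these considerations concrete, suppose (purely for illustration) that a cluster consists of eight 512-byte sectors. A block the size of one cluster then holds 8 x 512 = 4,096 bytes; with 100-byte records and a few bytes of block overhead, that is room for roughly 40 records per block, and holding three such blocks in RAM at once costs only about 12 kilobytes.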

This is not a mandatory limitation, but it is still a sensible one: We are interested in a block because it contains records that are physically adjacent, so let's not extend blocks beyond the point at which we can guarantee such adjacency. And where is that? When we discussed sector formatted disks back in Chapter 3, we introduced the term cluster. A cluster is the minimum number of sectors allocated at a time. If a cluster consists of eight sectors, then a file containing only one byte uses up eight sectors on the disk. The reason for clustering is that it guarantees a minimum amount of physical sequentiality. As we move from cluster to cluster in reading a file, we may incur a disk seek, but within a cluster the data can be accessed without seeking.
One reasonable suggestion for deciding on block size, then, is to make each block equal to the size of a cluster. Often the cluster size on a disk system has already been determined by the system administrator. But what if you are configuring a disk system for a particular application and can therefore choose your own cluster size? Then you need to consider the issues relating to cluster size raised in Chapter 3, along with the constraints imposed by the amount of RAM available and the number of blocks you want to hold in RAM at once. As is so often the case, the final decision will probably be a compromise between a number of divergent considerations.


The important thing is that the compromise be a truly informed decision, based on knowledge of how I/O devices and file structures work, rather than just a guess.
If you are working with a disk system that is not sector oriented, but that allows you to choose the block size for a particular file, a good starting point is to think of a block as an entire track of the disk. You may want to revise this downward, to half a track, for instance, depending on memory constraints, record size, and other factors.

9.3 Adding a Simple Index to the Sequence Set

We have created a mechanism for maintaining a set of records so we can access them sequentially in order by key. It is based on the idea of grouping the records into blocks and then maintaining the blocks, as records are added and deleted, through splitting, concatenation, and redistribution.

FIGURE 9.2 Sequence of blocks showing the range of keys in each block: ADAMS-BERNE, BOLEN-CAGE, CAMP-DUTTON, EMBRY-EVANS, FABER-FOLK, FOLKS-GADDIS.


Now let's see whether we can find an efficient way to locate some specific block containing a particular record, given the record's key.
We can view each of our blocks as containing a range of records, as illustrated in Fig. 9.2. This is an outside view of the blocks (we have not actually read any blocks and so do not know exactly what they contain), but it is sufficiently informative to allow us to choose which block might have the record we are seeking. We can see, for example, that if we are looking for a record with the key BURNS, we want to retrieve and inspect the second block.
It is easy to see how we could construct a simple, single-level index for these blocks. We might choose, for example, to build an index of fixed-length records that contain the key for the last record in each block, as shown in Fig. 9.3.


FIGURE 9.3 Simple index for the sequence set illustrated in Fig. 9.2.
   Key        Block number
   BERNE      1
   CAGE       2
   DUTTON     3
   EVANS      4
   FOLK       5
   GADDIS     6

The combination of this kind of index with the sequence set of blocks provides complete indexed sequential access. If we need to retrieve a specific record, we consult the index and then retrieve the correct block; if we need sequential access we start at the first block and read through the linked list of blocks until we have read them all. As simple as this approach is, it is in fact a very workable one as long as the entire index can be held in electronic memory. The requirement that the index be held in RAM is important for two reasons:
   Since this is a simple index of the kind we discussed in Chapter 6, we can find specific records by means of a binary search of the index. Binary searching works well if the searching takes place in RAM, but, as we saw in the previous chapter on B-trees, it requires too many seeks if the file is on a secondary storage device.
   As the blocks in the sequence set are changed through splitting, concatenation, and redistribution, the index has to be updated. Updating a simple, fixed-length record index of this kind works well if the index is relatively small and contained in RAM. If, however, the updating requires seeking to individual index records on disk, the process can become very expensive. Once again, this is a point we discussed more completely in earlier chapters.
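
The first of these points can be sketched in a few lines of C. The key length, the use of strcmp, and the array-in-RAM representation are assumptions made for the sake of the example, not the structures used later in this chapter.

#include <string.h>

typedef struct {
    char  key[12];       /* key of the last record in the block          */
    short block;         /* number of the corresponding sequence set block */
} INDEX_ENTRY;

/* Return the block that could contain search_key: the first entry whose
 * key is greater than or equal to the search key.  The index (n >= 1
 * entries) is assumed to be sorted and held entirely in RAM.
 */
short find_block(INDEX_ENTRY idx[], int n, char *search_key)
{
    int lo = 0, hi = n - 1, mid;

    while (lo < hi) {
        mid = (lo + hi) / 2;
        if (strcmp(idx[mid].key, search_key) < 0)
            lo = mid + 1;        /* key follows this entry: look right   */
        else
            hi = mid;            /* key is here or to the left           */
    }
    return idx[lo].block;        /* block to read and inspect            */
}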

What do we

do, then, if the

index does not conveniently

found

we

blocks

in

that

we

contains so

into

RAM?

many

blocks that the block

In the preceding chapter

could divide the index structure into pages,

much

we

like the

are discussing here, handling several pages, or blocks, of the index

RAM at a time.

file

file

fit

More

specifically,

we found

that B-trees are an excellent

structure for handling indexes that are too large to

This suggests that

we might

fit

entirely in

RAM.

organize the index to our sequence set as a

B-tree.

The use of

B-tree index for our sequence set of blocks

very powerful notion. The resulting hybrid structure

which

is

appropriate since

is,

known

it is

we need

to

The purpose of

the index

we

The index

Keys

is to assist us when we
The index must guide us to

are building

searching for a record with a specific key.

all.

tree,

keep in the index.

of the Index: Separators Instead of

set at

B+

The Content

block in the sequence

in fact, a

as a

B-tree index plus a sequence set that holds


+
can fully develop the notion of a B tree, we

it is

the actual records. Before we


need to think more carefully about what

9.4

is

set that contains the record, if it exists in the

serves as a kind of roadmap for the sequence

interested in the content of the index only insofar as

it

can

assist

are

the

sequence

set.

We are

us in getting

to the correct block in the sequence set; the index set does not itself contain

answers,

it

contains only information about where to go to get answers.

Given this view of the index set as a roadmap, we can take the very
important step of recognizing that we do not need to have actual keys in the
index set. Our real need is for separators. Figure 9.4 shows one possible set of
separators for the sequence set in Fig. 9.2.

Note

that there are

many

potential separators capable of distinguishing

between two blocks. For example, all of the strings shown between blocks
3 and 4 in Fig. 9.5 are capable of guiding us in our choice between the blocks
as we search for a particular key. If a string comparison between the key and

FIGURE 9.4 Separators between blocks in the sequence set.
[The figure shows the sequence set blocks ADAMS-BERNE, BOLEN-CAGE, CAMP-DUTTON, EMBRY-EVANS, FABER-FOLK, and FOLKS-GADDIS with the separators BO, CAM, E, F, and FOLKS between them.]

any of these separators shows that the key precedes the separator, we look
for the key in block 3. If the key follows the separator, we look in block 4.
If we are willing to treat the separators as variable-length entities within our index structure (we talk about how to do this later), we can save space by placing the shortest separator, E, as the separator to guide our choice between blocks 3 and 4. Note that there is not always a unique shortest separator. For example, BK, BN, and BO are separators that are all the same length and that are equally effective as separators between blocks 1 and 2 in Fig. 9.4. We choose BO and all of the other separators contained in Fig. 9.4 by using the logic embodied in the C function shown in Fig. 9.6 and in the Pascal procedure listed in Fig. 9.7. Note that these functions can produce a separator that is the same as the second key. This situation is illustrated in Fig. 9.4 by the separator between blocks 5 and 6, which is the same as the first key contained in block 6. It follows that, as we use the separators as a roadmap to the sequence set, we must decide whether to retrieve the block that is to the right of the separator or the one that is to the left of the separator according to the following rule:

Relation of Search Key and Separator      Decision
Key < separator                           Go left
Key = separator                           Go right
Key > separator                           Go right

FIGURE 9.5 A list of potential separators.
[The potential separators shown between blocks 3 (CAMP-DUTTON) and 4 (EMBRY-EVANS) include DUTU, DVXGHESJF, DZ, E, EBQX, and ELEEMOSYNARY.]


/* find_sep(key1, key2, sep) ...
   finds the shortest string that serves as a separator between key1 and
   key2.  Returns this separator through the address provided by
   the "sep" parameter.
   The function assumes that key2 follows key1 in collating sequence.
*/

find_sep(key1, key2, sep)
char key1[], key2[], sep[];
{
    while ((*sep++ = *key2++) == *key1++)
        ;
    *sep = '\0';    /* ensure that separator string is null terminated */
}

FIGURE 9.6 C function to find a shortest separator.
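The following short driver is not from the text; it is a minimal sketch showing how find_sep, as listed in Fig. 9.6, produces the separators of Fig. 9.4 when it is given the keys on either side of each block boundary from Fig. 9.2.

#include <stdio.h>

int find_sep();                         /* the function listed in Fig. 9.6 */

int main()
{
    char sep[20];

    find_sep("BERNE", "BOLEN", sep);    /* boundary of blocks 1 and 2 */
    printf("%s\n", sep);                /* prints BO                  */

    find_sep("CAGE", "CAMP", sep);      /* boundary of blocks 2 and 3 */
    printf("%s\n", sep);                /* prints CAM                 */

    find_sep("FOLK", "FOLKS", sep);     /* boundary of blocks 5 and 6 */
    printf("%s\n", sep);                /* prints FOLKS               */

    return 0;
}

Each call copies the common prefix of the two keys and one character beyond it, which is exactly the shortest simple prefix that distinguishes the two blocks.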

FIGURE 9.7 Pascal procedure to find a shortest separator.

PROCEDURE find_sep (key1, key2 : strng; VAR sep : strng);
{  finds the shortest string that serves as a separator between key1 and
   key2.  Returns the separator through the variable sep.  Strings are
   handled as character arrays in which the length of the string is stored
   in the 0th position of the array.  The type "strng" is used for strings.
   Assumes that key2 follows key1 in collating sequence.
   Uses two functions defined in the Appendix:
       len_str(s)   returns the length of the string s.
       min(i,j)     compares i and j and returns the smallest value        }
VAR
    i, minlgth : integer;
BEGIN
    minlgth := min(len_str(key1), len_str(key2));
    i := 1;
    while (key1[i] = key2[i]) and (i <= minlgth) DO
    BEGIN
        sep[i] := key2[i];
        i := i + 1
    END;
    sep[i] := key2[i];
    sep[0] := CHR(i)          { set length indicator in separator array }
END;


FIGURE 9.8 A B-tree index set for the sequence set, forming a simple prefix B+ tree.

9.5 The Simple Prefix B+ Tree

Figure 9.8 shows how we can form the separators identified in Fig. 9.4 into a B-tree index of the sequence set blocks. The B-tree index is called the index set. Taken together with the sequence set, it forms a file structure called a simple prefix B+ tree. The modifier simple prefix indicates that the index set contains shortest separators, or prefixes of the keys rather than copies of the actual keys. Our separators are simple because they are, simply, prefixes: They are actually just the initial letters within the keys. More complicated (not simple) methods of creating separators from key prefixes remove unnecessary characters from the front of the separator as well as from the rear. (See Bayer and Unterauer, 1977, for a more complete discussion of prefix B+ trees.)1

Note that since the index set is a B-tree, a node containing N separators branches to N + 1 children. If we are searching for the record with the key EMBRY, we start at the root of the index set, comparing EMBRY to the separator E. Since EMBRY comes after E, we branch to the right, retrieving the node containing the separators F and FOLKS. Since EMBRY comes before even the first of these separators, we follow the branch that is to the left of the F separator, which leads us to block 4, the correct block in the sequence set.

1The literature on B+ trees and simple prefix B+ trees is remarkably inconsistent in the nomenclature used for these structures. B+ trees are sometimes called B* trees; simple prefix B+ trees are sometimes called simple prefix B-trees. Comer's important article in Computing Surveys in 1979 has reduced some of the confusion by providing a consistent, standard nomenclature, which we use here.

9.6 Simple Prefix B+ Tree Maintenance

9.6.1 Changes Localized to Single Blocks in the Sequence Set

Let's suppose that we want to delete the records for EMBRY and FOLKS, and let's suppose that neither of these deletions results in any concatenation or redistribution within the sequence set. Since there is no concatenation or redistribution, the effect of these deletions on the sequence set is limited to changes within blocks 4 and 6. The record that was formerly the second record in block 4 (let's say that its key is ERVIN) is now the first record. Similarly, the former second record in block 6 (we assume it has a key of FROST) now starts that block. These changes can be seen in Fig. 9.9.

The more interesting question is what effect, if any, these deletions have on the index set. The answer is that since the number of sequence set blocks is unchanged, and since no records are moved between blocks, the index set can also remain unchanged. This is easy to see in the case of the EMBRY deletion: E is still a perfectly good separator for sequence set blocks 3 and 4, so there is no reason to change it in the index set.

FIGURE 9.9 The deletion of the EMBRY and FOLKS records from the sequence set leaves the index set unchanged.

The case of the FOLKS deletion is a little more confusing since the string FOLKS appears both as a key in the deleted record and as a separator within the index set. To avoid confusion, remember to distinguish clearly between these two uses of the string FOLKS: FOLKS can continue to serve as a separator between blocks 5 and 6 even though the FOLKS record is deleted. (One could argue that although we do not need to replace the FOLKS separator, we should do so anyway because it is now possible to construct a shorter separator. However, the cost of making such a change in the index set usually outweighs the benefits associated with saving a few bytes of space.)

The effect of inserting new records into the sequence set that do not cause block splitting is much the same as the effect of these deletions that do not result in concatenation: The index set remains unchanged. Suppose, for example, that we insert a record for EATON. Following the path indicated by the separators in the index set, we find that we will insert the new record into block 4 of the sequence set. We assume, for the moment, that there is room for the record in the block. The new record becomes the first record in block 4, but no change in the index set is necessary. This is not surprising since we decided to insert the record into block 4 on the basis of the existing information in the index set. It follows that the existing information in the index set is sufficient to allow us to find the record again.

9.6.2 Changes Involving Multiple Blocks in the Sequence Set

What happens when the addition and deletion of records to and from the sequence set does change the number of blocks in the sequence set? Clearly, if we have more blocks, we need additional separators in the index set, and if we have fewer blocks, we need fewer separators. Changing the number of separators certainly has an effect on the index set, where the separators are stored.

Since the index set for a simple prefix B+ tree is actually just a normal B-tree, the changes to the index set are handled according to the familiar rules for B-tree insertion and deletion.1 In the following examples, we assume that the index set is a B-tree of order three, which means that the maximum number of separators we can store in a node is two. We use this small node size for the index set to illustrate node splitting and concatenation while using only a few separators. As you will see later, actual implementations of simple prefix B+ trees place a much larger number of separators in a node of the index set.

1As you study the material here, you may find it helpful to refer back to Chapter 8, where we discuss B-tree operations in much more detail.

Let's begin with an insertion into the sequence set shown in Fig. 9.9. Specifically, let's assume that there is an insertion into the first block, and that this insertion causes the block to split. A new block (block 7) is brought in to hold the second half of what was originally the first block. This new block is linked into the correct position in the sequence set, following block 1 and preceding block 2 (these are the physical block numbers). These changes to the sequence set are illustrated in Fig. 9.10.
Note that the separator that formerly distinguished between blocks 1
and 2, the string BO, is now the separator for blocks 7 and 2. We need a
new separator, with a value of AY, to distinguish between blocks 1 and 7.
As we go to place this separator into the index set, we find that the node into
which we want to insert it, containing BO and CAM, is already full.
Consequently, insertion of the new separator causes a split and promotion,
according to the usual rules for B-trees. The promoted separator, BO, is placed in the root of the index set.

FIGURE 9.10 An insertion into block 1 causes a split and the consequent addition of block 7. The addition of a block in the sequence set requires a new separator in the index set. Insertion of the AY separator into the node containing BO and CAM causes a node split in the index set B-tree and consequent promotion of BO to the root.

Now let's suppose we delete a record from block 2 of the sequence set that causes an underflow condition and consequent concatenation of blocks 2 and 3. Once the concatenation is complete, block 3 is no longer needed in the sequence set, and the separator that once distinguished between blocks 2 and 3 must be removed from the index set. Removing this separator, CAM, causes an underflow in an index set node. Consequently, there is


FIGURE 9.11 A deletion from block 2 causes underflow and the consequent concatenation of
blocks 2 and 3. After the concatenation, block 3 is no longer needed and can be placed on
an avail list. Consequently, the separator CAM is no longer needed. Removing CAM from its
node in the index set forces a concatenation of index set nodes, bringing BO back down from
the root.

another concatenation, this time in the index set, that results in the demotion of the BO separator from the root, bringing it back down into a node with the AY separator. Once these changes are complete, the simple prefix B+ tree has the structure illustrated in Fig. 9.11.

Although in these examples a block split in the sequence set results in a node split in the index set, and a concatenation in the sequence set results in a concatenation in the index set, there is not always this correspondence of action. Insertions and deletions in the index set are handled as standard B-tree operations; whether there is splitting or a simple insertion, concatenation or a simple deletion, depends entirely on how full the index set node is.
Writing procedures to handle these kinds of operations is a straightforward task if you remember that the changes take place from the bottom up. Record insertion and deletion always take place in the sequence set, since that is where the records are. If splitting, concatenation, or redistribution is necessary, perform the operation just as you would if there were no index set at all. Then, after the record operations in the sequence set are complete, make changes as necessary in the index set:

If blocks are split in the sequence set, a new separator must be inserted into the index set;

If blocks are concatenated in the sequence set, a separator must be removed from the index set; and

If records are redistributed between blocks in the sequence set, the value of a separator in the index set must be changed.

Index set operations are performed according to the rules for B-trees.
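As a concrete illustration of this bottom-up discipline, the sketch below is not from the text. All of the helper routines named in it are hypothetical interfaces invented for the illustration; only the order in which the sequence set and the index set are changed during an insertion is the point. find_sep is the function of Fig. 9.6.

#include <string.h>

#define MAXSEP 30                    /* assumed maximum separator length */

int   index_search();        /* returns the block the key belongs in     */
int   block_has_room();
int   split_block();         /* splits a sequence set block, returns the
                                number of the new block                   */
char *last_key();
char *first_key();
void  index_insert_separator();  /* ordinary B-tree insertion            */
void  insert_into_block();
int   find_sep();                /* Fig. 9.6                              */

void bplus_insert(char *key, char *record)
{
    int blk = index_search(key);          /* the index set guides us down */

    if (!block_has_room(blk, key, record)) {
        int  newblk = split_block(blk);   /* sequence set change first    */
        char sep[MAXSEP];

        find_sep(last_key(blk), first_key(newblk), sep);

        /* only after the sequence set change do we touch the index set;
           this insertion follows normal B-tree rules and may itself split
           index nodes, promoting separators toward the root              */
        index_insert_separator(sep, newblk);

        /* separator rule: key < separator goes left, otherwise right     */
        if (strcmp(key, sep) >= 0)
            blk = newblk;
    }
    insert_into_block(blk, key, record);
}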

This means that node splitting and concatenation propagate up through the higher levels of the index set. We see this in our examples as the BO separator moves in and out of the root. Note that the operations on the sequence set do not involve this kind of propagation. That is because the sequence set is a linear, linked list, whereas the index set is a tree. It is easy to lose sight of this distinction and think of an insertion or deletion in terms of a single operation on the entire simple prefix B+ tree. This is a good way to become confused. Remember: Insertions and deletions happen in the sequence set since that is where the records are. Changes to the index set are secondary; they are a byproduct of the fundamental operations on the sequence set.

9.7 Index Set Block Size

Up to this point we have ignored the important issues of the size and structure of the index set nodes. Our examples have used extremely small index set nodes and have treated them as fixed-order B-tree nodes, even though the separators are variable in length. We need to develop more realistic, useful ideas about the size and structure of index set nodes.

The physical size of a node for the index set is usually the same as the physical size of a block in the sequence set. When this is the case, we speak of index set blocks, rather than nodes, just as we speak of sequence set blocks. There are a number of reasons for using a common block size for the index and sequence sets:

The block size for the sequence set is usually chosen because there is a good fit between this block size, the characteristics of the disk drive, and the amount of memory available. The choice of an index set block size is governed by consideration of the same factors; therefore, the block size that is best for the sequence set is usually best for the index set.

A common block size makes it easier to implement a buffering scheme to create a virtual simple prefix B+ tree, similar to the virtual B-trees discussed in the preceding chapter.

The index set blocks and sequence set blocks are often mingled within the same file to avoid seeking between two separate files while accessing the simple prefix B+ tree. Use of one file for both kinds of blocks is simpler if the block sizes are the same.

9.8 Internal Structure of Index Set Blocks: A Variable-Order B-Tree
Given a large, fixed-size block for the index set, how do we store the separators within it? In the examples considered so far, the block structure is such that it can contain only a fixed number of separators. The entire motivation behind the use of shortest separators is the possibility of packing more of them into a node. This motivation disappears completely if the index set uses a fixed-order B-tree in which there is a fixed number of separators per node.

We want each index set block to hold a variable number of variable-length separators. How should we go about searching through these separators? Since the blocks are probably large, any single block can hold a large number of separators. Once we read a block into RAM for use, we want to be able to do a binary rather than sequential search on its list of separators. We therefore need to structure the block so it can support a binary search, despite the fact that the separators are of variable length.

In Chapter 6, which covers indexing, we see that the use of a separate index can provide a means of performing binary searches on a list of variable-length entities. If the index itself consists of fixed-length references, we can use binary searching on the index, retrieving the actual variable-length records or fields through indirection. For example, suppose we are going to place the following set of separators into an index block:

As, Ba, Bro, C, Ch, Cra, Dele, Edi, Err, Fa, Fle.

(We are using lowercase letters, rather than all uppercase letters, so you can find the separators more easily when we concatenate them.) We could concatenate these separators and build an index for them, as shown in Fig. 9.12.
FIGURE 9.12 Variable-length separators and corresponding index.
Concatenated separators:  AsBaBroCChCraDeleEdiErrFaFle
Index to separators:      00 02 04 07 08 10 13 17 20 23 25

If we are using this block of the index set as a roadmap to help us find the record in the sequence set for "Beck", we perform a binary search on the index to the separators, retrieving first the middle separator, "Cra", which starts in position 10. Note that by looking at the starting position of the separator that follows, we can find the length of this separator. Our binary search eventually tells us that "Beck" falls between the separators "Ba" and "Bro". Then what do we do?

The purpose of the index set roadmap is to guide us downward through the levels of the simple prefix B+ tree, leading us to the sequence set block

we want to retrieve. Consequently, the index set block needs some way to store references to its children, to the blocks descending from it in the next lower level of the tree. We assume that the references are made in terms of a relative block number (RBN), which is analogous to a relative record number except that it references a fixed-length block rather than a record. If there are N separators within a block, the block has N + 1 children, and therefore needs space to store N + 1 RBNs in addition to the separators and the index to the separators.

There are many ways to combine the list of separators, index to separators, and list of RBNs into a single index set block. One possible approach is illustrated in Fig. 9.13. In addition to the vector of separators, the index to these separators, and the list of associated block numbers, this block structure includes:


Separator count: We need this to help us find the middle element in the index to the separators so we can begin our binary search.

Total length of separators: The list of concatenated separators varies in length from block to block. Since the index to the separators begins at the end of this variable-length list, we need to know how long the list is so we can find the beginning of our index.

FIGURE 9.13 Structure of an index set block.
[The block holds a separator count, the total length of the separators, the concatenated separators (AsBaBroCChCraDeleEdiErrFaFle), the index to the separators (00 02 04 07 08 10 13 17 20 23 25), and the relative block numbers B00 through B11.]

Let's suppose, once again, that we are looking for a record with the key "Beck" and that the search has brought us to the index set block pictured in Fig. 9.13. The total length of the separators and the separator count allows us to find the beginning, the end, and consequently the middle of the index to the separators.

As in the preceding example, we perform a binary search of the separators through this index, finally concluding that the key "Beck" falls between the separators "Ba" and "Bro". Conceptually, the relation between the keys and the RBNs is as illustrated in Fig. 9.14. (Why isn't this a good physical arrangement?)

FIGURE 9.14 Conceptual relationship of separators and relative block numbers.
[The figure places each separator between the relative block numbers of the children on either side of it: B00 As B01 Ba B02 Bro B03 C B04 Ch B05 Cra B06 Dele B07 Edi B08 Err B09 Fa B10 Fle B11.]
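To make the search procedure concrete, the sketch below is not from the text; it assumes a simple in-RAM representation of the index set block of Fig. 9.13 (the field names and types are assumptions) and applies the binary search and the separator decision rule to choose the RBN of the child to visit next.

#include <string.h>

struct index_block {
    int    sep_count;      /* number of separators                      */
    int    sep_total_len;  /* total length of concatenated separators   */
    char  *separators;     /* e.g. "AsBaBroCChCraDeleEdiErrFaFle"       */
    short *sep_index;      /* e.g. {0,2,4,7,8,10,13,17,20,23,25}        */
    short *rbn;            /* sep_count + 1 relative block numbers      */
};

/* compare a search key with the i-th separator in the block */
static int cmp_sep(struct index_block *b, const char *key, int i)
{
    int start = b->sep_index[i];
    int len   = (i + 1 < b->sep_count ? b->sep_index[i + 1]
                                      : b->sep_total_len) - start;
    int r = strncmp(key, b->separators + start, len);
    if (r == 0 && (int)strlen(key) > len)
        r = 1;                       /* key is longer than the separator */
    return r;
}

/* rule: key < separator -> go left of it; key >= separator -> go right */
short choose_child(struct index_block *b, const char *key)
{
    int lo = 0, hi = b->sep_count - 1, child = 0;

    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (cmp_sep(b, key, mid) < 0)
            hi = mid - 1;            /* left of this separator           */
        else {
            child = mid + 1;         /* at least right of this separator */
            lo = mid + 1;
        }
    }
    return b->rbn[child];
}

For the key "Beck" this search ends between "Ba" and "Bro" and returns the RBN stored in the B02 position, in agreement with the discussion of Fig. 9.14 below.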


As Fig. 9.14 makes clear, discovering that the key falls between "Ba"
and "Bro" allows us to decide that the next block we need to retrieve has the
RBN stored in the B02 position of the RBN vector. This next block could

be another index set block, and thus another block of the roadmap, or it could be the sequence set block that we are looking for. In either case, the quantity and arrangement of information in the current index set block is sufficient to let us conduct our binary search within the index block and then proceed to the next block in the simple prefix B+ tree.

There are many alternate ways to arrange the fundamental components of this index block. (For example, would it be easier to build the block if the vector of keys were placed at the end of the block? How would you handle

the fact that the block consists of both character and integer entities with no constant, fixed dividing point between them?) For our purposes here, the specific implementation details for this particular index block structure are not nearly as important as the block's conceptual structure. This kind of index block structure illustrates two important points.

The first point is that a block is not just an arbitrary chunk cut out of a homogeneous file; it can be more than just a set of records. A block can

have a sophisticated internal structure all its own, including its own internal index, a collection of variable-length records, separate sets of fixed-length records, and so forth. This idea of building more sophisticated data structures inside of each block becomes increasingly attractive as the block size increases. With very large blocks it becomes imperative that we have an efficient way of processing all of the data within a block once it has been read into RAM. This point applies not only to simple prefix B+ trees, but to any file structure using a large block size.

The second point is that a node within the B-tree index set of our simple prefix B+ tree is of variable order, since each index set block contains a variable number of separators. This variability has interesting implications:

The number of separators in a block is directly limited by block size rather than by some predetermined order (as in an order M B-tree). The index set will have the maximum order, and therefore the minimum depth, that is possible given the degree of compression used to form the separators.

Since the tree is of variable order, operations such as determining when a block is full, or half full, are no longer a simple matter of comparing a separator count against some fixed maximum or minimum. Decisions about when to split, concatenate, or redistribute become more complicated.

The exercises at the end of this chapter provide opportunities for exploring variable-order trees more thoroughly.
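One way to see what "full" means in a variable-order block is to count bytes rather than separators. The fragment below is not from the text; the field names and sizes are assumptions chosen only for illustration, and the point is simply that the split decision depends on the space the separators actually occupy.

/* Hypothetical in-RAM header of an index set block (not from the text). */
struct index_block_hdr {
    short sep_count;        /* number of separators in the block          */
    short sep_total_len;    /* total length of the concatenated separators */
};

/* Would the block overflow if one more separator of length newlen were
   added?  Space charged: header, separator characters, one index entry
   (a short) per separator, and one RBN (a short) per child.             */
int overfull_after_insert(struct index_block_hdr *b, int newlen, int blocksize)
{
    int seps  = b->sep_count + 1;
    int bytes = (int)(2 * sizeof(short))          /* header               */
              + b->sep_total_len + newlen         /* separators           */
              + (int)(seps * sizeof(short))       /* index to separators  */
              + (int)((seps + 1) * sizeof(short)); /* RBNs for children   */
    return bytes > blocksize;
}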

9.9 Loading a Simple Prefix B+ Tree

In the previous description of the simple prefix B+ tree, we focus first on the sequence set, and subsequently present the index set as something added or built on top of the sequence set. It is not only possible to conceive of simple prefix B+ trees this way, as a sequence set with an added index, but one can also build them this way.
One way of building a simple prefix B+ tree, of course, is through a series of successive insertions. We would use the procedures outlined in section 9.6, where we discuss the maintenance of simple prefix B+ trees, to split or redistribute blocks in the sequence set and in the index set as we added blocks to the sequence set. The difficulty with this approach is that splitting and redistribution are relatively expensive. They involve searching down through the tree for each insertion and then reorganizing the tree as necessary on the way back up. These operations are fine for tree maintenance as the tree is updated, but when we are loading the tree we do not have to contend with a random-order insertion and therefore do not need procedures that are so powerful, flexible, and expensive. Instead, we can begin by

sorting the records that are to be loaded. Then we can guarantee that the next record we encounter is the next record we need to load. Working from the sorted file, we can place the records into sequence set blocks, one by one, starting a new block when the one we are working with fills up. As we make the transition between two sequence set blocks, we can


FIGURE 9.15 Formation of the first index set block as the sequence set is loaded.
[Next separator: CAT. Next sequence set block: CATCH-CHECK.]

determine the shortest separator for the blocks. We can collect these separators into an index set block that we build and hold in RAM until it is full.
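The loading loop itself can be summarized in a few lines. The sketch below is not from the text; all of the helper routines named in it are hypothetical, find_sep is the function of Fig. 9.6, and the sketch shows only the order of events as the sorted records stream in.

#include <string.h>

#define MAXKEY 30
#define MAXREC 256

int  read_next_sorted_record();   /* returns 0 at end of the sorted file */
int  new_sequence_block();
int  block_has_room();
void write_block();
void insert_into_block();
void add_separator_to_index();    /* builds index blocks in RAM, writing
                                     one out and promoting a separator
                                     whenever it fills                   */
int  find_sep();                  /* Fig. 9.6                            */

void load_sequence_set()
{
    char key[MAXKEY], record[MAXREC], prevkey[MAXKEY], sep[MAXKEY];
    int  blk = new_sequence_block();

    while (read_next_sorted_record(key, record)) {
        if (!block_has_room(blk, key, record)) {
            write_block(blk);                 /* finished block goes out  */
            blk = new_sequence_block();

            find_sep(prevkey, key, sep);      /* shortest separator for
                                                 the boundary just crossed */
            add_separator_to_index(sep, blk);
        }
        insert_into_block(blk, key, record);
        strcpy(prevkey, key);
    }
    write_block(blk);                         /* last, possibly partial   */
}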

To develop an example of how this works, let's assume that we have sets of records associated with terms that are being compiled for a book index. The records might consist of a list of the occurrences of each term. In Fig. 9.15 we show four sequence set blocks that have been written out to the disk and one index set block that has been built in RAM from the shortest separators derived from the sequence set block keys. As you can

FIGURE 9.16 Simultaneous building of two index set levels as the sequence set continues to grow.
[The CAT separator has been promoted to a new root-level index block; the lower-level index block it points to contains no separators yet.]

see, the next sequence set block consists of a set of terms ranging from CATCH through CHECK, and therefore the next separator is CAT. Let's

suppose that the index set block is now full. We write it out to disk. Now
what do we do with the separator CAT?
Clearly, we need to start a new index block. But we cannot place CAT
into another index block at the same level as the one containing the
separators ALW, ASP, and BET since we cannot have two blocks at the
same level without having a parent block. Instead, we promote the CAT
separator to a higher-level block.

However, the higher-level block cannot point directly to the sequence set; it must point to the lower-level index blocks. This means that we will now be building two levels of the index set in RAM as we build the sequence set. Figure 9.16 illustrates this

working-on-two-levels phenomenon: The addition of the CAT separator


requires us to start a new, root-level index block as well as a lower-level
index block. (Actually, we are working on three levels at once since we are
also constructing the sequence set blocks in RAM.) Figure 9.17 shows what the index looks like after even more sequence set blocks are added. As you can see, the lower-level index block that contained no separators when we added CAT to the root has now filled up. To establish that the tree works, do a search for the term CATCH. Then search for the two terms CASUAL and CATALOG. How can you tell that these terms are not in the sequence set?
It is instructive to ask what would happen if the last record were CHECK, so the construction of the sequence sets and index sets would stop with the configuration shown in Fig. 9.16. The resulting simple prefix B+
tree would contain an index set node that holds no separators. This is not an
isolated, one-time possibility. If we use this sequential loading method to
build the tree, there will be many points during the loading process at which
there is an empty or nearly empty index set node. If the index set grows to
more than two levels, this empty node problem can occur at even higher
levels of the tree, creating a potentially severe out-of-balance problem.
Clearly, these empty node and nearly empty node conditions violate the
B-tree rules that apply to the index set. However, once a tree is loaded and
goes into regular use, the very fact that a node is violating B-tree conditions
can be used to guarantee that the node will be corrected through the action
of normal B-tree maintenance operations. It is easy to write the procedures
for insertion and deletion so a redistribution procedure is invoked when an
underfull node is encountered.

The advantages of loading a simple prefix B+ tree in this way, as a sequential operation following a sort of the records, almost always outweigh the disadvantages associated with the possibility of creating


FIGURE 9.17 Continued growth of index set built up from the sequence set.
[The figure shows additional sequence set blocks, beginning with ACCESS-ALSO, beneath the index block containing the separators ALW, ASP, and BET and the root block containing CAT.]

blocks that contain too few records or too few separators. The principal advantage is that the loading process goes more quickly since:

The output can be written sequentially;

We make only one pass over the data, rather than the many passes associated with random order insertions; and

No blocks need to be reorganized as we proceed.


There are two additional advantages to using a separate loading process such as the one we have described. These advantages are related to performance after the tree is loaded rather than performance during loading:

Random insertion produces blocks that are, on the average, between 67% and 80% full. In the preceding chapter, as we discussed B-trees, we increased this storage utilization by mechanisms such as using redistribution during insertion rather than using just block splitting. But, still, we never had the option of filling the blocks completely so we had 100% utilization. The sequential loading process changes this. If we want, we can load the tree so it starts out with 100% utilization. This is an attractive option if we do not expect to add very many records to the tree. On the other hand, if we do anticipate many insertions, sequential loading allows us to select any other degree of utilization that we want. Sequential loading gives us much more control over the amount and placement of empty space in the newly loaded tree.

In the loading example presented in Fig. 9.16, we write out the first four sequence set blocks, then write out the index set block containing the separators for these sequence set blocks. If we use the same file for both sequence set and index set blocks, this process guarantees that an index set block starts out in physical proximity to the sequence set blocks that are its descendents. In other words, our sequential loading process is creating a degree of spatial locality within our file. This locality can minimize seeking as we search down through the tree.

9.10 B+ Trees

Our discussions up to this point have focused primarily on simple prefix B+ trees. These structures are actually a variant of an approach to file organization known simply as a B+ tree. The difference between a simple prefix B+ tree and a plain B+ tree is that the latter structure does not involve the use of prefixes as separators. Instead, the separators in the index set are simply copies of the actual keys. Contrast the index set block shown in Fig. 9.18, which illustrates the initial loading steps for a B+ tree, with the index block that is illustrated in Fig. 9.15, where we are building a simple prefix B+ tree.

The operations performed on B+ trees are essentially the same as those discussed for simple prefix B+ trees. Both B+ trees and simple prefix B+ trees consist of a set of records arranged in key order in a sequence set,


FIGURE 9.18 Formation of the first index set block in a B+ tree without the use of shortest separators.
[Next separator: CATCH. Next sequence set block: CATCH-CHECK.]

coupled with an index set that provides rapid access to the block containing
any particular key/record combination. The only difference is that in the
simple prefix B+ tree we build an index set of shortest separators formed
from key prefixes.
One of the reasons behind our decision to focus first on simple prefix
B+ trees, rather than on the more general notion of a B+ tree, is that we
want to distinguish between the role of the separators in the index set and
keys in the sequence set. It is much more difficult to make this distinction when the separators are exact copies of the keys. By beginning with simple prefix B+ trees, we have the pedagogical advantage of working with separators that are clearly different than the keys in the sequence set.

But another reason for starting with simple prefix B+ trees revolves

around the fact that they are quite often a more desirable alternative than the plain B+ tree. We want the index set to be as shallow as possible, which implies that we want to place as many separators into an index set block as we can. Why use anything longer than the simple prefix in the index set? In general, the answer to this question is that we do not, in fact, want to use anything longer than a simple prefix as a separator; consequently, simple prefix B+ trees are often a good solution. There are, however, at least two factors that might argue in favor of using a B+ tree that uses full copies of keys as separators:

The reason for using shortest separators is to pack more of them into an index set block. As we have already said, this implies, ineluctably, the use of variable-length fields within the index set blocks. For some applications the cost of the extra overhead required to maintain and use this variable-length structure outweighs the benefits of shorter separators. In these cases one might choose to build a straightforward B+ tree using fixed-length copies of the keys from the sequence set as separators.

Some key sets do not show much compression when the simple prefix method is used to produce separators. For example, suppose the keys consist of large, consecutive alphanumeric sequences such as 34C18K756, 34C18K757, 34C18K758, and so on. In this case, to enjoy appreciable compression, we need to use compression techniques that remove redundancy from the front of the key. Bayer and Unterauer (1977) describe such compression methods. Unfortunately, they are more expensive and complicated than simple prefix compression. If we calculate that tree height remains acceptable with the use of full copies of the keys as separators, we might elect to use the no-compression option.

9.11 B-Trees, B+ Trees, and Simple Prefix B+ Trees in Perspective

In this chapter and the preceding chapter we have looked at a number of "tools" used in building file structures. These tools, B-trees, B+ trees, and simple prefix B+ trees, have similar-sounding names and a number of common features. We need a way to differentiate these tools so we can reliably choose the most appropriate one for a given file structure job.

Before addressing this problem of differentiation, however, we should point out that these are not the only tools in the toolbox. Because B-trees,

B+ trees, and their relatives are such powerful, flexible file structures, it is easy to fall into the trap of regarding them as the answer to all problems. This is a serious mistake. Simple index structures of the kind discussed in Chapter 6, which are maintained wholly in RAM, are a much simpler, neater solution when they suffice for the job at hand. As we saw at the beginning of this chapter, simple RAM indexes are not limited to direct access situations. This kind of index can be coupled with a sequence set of blocks to provide effective indexed sequential access as well. It is only when the index grows so large that we cannot economically hold it in RAM that we need to turn to paged index structures such as B-trees and B+ trees.

In the chapter that follows we encounter yet another tool, known as hashing. Like simple RAM-based indexes, hashing is an important alternative to B-trees, B+ trees, and so on. In many situations, hashing can provide faster access to a very large number of records than can the use of a member of the B-tree family.

So, B-trees, B+ trees, and simple prefix B+ trees are not a panacea. However, they do have broad applicability, particularly for situations that require the ability to access a large file both sequentially, in order by key, and through an index. All three of these different tools share the following characteristics:

They are all paged index structures, which means that they bring entire blocks of information into RAM at once. As a consequence, it is possible to choose between a great many alternatives (e.g., the keys for hundreds of thousands of records) with just a few seeks out to disk storage. The shape of these trees tends to be broad and shallow.

All three approaches maintain height-balanced trees. The trees do not grow in an uneven way, which would result in some potentially long searches for certain keys.

In all cases the trees grow from the bottom up. Balance is maintained through block splitting, concatenation, and redistribution.

With all three structures it is possible to obtain greater storage efficiency through the use of two-to-three splitting and of redistribution in place of block splitting when possible. These techniques are described in Chapter 8.

All three approaches can be implemented as virtual tree structures in

which the most recently used blocks are held in RAM. The advantages of virtual trees were described in Chapter 8.

Any of these approaches can be adapted for use with variable-length records using structures inside a block similar to those outlined in this chapter.

For all of this similarity, there are some important differences. These differences are brought into focus through a review of the strengths and unique characteristics of each of these three file structures.

B-Trees

B-trees contain information that is grouped as a set of pairs. One member of each pair is the key; the other member is the associated information. These pairs are distributed over all the nodes of the B-tree. Consequently, we might find the information we are seeking at any level of the B-tree. This differs from B+ trees and simple prefix B+ trees, which require

all searches to proceed all the way down to the lowest, sequence set level of the tree. Because the B-tree itself contains the actual keys and associated information, and there is therefore no need for additional storage to hold separators, a B-tree can take up less space than does a B+ tree.

Given a large enough block size and an implementation that treats the tree as a virtual B-tree, it is possible to use a B-tree for ordered sequential access as well as for indexed access. The ordered sequential access is obtained through an in-order traversal of the tree. The implementation as a virtual tree is necessary so this traversal does not involve seeking as it

returns to the next highest level of the tree. This use of a B-tree for indexed
sequential access

works only when the record information is actually stored within the B-tree. If the B-tree merely contains pointers to records that are in entry sequence off in some other file, then indexed sequential access is not workable because of all the seeking required to retrieve the actual record information.
B-trees are most attractive when the key itself comprises a large part of each record stored in the tree. When the key is only a small part of the record, it is possible to build a broader, shallower tree using B+ tree methods.

B+ Trees

The primary difference between the B+ tree and the B-tree is that in the B+ tree all the key and record information is contained in a linked set of blocks known as the sequence set. The key and record information is not in the upper-level, tree-like portion of the B+ tree. Indexed access to this sequence set is provided through a conceptually (though not necessarily physically) separate structure called the index set. In a B+ tree the index set consists of copies of the keys that represent the boundaries between sequence set blocks. These copies of keys are called separators since they separate a sequence set block from its predecessor.

There are two significant advantages that the B+ tree structure provides over the B-tree:

The sequence set can be processed in a truly linear, sequential way, providing efficient access to records in order by key; and

The use of separators, rather than entire records, in the index set often means that the number of separators that can be placed in a single index set block in a B+ tree substantially exceeds the number of records that could be placed in an equal-sized block in a B-tree. Separators (copies of keys) are simply smaller than the key/record pairs stored in a B-tree. Since you can put more of them in a block of a given size, it follows that the number of other blocks descending from that block can be greater. As a consequence, a B+ tree approach can often result in a shallower tree than would a B-tree approach.
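A rough calculation makes the second advantage concrete. The sizes used below are assumptions chosen only for illustration, not figures from the text.

#include <stdio.h>

int main()
{
    int blocksize = 4096;
    int keyrec    = 64 + 8;   /* assumed key/record pair plus child pointer
                                 stored in a B-tree node                    */
    int separator = 12 + 8;   /* assumed separator plus child pointer in a
                                 B+ tree index set block                    */

    printf("B-tree children per block:  %d\n", blocksize / keyrec);
    printf("B+ tree children per block: %d\n", blocksize / separator);
    return 0;
}

With these assumed sizes the B+ tree index block fans out to roughly 200 children while the B-tree block fans out to roughly 56, which is why the B+ tree can often be shallower.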
In practice, the latter of these two advantages is often the more important one. The impact of the first advantage is lessened by the fact that it is often possible to obtain acceptable performance during an in-order traversal of a B-tree through the page buffering mechanism of a virtual B-tree.

Simple Prefix B+ Trees

We just indicated that the primary advantage of using a B+ tree instead of a B-tree is that a B+ tree sometimes allows us to build a shallower tree because we can obtain a higher branching factor out of the upper-level blocks of the tree. The simple prefix B+ tree builds on this advantage by making the separators in the index set smaller than the keys in the sequence set, rather than just using copies of these keys. If the separators are smaller, then we can fit more of them into a block to obtain an even higher branching factor out of the block. In a sense, the simple prefix B+ tree takes one of the strongest features of the B+ tree one step farther.

The price we have to pay to obtain this separator compression and consequent increase in branching factor is that we must use an index set block structure that supports variable-length fields. The question of whether this price is worth the gain is one that has to be considered on a case-by-case basis.

SUMMARY
We begin this chapter by presenting a new problem. In previous chapters we provided either indexed access or sequential access in order by key, but without finding an efficient way to provide both of these kinds of access. This chapter explores one class of solutions to this problem, a class based on the use of a blocked sequence set and an associated index set.

The sequence set holds all of the file's data records in order by key. Since all insertion or deletion operations on the file begin with modifications to the sequence set, we start our study of indexed sequential file structures with an examination of a method for managing sequence set changes. The fundamental tools used to insert and delete records while still keeping everything in order within the sequence set are ones that we encountered in Chapter 8: block splitting, block concatenation, and redistribution of records between blocks. The critical difference between the use made of these tools for B-trees and the use made here is that there is no promotion of records or keys during block splitting in a sequence set. A sequence set is just a linked list of blocks, not a tree; therefore there is no place to promote anything to. So, when a block splits, all the records are divided between blocks at the same level; when blocks are concatenated there is no need to bring anything down from a parent node.

In this chapter, we also discuss the question of how large to make sequence set blocks. There is no precise answer we can give to this question since conditions vary between applications and environments. In general a block should be large, but not so large that we cannot hold several blocks in RAM or cannot read in a single block without incurring the cost of a seek. In practice, blocks are often the size of a cluster (on sector-formatted disks) or the size of a single disk track.

Once we are able to build and maintain a sequence set, we turn to the matter of building an index for the blocks in the sequence set. If the index is small enough to fit in RAM, one very satisfactory solution is to use a simple index that might contain, for example, the key for the last record in every block of the sequence set.

If the index set turns out to be too large to fit in RAM, we recommend the use of the same strategy we developed in the preceding chapter when a simple index outgrows the available RAM space: We turn the index into a B-tree. This combination of a sequence set with a B-tree index set is our first encounter with the structure known as a B+ tree.
Before looking at B trees as complete entities, we take a closer look at
the makeup of the index set. The index set does not hold any information
that we would ever seek for its own sake. Instead, an index set is used only
as a roadmap to guide searches into the sequence set. The index set consists
of separators that allow us to choose between sequence set blocks. There are
many possible separators for any two sequence set blocks, so we might as
well choose the shortest separator. The scheme we use to find this shortest
separator consists of finding the common prefix of the two keys on either
side of a block boundary in the sequence set, and then going one letter
beyond this common prefix to define a true separator. A B+ tree with an index set made up of separators formed in this way is called a simple prefix B+ tree.

We study the mechanism used to maintain the index set as insertions


and deletions are made in the sequence set of a B+ tree. The principal observation we make about all of these operations is that the primary action is within the sequence set, since that is where the records are. Changes to the index set are secondary; they are a byproduct of the fundamental operations on the sequence set. We add a new separator to the index set only if we form a new block in the sequence set; we delete a separator from the index set only if we remove a block from the sequence set through concatenation. Block overflow and underflow in the index set differ from the operations on the sequence set in that the index set is potentially a multilevel structure and is therefore handled as a B-tree.

The size of blocks in the index set is usually the same as the size chosen for the sequence set. To create blocks containing variable numbers of variable-length separators while at the same time supporting binary searching, we develop an internal structure for the block that consists of block header fields (for the separator count and total separator length), the variable-length separators themselves, an index to these separators, and a vector of relative block numbers (RBNs) for the blocks descending from the index set block. This illustrates an important general principle about large blocks within file structures: They are more than just a slice out of a homogeneous set of records; blocks often have a sophisticated internal structure of their own, apart from the larger structure of the file.

We turn next to the problem of loading a B+ tree. We find that if we start with a set of records sorted by key, we can use a single-pass, sequential process to place these records into the sequence set. As we move from block to block in building the sequence set, we can extract separators and build the blocks of the index set. Compared to a series of successive insertions that work down from the top of the tree, this sequential loading process is much more efficient. Sequential loading also lets us choose the percentage of space utilized, right up to a goal of 100%.

The chapter closes with a comparison of B-trees, B+ trees, and simple prefix B+ trees. The primary advantages that B+ trees offer over B-trees are:

They support true indexed sequential access; and

The index set contains only separators, rather than full keys and records, so it is often possible to create a B+ tree that is shallower than a B-tree.

We suggest that the second of these advantages is often the more important one, since treating a B-tree as a virtual tree provides acceptable indexed sequential access in many circumstances. The simple prefix B+ tree takes this second advantage and carries it farther, compressing the separators and potentially producing an even shallower tree. The price for this extra compression in a simple prefix B+ tree is that we must deal with variable-length fields and a variable-order tree.

KEY TERMS

B+ tree. A B+ tree consists of a sequence set of records that are ordered sequentially by key, along with an index set that provides indexed access to the records. All of the records are stored in the sequence set. Insertions and deletions of records are handled by splitting, concatenating, and redistributing blocks in the sequence set. The index set, which is used only as a finding aid to the blocks in the sequence set, is managed as a B-tree.


Index set. The index set consists of separators that provide information about the boundaries between the blocks in the sequence set of a B+ tree. The index set can locate the block in the sequence set that contains the record corresponding to a certain key.

Indexed sequential access. Indexed sequential access is not actually a single-access method, but rather a term used to describe situations in which a user wants both sequential access to records, ordered by key, and indexed access to those same records. B+ trees are just one method for providing indexed sequential access.

Separator. Separators are derived from the keys of the records on either side of a block boundary in the sequence set. If a given key is in one of the two blocks on either side of a separator, the separator reliably tells the user which of the two blocks holds the key.

Sequence set. The sequence set is the base level of an indexed sequential file structure, such as a B+ tree. It contains all of the records in the file. When read in logical order, block after block, the sequence set lists all of the records in order by key.


Shortest separator. Many possible separators can be used to distinguish between any two blocks in the sequence set. The class of shortest separators consists of those separators that take the least space, given a particular compression strategy. We looked carefully at a compression strategy that consists of removing as many letters as possible from the rear of the separators, forming the shortest simple prefix that can still serve as a separator.

Simple prefix B+ tree. A B+ tree in which the index set is made up of shortest separators that are simple prefixes, as described in the definition for shortest separator.

Variable order. A B-tree is of variable order when the number of direct descendents from any given node of the tree is variable. This occurs when the B-tree nodes contain a variable number of keys or separators. This form is most often used when there is variability in the lengths of the keys or separators. Simple prefix B+ trees always make use of a variable-order B-tree as an index set so it is possible to take advantage of the compression of separators and place more of them in a block.

EXERCISES

1. Describe file structures that permit each of the following types of access: (a) sequential access only; (b) direct access only; (c) indexed sequential access.


2. A B+ tree structure is generally superior to a B-tree for indexed sequential access. Since B+ trees incorporate B-trees, why not use a B+ tree whenever a hierarchical indexed structure is called for?

3. Consider the sequence set shown in Fig. 9.1(b). Show the sequence set after the keys DOVER and EARNEST are added; then show the sequence set after the key DAVIS is deleted. Did you use concatenation or redistribution for handling the underflow?

4. What considerations affect your choice of a block size for constructing a sequence set? If you know something about expected patterns of access (primarily sequential versus primarily random versus an even division between the two), how might this affect your choice of block size? On a sector-oriented drive, how might sector size and cluster size affect your choice of block size?

5. It

is possible to construct an indexed sequential file without using a tree-structured index. A simple index like the one developed in Chapter 6 could be used. Under what conditions might one consider using such an index? Under what conditions might it be reasonable to use a binary tree (such as an AVL tree) rather than a B-tree for the index?

6. The index set of a B+ tree is just a B-tree, but unlike the B-trees discussed in Chapter 8, the separators do not have to be keys. Why the difference?
7. How does block splitting in the sequence set of a simple prefix B+ tree differ from block splitting in the index set?

8. If the key BOLEN in the simple prefix B+ tree in Fig. 9.8 is deleted from the sequence set node, how is the separator BO in the parent node affected?

9. Consider the simple prefix B+ tree shown in Fig. 9.8. Suppose a key added to block 5 results in a split of block 5 and the consequent addition of block 8, so blocks 5 and 8 appear as follows:

FABER-FINGER    FINNEY-FOLK

a. What does the tree look like after the insertion?
b. Suppose that, subsequent to the insertion, a deletion causes underflow and the consequent concatenation of blocks 4 and 5. What does the tree look like after the deletion?
c. Describe a case in which a deletion results in redistribution, rather than concatenation, and show the effect it has on the tree.

10. Why is it often a good idea to use the same block size for the index set and the sequence set in a simple prefix B+ tree? Why should the index set nodes and the sequence set nodes usually be kept in the same file?
11. Show a conceptual view of an index set block, similar to the one illustrated in Fig. 9.12, that is loaded with the separators

Ab Arch Astron B Bea

Also show a more detailed view of the index block, as illustrated in Fig. 9.13.

12. If the initial set of records is sorted by key, the process of loading a B+ tree can be handled by using a single-pass sequential process, instead of randomly inserting new records into the tree. What are the advantages of this approach?

13. Show how the simple prefix B+ tree in Fig. 9.17 changes after the addition of the node

ITEMIZE-JAR

Assume that the index set node containing the separators EF, H, and IG does not have room for the new separator but that there is room in the root.

14. Use the data stored in the simple prefix B+ tree in Fig. 9.17 to construct a B+ tree. Assume that the index set of the B+ tree is of order four. Compare the resulting B+ tree with the simple prefix B+ tree.

15. The use of variable-length separators and/or key compression changes some of the rules about how we define and use a B-tree and how we measure B-tree performance.
    a. How does it affect our definition of the order of a B-tree?
    b. Suggest criteria for deciding when splitting, concatenation, and redistribution should be performed.
    c. What difficulties arise in estimating simple prefix B+ tree height, maximum number of accesses, and space?

16. Make a table comparing B-trees, B+ trees, and simple prefix B+ trees in terms of the criteria listed below. Assume that the B-tree nodes do not contain data records, but only keys and corresponding RRNs of data records. In some cases you will be able to give specific answers based on a tree's height or the number of keys in the tree. In other cases, the answers will depend on unknown factors, such as patterns of access or average separator length.
    a. The number of accesses required to retrieve a record from a tree of height h (average, best case, and worst case).
    b. The number of accesses required to insert a record (best and worst cases).
    c. The number of accesses required to delete a record (best and worst cases).
    d. The number of accesses required to process a file of n keys sequentially, assuming that each node can hold a maximum of k keys and a minimum of k/2 keys (best and worst cases).
    e. The number of accesses required to process a file of n keys sequentially, assuming that there are h + 1 node-sized buffers available.

17. Some commercially available indexed sequential file organizations are based on block interval splitting approaches very similar to those used with B+ trees. IBM's VSAM offers the user several file access modes, one of which is called key-sequenced access and which results in a file being organized much like a B+ tree. Look up a description of VSAM and report on how its key-sequenced organization relates to a B+ tree, and also how it offers the user file handling capabilities well beyond those of a straightforward B+ tree implementation. (See the Further Readings section of this chapter for articles and books on VSAM.)

18. Although B+ trees provide the basis for most indexed sequential access methods now in use, this was not always the case. A method called ISAM (see Further Readings for this chapter) was once very common, especially on large computers. ISAM uses a rigid tree-structured index consisting of at least two and at most three levels. Indexes at these levels are tailored to the specific disk drive being used. Data records are organized by track, so the lowest level of an ISAM index is called the track index. Since the track index points to the track on which a data record can be found, there is one track index for each cylinder. When the addition of data records causes a track to overflow, the track is not split. Instead, the extra records are put into a separate overflow area and chained together in logical order. Hence, every entry in a track index may contain a pointer to the overflow area, in addition to its pointer to the home track.

    The essential difference between the ISAM organization and B+ tree-like organizations is in the way overflow records are handled. In the case of ISAM, overflow records are simply added to a chain of overflow records; the index structure is not altered. In the B+ tree case, overflow records are not tolerated. When overflow occurs, a block is split and the index structure is altered to accommodate the extra data block.

    Can you think of any advantages of using the more rigid index structure of ISAM, with separate overflow areas to handle overflow records? Why do you think B+ tree-like approaches are replacing those that use overflow chains to hold overflow records? Consider the two approaches in terms of both sequential and direct access, as well as addition and deletion of records.

Programming Exercises

We begin this chapter by discussing operations on a sequence set, which is just a linked list of blocks containing records. Only later do we add the concept of an index set to provide faster access to the blocks in the sequence set. The following programming problems echo this approach, requiring you first to write a program that builds a sequence set, then to write functions that maintain the sequence set, and finally to write functions to add an index set to the sequence set, creating a B+ tree. These programs can be implemented in either C or Pascal.

19. Write a program that accepts a file of strings as input. The input file should be sorted so the strings are in ascending order. Your program should use this input file to build a sequence set with the following characteristics:
    - The strings are stored in 15-byte records;
    - The sequence set block is 128 bytes long;
    - Sequence set blocks are doubly linked;
    - The first block in the output file is a header block containing, among other things, a reference to the RRN of the first block in the sequence set;
    - Sequence set blocks are loaded so they are as full as possible; and
    - Sequence set blocks contain other fields (other than the actual records containing the strings) as needed.

20. Write an update program that accepts strings input from the keyboard, along with an instruction either to search, add, or delete the string from the sequence set. The program should have the following characteristics:
    - Strings in the sequence set must, of course, be kept in order;
    - Response to the search instruction should be either found or not found;
    - A string should not be added if it is already in the sequence set;
    - Blocks in the sequence set should never be allowed to be less than half full; and
    - Splitting, redistribution, and concatenation operations should be written as separate procedures so they can be used in subsequent program development.
21. Write a program that traverses the sequence set created in the preceding exercises and that builds an index set in the form of a B-tree. You may assume that the B-tree index will never be deeper than two levels. The resulting file should have the following characteristics:
    - The index set and the sequence set, taken together, should constitute a B+ tree;
    - Do not compress the keys as you form the separators for the index set;
    - Index set blocks, like sequence set blocks, should be 128 bytes long;
    - Index set blocks should be kept in the same file as the sequence set blocks; and
    - The header block should contain a reference to the root of the index set as well as the already existing reference to the beginning of the sequence set.

22. Write a new version of the update program that acts on the entire B+ tree that you created in the preceding exercise. Search, add, and delete capabilities should be supported, as they are in the earlier update program. B-tree characteristics should be maintained in the index set; the sequence set should, as before, be maintained so blocks are always at least half full.

23. Consider the block structure illustrated in Fig. 9.13, in which an index to separators is used to permit binary searching for a key in an index page. Each index set block contains three variable length sets of items: a set of separators, an index to the separators, and a set of relative block numbers. Develop code in Pascal or C for storing these items in an index block and for searching the block for a separator. You need to answer such questions as:
    - Where should the three sets be placed relative to one another?
    - Given the data types permitted by the language you are using, how can you handle the fact that the block consists of both character and integer data with no fixed dividing point between them?
    - As items are added to a block, how do you decide when a block is too full to insert another separator?

FURTHER READINGS

The initial suggestion for the B+ tree structure appears to have come from Knuth (1973b), although he did not name or develop the approach. Most of the literature that discusses B+ trees in detail (as opposed to describing specific implementations such as VSAM) is in the form of articles rather than textbooks. Comer (1979) provides what is perhaps the best brief overview of B+ trees. Bayer and Unterauer (1977) offer a definitive article describing techniques for compressing separators. The article includes consideration of simple prefix B+ trees as well as a more general approach called a prefix B+ tree. McCreight (1977) describes an algorithm for taking advantage of the variation in the lengths of separators in the index set of a B+ tree. McCreight's algorithm attempts to ensure that short separators, rather than longer ones, are promoted up the tree as blocks split. The intent is to shape the tree so blocks higher up in the tree have a greater number of immediate descendents, thereby creating a shallower tree.

Rosenberg and Snyder (1981) study the effects of initializing a compact B-tree on later insertions and deletions. The use of batch insertions and deletions to B-trees, rather than individual updates, is proposed and analyzed in Lang et al. (1985). B+ trees are compared with more rigid indexed sequential file organizations (such as ISAM) in Batory (1981) and in IBM's VSAM Planning Guide.

There are many commercial products that use methods related to the B+ tree operations described in this chapter, but detailed descriptions of their underlying file structures are scarce. An exception to this is IBM's Virtual Storage Access Method (VSAM), one of the most widely used commercial products providing indexed sequential access. Wagner (1973) and Keehn and Lacy (1974) provide interesting insights into the early thinking behind VSAM. They also include considerations of key maintenance, key compression, secondary indexes, and indexes to multiple data sets. Good descriptions of VSAM can be found in several sources, and from a variety of perspectives, in IBM's VSAM Planning Guide, Bohl (1981), Comer (1979) (VSAM as an example of a B+ tree), Bradley (1982) (emphasis on implementation in a PL/I environment), and Loomis (1983) (with examples from COBOL).

VAX-11 Record Management Services (RMS), Digital's file and record access subsystem of the VAX/VMS operating system, uses a B+ tree-like structure to support indexed sequential access (Digital, 1979). Many microcomputer implementations of B+ trees can be found, including dBase III and Borland's Turbo Toolbox (Borland, 1984).

10
Hashing

CHAPTER OBJECTIVES

- Introduce the concept of hashing.
- Examine the problem of choosing a good hashing algorithm, present a reasonable one in detail, and describe some others.
- Explore three approaches for reducing collisions: randomization of addresses, use of extra memory, and storage of several records per address.
- Develop and use mathematical tools for analyzing performance differences resulting from the use of different hashing techniques.
- Examine problems associated with file deterioration and discuss some solutions.
- Examine effects of patterns of record access on performance.

CHAPTER OUTLINE

10.1 Introduction
     10.1.1 What Is Hashing?
     10.1.2 Collisions
10.2 A Simple Hashing Algorithm
10.3 Hashing Functions and Record Distributions
     10.3.1 Distributing Records among Addresses
     10.3.2 Some Other Hashing Methods
     10.3.3 Predicting the Distribution of Records
     10.3.4 Predicting Collisions for a Full File
10.4 How Much Extra Memory Should Be Used?
     10.4.1 Packing Density
     10.4.2 Predicting Collisions for Different Packing Densities
10.5 Collision Resolution by Progressive Overflow
     10.5.1 How Progressive Overflow Works
     10.5.2 Search Length
10.6 Storing More Than One Record per Address: Buckets
     10.6.1 Effects of Buckets on Performance
     10.6.2 Implementation Issues
10.7 Making Deletions
     10.7.1 Tombstones for Handling Deletions
     10.7.2 Implications of Tombstones for Insertions
     10.7.3 Effects of Deletions and Additions on Performance
10.8 Other Collision Resolution Techniques
     10.8.1 Double Hashing
     10.8.2 Chained Progressive Overflow
     10.8.3 Chaining with a Separate Overflow Area
     10.8.4 Scatter Tables: Indexing Revisited
10.9 Patterns of Record Access

10.1 Introduction

O(1) access to files means that no matter how big the file grows, access to a record always takes the same, small number of seeks. By contrast, sequential searching gives us O(N) access, wherein the number of seeks grows in proportion to the size of the file. As we saw in the preceding chapters, B-trees improve on this greatly, providing O(log_k N) access; the number of seeks increases as the logarithm to the base k of the number of records, where k is a measure of the leaf size. O(log_k N) access can provide very good retrieval performance, even for very large files, but it is still not O(1) access.

In a sense, O(1) access has been the Holy Grail of file structure design. Everyone agrees that O(1) access is what we want to achieve, but until about 10 years ago it was not clear that one could develop a general class of O(1) access strategies that would work on dynamic files that change greatly in size.

In this chapter we begin with a description of static hashing techniques. They provide us with O(1) access but are not extensible as the file increases in size. Static hashing was the state of the art until about 1980. In the following chapter we show how research and design work during the 1980s has begun to find ways to extend hashing, and O(1) access, to files that are dynamic and increase greatly in size over time.

10.1.1 What Is Hashing?

A hash function is like a black box that produces an address every time you drop in a key. More formally, it is a function h(K) that transforms a key K into an address. The resulting address is used as the basis for storing and retrieving records. In Fig. 10.1, the key LOWELL is transformed by the hash function to the address 4. That is, h(LOWELL) = 4. Address 4 is said to be the home address of LOWELL.

Hashing is like indexing in that it involves associating a key with a relative record address. Hashing differs from indexing in two important ways:

- With hashing, the addresses generated appear to be random; there is no immediately obvious connection between the key and the location of the corresponding record, even though the key is used to determine the location of the record. For this reason, hashing is sometimes referred to as randomizing.

- With hashing, two different keys may be transformed to the same address, so two records may be sent to the same place in the file. When this occurs, it is called a collision, and some means must be found to deal with it.

Consider the following simple example. Suppose you want to store 75 records in a file, where the key to each record is a person's name. Suppose also that you set aside space for 1,000 records. The key can be hashed by taking two numbers from the ASCII representations of the first two characters of the name, multiplying these together, then using the rightmost three digits of the result for the address. Table 10.1 shows how three names would produce three addresses. Note that even though the names are listed in alphabetical order, there is no apparent order to the addresses. They appear to be in random order.
FIGURE 10.1 Hashing the key LOWELL to LOWELL's home address, 4. [The key K = LOWELL enters the hash function; the address 4 comes out, and the record for LOWELL is stored at address 4.]

10.1.2 Collisions

Now suppose there is a key in the sample file with the name OLIVIER. Since the name OLIVIER starts with the same two letters as the name LOWELL, they produce the same address (004). There is a collision between the record for OLIVIER and the record for LOWELL. We refer to keys that hash to the same address as synonyms.

Collisions cause problems. We cannot put two records in the same space, so we must resolve collisions. We do this in two ways: by choosing hashing algorithms partly on the basis of how few collisions they are likely to produce, and by playing some tricks with the ways we store records.

TABLE 10.1 A simple hashing scheme

    Name      ASCII Code for        Product            Home
              First Two Letters                        Address
    BALL          66  65            66 x 65 = 4,290     290
    LOWELL        76  79            76 x 79 = 6,004     004
    TREE          84  82            84 x 82 = 6,888     888
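Since the scheme in Table 10.1 is purely arithmetic, it can be expressed in a few lines of code. The following C fragment is only an illustration of the table's calculation; the function name first_two_hash() is our own, and it assumes keys of at least two uppercase letters.

    #include <stdio.h>

    /* Hash a name by multiplying the ASCII codes of its first two letters
       and keeping the rightmost three digits of the product, as in Table 10.1. */
    int first_two_hash(const char *name)
    {
        int product = name[0] * name[1];   /* e.g., 'L' * 'O' = 76 * 79 = 6004 */
        return product % 1000;             /* rightmost three digits: 004      */
    }

    int main(void)
    {
        printf("%03d\n", first_two_hash("BALL"));    /* 290 */
        printf("%03d\n", first_two_hash("LOWELL"));  /* 004 */
        printf("%03d\n", first_two_hash("TREE"));    /* 888 */
        return 0;
    }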

The ideal solution to collisions is to find a transformation algorithm that avoids collisions altogether. Such an algorithm is called a perfect hashing algorithm. It turns out to be much more difficult to find a perfect hashing algorithm than one might expect, however. Suppose, for example, that you want to store 4,000 records among 5,000 available addresses. It can be shown (Hanson, 1982) that of the huge number of possible hashing algorithms for doing this, only one out of 10^120,000 avoids collisions altogether. Hence, it is usually not worth trying.†

†It is not unreasonable to try to generate perfect hashing functions for small (less than 500), stable sets of keys, such as might be used to look up reserved words in a programming language. But files generally contain more than a few hundred keys, or they contain sets of keys that change frequently, so they are not normally considered candidates for perfect hashing functions. See Knuth (1973b), Sager (1985), Chang (1984), and Chichelli (1980) for more on perfect hashing functions.

A more practical solution is to reduce the number of collisions to an acceptable number. For example, if only one out of 10 searches for a record results in a collision, then the average number of disk accesses required to retrieve a record remains quite low. There are several different ways to reduce the number of collisions, including the following three:

- Spread out the records. Collisions occur when two or more records compete for the same address. If we could find a hashing algorithm that distributes the records fairly randomly among the available addresses, then we would not have large numbers of records clustering around certain addresses. Our sample hash algorithm, which uses only two letters from the key, is not good on this account because certain combinations of two letters are quite common in starting names, while others are uncommon (e.g., compare the number of names that start with "JO" with the number that start with "XZ"). We need to find a hashing algorithm that distributes records more randomly.

- Use extra memory. It is easier to find a hash algorithm that avoids collisions if we have only a few records to distribute among many addresses than if we have about the same number of records as addresses. Our sample hashing algorithm is very good on this account since there are 1,000 possible addresses and only 75 addresses (corresponding to the 75 records) will be generated. The obvious disadvantage to spreading out the records is that storage space is wasted. (In the example, 7.5% of the available record space is used, and the remaining 92.5% is wasted.) There is no simple answer to the question of how much empty space should be tolerated to get the best hashing performance, but some techniques are provided later in this chapter for measuring the relative gains in performance for different amounts of free space.

- Put more than one record at a single address. Up to now we have assumed tacitly that each physical record location in a file could hold exactly one record, but there is usually no reason why we cannot create our file in such a way that every file address is big enough to hold several records. If, for example, each record is 80 bytes long and we create a file with 512-byte physical records, we can store up to six records at each file address. Each address is able to tolerate five synonyms. Addresses that can hold several records in this way are sometimes called buckets.

In the following sections we elaborate on these collision-reducing methods, and as we do so we present some programs for managing hashed files.

10.2 A Simple Hashing Algorithm

One goal in choosing any hashing algorithm should be to spread out records as uniformly as possible over the range of addresses available. The use of the term hash for this technique suggests what is done to achieve this. Our dictionary reminds us that the verb to hash means "to chop into small pieces . . . muddle or confuse." The algorithm used previously chops off the first two letters and then uses the resulting ASCII codes to produce a number that is in turn chopped to produce the address. It is not very good at avoiding clusters of synonyms because so many names begin with the same two letters.

One problem with the algorithm is that it does not really do very much hashing. It uses only two letters of the key, and it does not do much with the two letters. Now let us look at a hash function that does much more randomizing, primarily because it uses more of the key. It is a reasonably good basic algorithm and is likely to give good results no matter what kinds of keys are used. It is also an algorithm that is not too difficult to alter in case a specific instance of the algorithm does not work well.

This algorithm has three steps:

1. Represent the key in numerical form.
2. Fold and add.
3. Divide by a prime number and use the remainder as the address.

Step 1. Represent the Key in Numerical Form  If the key is already a number, then this step is already accomplished. If it is a string of characters, we take the ASCII code of each character and use it to form a number. For example,

    LOWELL = 76 79 87 69 76 76 32 32 32 32 32 32
             L  O  W  E  L  L  |----- blanks ----|

In this algorithm we use the entire key, rather than just the first two letters. By using more parts of a key, we increase the likelihood that differences among the keys cause differences in addresses produced. The extra processing time required to do this is usually insignificant when compared to the potential improvement in performance.

Step 2. Fold and Add  Folding and adding means chopping off pieces of the number and adding them together. In our algorithm we chop off pieces with two ASCII numbers each:

    76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32

These number pairs can be thought of as integer variables (rather than character variables, which is how they started out) so we can do arithmetic on them. If we can treat them as integer variables, then we can add them. This is easy to do in C because C allows us to do arithmetic on characters. In Pascal, we can use the ord() function to obtain the integer position of a character within the computer's character set.

Before we add the numbers, we have to mention a problem caused by the fact that in most cases the sizes of numbers we can add together are limited. On some microcomputers, for example, integer values that exceed 32,767 (15 bits) cause overflow errors or become negative. For example, adding the first five of the foregoing numbers gives

    7679 + 8769 + 7676 + 3232 + 3232 = 30,588.

Adding in the last 3,232 would, unfortunately, push the result over the maximum 32,767 (30,588 + 3,232 = 33,820), causing an overflow error. Consequently, we need to make sure that each successive sum is less than 32,767. We can do this by first identifying the largest single value we will ever add in our summation and then making sure after each step that our intermediate result differs from 32,767 by that amount.

In our case, let us assume that keys consist only of blanks and uppercase alphabetic characters, so the largest addend is 9,090, corresponding to ZZ. Suppose we choose 19,937 as our largest allowable intermediate result. This differs from 32,767 by much more than 9,090, so we can be confident (in this example) that no new addition will cause overflow. We can ensure in our algorithm that no intermediate sum exceeds 19,937 by using the mod operator, which returns the remainder when one integer is divided by another:

    7679 + 8769  -> 16448       16448 mod 19937 -> 16448
    16448 + 7676 -> 24124       24124 mod 19937 -> 4187
    4187 + 3232  -> 7419        7419 mod 19937  -> 7419
    7419 + 3232  -> 10651       10651 mod 19937 -> 10651
    10651 + 3232 -> 13883       13883 mod 19937 -> 13883

The number 13,883 is the result of the fold-and-add operation.

Why did we use 19,937 as our upper bound rather than, say, 20,000? Because the division and subtraction operations associated with the mod operator are more than just a way of keeping the number small; they are part of the transformation work of the hash function. As we see in the discussion for the next step, division by a prime number usually produces a more random distribution than does transformation by a nonprime. The number 19,937 is prime.

Step 3. Divide by the Size of the Address Space  The purpose of this step is to cut down to size the number produced in step 2 so it falls within the range of addresses of records in the file. This can be done by dividing that number by a number that is the address size of the file and then taking the remainder. The remainder will be the home address of the record.

We can represent this operation symbolically as follows: if s represents the sum produced in step 2 (13,883 in the example), n represents the divisor (the number of addresses in the file), and a represents the address we are trying to produce, we apply the formula

    a = s mod n.

The remainder produced by the mod operator will be a number between 0 and n - 1.

Suppose, for example, that we decide to use the 100 addresses 0-99 for our file. In terms of the preceding formula,

    a = 13883 mod 100
      = 83.

Since the number of addresses allocated for the file does not have to be any specific size (as long as it is big enough to hold all of the actual records to be stored in the file), we have a great deal of freedom in choosing the divisor n. It is a good thing that we do, because the choice of n can have a major effect on how well the records are spread out. A prime number is usually used for the divisor because primes tend to distribute remainders much more uniformly than do nonprimes. A nonprime can work well in many cases, however, especially if it has no prime divisors less than 20 (Hanson, 1982). Since the remainder is going to be the address of a record, we choose a number as close as possible to the desired size of the address space. This number actually determines the size of the address space. For a file with 75 records, a good choice might be 101, which would leave the file 74.3% full (75/101 = 0.743).

If 101 is the size of the address space, the home address of the record in the example becomes

    a = 13883 mod 101
      = 46.

Hence, the record whose key is LOWELL is assigned to record number 46 in the file.

The procedure described previously can be carried out with a function we call hash(), described mostly in pseudocode in Fig. 10.2. Function hash() takes two inputs: KEY, which must be an array of ASCII codes for at least 12 characters, and MAXAD, which has the address size. The value returned by hash() is the address.

    FUNCTION hash(KEY, MAXAD)
        set SUM to 0
        set J to 0
        while (J < 12)
            set SUM to (SUM + 100*KEY[J] + KEY[J+1]) mod 19937
            increment J by 2
        endwhile
        return (SUM mod MAXAD)
    end FUNCTION

FIGURE 10.2 Function hash(KEY, MAXAD) uses folding and prime number division to compute a hash address.
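The pseudocode in Fig. 10.2 translates almost line for line into C. The version below is a sketch under the same assumptions as the figure (a 12-character key of ASCII codes and a prime address-space size passed in as maxad); the name foldhash() is ours, not the book's.

    /* foldhash: fold and add the key two characters at a time, keeping each
       intermediate sum below the prime 19,937, then divide by the size of
       the address space and return the remainder as the home address.      */
    int foldhash(const char key[12], int maxad)
    {
        long sum = 0;
        int j;
        for (j = 0; j < 12; j += 2)
            sum = (sum + 100 * key[j] + key[j + 1]) % 19937;
        return (int)(sum % maxad);
    }

Called as foldhash("LOWELL      ", 101), with the key padded to 12 characters with blanks, this sketch returns 46, the home address computed in the example above.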

10.3 Hashing Functions and Record Distributions

Of the two hash functions we have so far examined, one spreads out records pretty well, and one does not spread them out well at all. In this section we look at ways to describe distributions of records in files. Understanding distributions makes it easier to discuss other hashing methods.

10.3.1 Distributing Records among Addresses

Figure 10.3 illustrates three different distributions of seven records among 10 addresses. Ideally, a hash function should distribute records in a file so there are no collisions, as illustrated by distribution (a). Such a distribution is called uniform because the records are spread out uniformly among the addresses. We pointed out earlier that completely uniform distributions are so hard to find that it is generally not considered worth trying to find them.

Distribution (b) illustrates the worst possible kind of distribution. All records share the same home address, resulting in the maximum number of collisions. The more a distribution looks like this one, the more collisions will be a problem.

Distribution (c) illustrates a distribution in which the records are somewhat spread out, but with a few collisions. This is the most likely case if we have a function that distributes keys randomly. If a hash function is random, then for a given key every address has the same likelihood of being chosen as every other address. The fact that a certain address is chosen for one key neither diminishes nor increases the likelihood that the same address will be chosen for another key.

It should be clear that if a random hash function is used to generate a large number of addresses from a large number of keys, then simply by chance some addresses are going to be generated more often than others. If you have, for example, a random hash function that generates addresses between 0 and 99, and you give the function 100 keys, you would expect some of the 100 addresses to be chosen more than once and some to be chosen not at all.

FIGURE 10.3 Different distributions. (a) No synonyms (uniform, best). (b) All synonyms (worst case). (c) A few synonyms (acceptable).

Although a random distribution of records among the available addresses is not ideal, it is an acceptable alternative, given that it is practically impossible to find a function that gives a uniform distribution. Uniform distributions may be out of the question, but there are times when we can find distributions that are better than random in the sense that, while they do generate a fair number of synonyms, they spread out records among addresses more uniformly than does a random distribution.

10.3.2 Some Other Hashing Methods

It would be nice if there were a hash function that guaranteed a better-than-random distribution in all cases, but there is not. The distribution generated by a hashing function depends on the set of keys that are actually hashed. Therefore, the choice of a proper hashing function should involve some intelligent consideration of the keys to be hashed, and perhaps some experimentation. The approaches to choosing a reasonable hashing function covered in this section are ones that have been found to work well, given the right circumstances. Further details on these and other methods can be found in Knuth (1973b), Maurer (1975), Hanson (1982), and Sorenson et al. (1978).

Here are some methods that are potentially better than random:

- Examine keys for a pattern. Sometimes keys fall in patterns that naturally spread themselves out. This is more likely to be true of numeric keys than of alphabetic keys. For example, a set of employee identification numbers might be ordered according to when the employees entered an organization. This might even lead to no synonyms. If some part of a key shows a usable underlying pattern, a hash function that extracts that part of the key can also be used.

- Fold parts of the key. Folding is one stage in the method discussed earlier. It involves extracting digits from part of a key and adding the extracted parts together. This method destroys the original key patterns but in some circumstances may preserve the separation between certain subsets of keys that naturally spread themselves out.

- Divide the key by a number. Division by the address size and use of the remainder usually is involved somewhere in a hash function since the purpose of the function is to produce an address within a certain range. Division preserves consecutive key sequences, so you can take advantage of sequences that effectively spread out keys. However, if there are several consecutive key sequences, division by a number that has many small factors can result in many collisions. Research has shown that numbers with no divisors less than 19 generally avoid this problem. Division by a prime is even more likely than division by a nonprime to generate different results from different consecutive sequences.

The preceding methods are designed to take advantage of natural orderings among the keys. The next two methods should be tried when, for some reason, the better-than-random methods do not work. In these cases, randomization is the goal.

- Square the key and take the middle. This popular method (often called the mid-square method) involves treating the key as a single large number, squaring the number, and extracting whatever number of digits is needed from the middle of the result. For example, suppose you want to generate addresses between 0 and 99. If the key is the number 453, its square is 205,209. Extracting the middle two digits yields a number between 0 and 99, in this case 52. As long as the keys do not contain many leading or trailing zeros, this method usually produces fairly random results. One unattractive feature of this method is that it often requires multiple precision arithmetic.

- Radix transformation. This method involves converting the key to some number base other than the one you are working in, and then taking the result modulo the maximum address as the hash address. For example, suppose you want to generate addresses between 0 and 99. If the key is the decimal number 453, its base 11 equivalent is 382; 382 mod 99 = 85, so 85 is the hash address. Radix transformation is generally more reliable than the mid-square method for approaching true randomization, though mid-square has been found to give good results when applied to some sets of keys. (Both randomizing transformations are sketched in code after this list.)
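The two randomizing transformations are easy to express in code. The C functions below are sketches tied to the examples above (two-digit addresses in the range 0-99); the names midsquare() and radix_hash() are ours, not anything standard.

    /* Mid-square: square the key and extract the middle two digits of the
       six-digit result, giving an address between 0 and 99.               */
    int midsquare(long key)
    {
        long square = key * key;              /* 453 * 453 = 205,209        */
        return (int)((square / 100) % 100);   /* middle two digits: 52      */
    }

    /* Radix transformation: rewrite the key in base 11, read the digits back
       as a decimal number, and take that number modulo the maximum address. */
    int radix_hash(long key)
    {
        long digits = 0, place = 1;
        while (key > 0) {
            digits += (key % 11) * place;     /* next base-11 digit          */
            place *= 10;
            key /= 11;
        }
        return (int)(digits % 99);            /* 453 -> 382; 382 mod 99 = 85 */
    }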

10.3.3 Predicting the Distribution of Records

Given that it is nearly impossible to achieve a uniform distribution of records among the available addresses in a file, it is important to be able to predict how records are likely to be distributed. If we know, for example, that a large number of addresses are likely to have far more records assigned to them than they can hold, then we know that there are going to be a lot of collisions.

Although there are no nice mathematical tools available for predicting collisions among distributions that are better than random, there are mathematical tools for understanding just this kind of behavior when records are distributed randomly. If we assume a random distribution (knowing that very likely it will be better than random), we can use these tools to obtain conservative estimates of how our hashing method is likely to behave.

The Poisson Distribution†

We want to predict the number of collisions that are likely to occur in a file that can hold only one record at an address. We begin by concentrating on what happens to a single given address when a hash function is applied to a key. When all of the keys in a file are hashed, what is the likelihood that

- None will hash to the given address?
- Exactly one key will hash to the address?
- Exactly two keys will hash to the address (two synonyms)?
- Exactly three, four (and so on) keys will hash to the address?
- All keys in the file will hash to the same given address?

†This section develops a formula for predicting the ways in which records will be distributed among addresses in a file if a random hashing function is used. The discussion assumes knowledge of some elementary concepts of probability and combinatorics. You may want to skip the development and go straight to the formula, which is introduced in the next section.

Which of these outcomes would you expect to be fairly likely, and which quite unlikely? Suppose there are N addresses in a file. When a single key is hashed, there are two possible outcomes with respect to the given address:

    A -- The address is not chosen; or
    B -- The address is chosen.

How do we express the probabilities of the two outcomes? If we let both p(A) and a stand for the probability that the address is not chosen, and p(B) and b stand for the probability that the address is chosen, then

    p(B) = b = 1/N,

since the address has one chance in N of being chosen, and

    p(A) = a = (N - 1)/N = 1 - 1/N,

since the address has N - 1 chances in N of not being chosen. If there are 10 addresses (N = 10), the probability of our address being chosen is b = 1/10 = 0.1, and the probability of the address not being chosen is a = 1 - 0.1 = 0.9.

Now suppose two keys are hashed. What is the probability that both keys hash to our given address? Since the two applications of the hashing function are independent of one another, the probability that both will produce the given address is a product:

    p(BB) = b x b = 1/N x 1/N;    for N = 10:  0.1 x 0.1 = 0.01.

Of course, other outcomes are possible when two keys are hashed. For example, the second key could hash to an address other than the given address. The probability of this is the product

    p(BA) = b x a = 1/N x (1 - 1/N);    for N = 10:  0.1 x 0.9 = 0.09.

In general, when we want to know the probability of a certain sequence of outcomes, such as BABBA, we can replace each A and B by a and b, respectively, and compute the indicated product:

    p(BABBA) = b x a x b x b x a = a^2 b^3;    for N = 10:  a^2 b^3 = (0.9)^2 (0.1)^3.

This example shows how to find the probability of three Bs and two As, where the Bs and As occur in the order shown. We want to know the probability that there are a certain number of Bs and As, but without regard to order. For example, suppose we are hashing four keys and we want to know how likely it is that exactly two of the keys hash to our given address. This can occur in six ways, all six ways having the same probability:

    Outcome    Probability            For N = 10
    BBAA       bbaa = b^2 a^2         (0.1)^2 (0.9)^2 = 0.0081
    BABA       baba = b^2 a^2         (0.1)^2 (0.9)^2 = 0.0081
    BAAB       baab = b^2 a^2         (0.1)^2 (0.9)^2 = 0.0081
    ABBA       abba = b^2 a^2         (0.1)^2 (0.9)^2 = 0.0081
    ABAB       abab = b^2 a^2         (0.1)^2 (0.9)^2 = 0.0081
    AABB       aabb = b^2 a^2         (0.1)^2 (0.9)^2 = 0.0081

Since these six sequences are independent of one another, the probability of two Bs and two As is the sum of the probabilities of the individual outcomes:

    p(BBAA) + p(BABA) + . . . + p(AABB) = 6 b^2 a^2 = 6 x 0.0081 = 0.0486.

The 6 in the expression 6 b^2 a^2 represents the number of ways two Bs and two As can be distributed among four places.

In general, the event "r trials result in r - x As and x Bs" can happen in as many ways as r - x letters A can be distributed among r places. The probability of each such way is

    a^(r-x) b^x,

and the number of such ways is given by the formula

    C = r! / ((r - x)! x!).

This is the well-known formula for the number of ways of selecting x items out of a set of r items. It follows that when r keys are hashed, the probability that an address will be chosen x times and not chosen r - x times can be expressed as

    p(x) = C a^(r-x) b^x.

Furthermore, if we know that there are N addresses available, we can be precise about the individual probabilities of A and B, and the formula becomes

    p(x) = C (1 - 1/N)^(r-x) (1/N)^x,

where C has the definition given previously.

What does this mean? It means that if, for example, x = 0, we can compute the probability that a given address will have 0 records assigned to it by the hashing function using the formula

    p(0) = C (1 - 1/N)^r (1/N)^0.

If x = 1, this formula gives the probability that one record will be assigned to a given address:

    p(1) = C (1 - 1/N)^(r-1) (1/N)^1.

This expression has the disadvantage that it is awkward to compute for large values of r and N. (Try it for 1,000 addresses and 1,000 records: N = r = 1,000.) Fortunately, for large values of N and r, there is a function that is a very good

approximation for p(x) and is much easier to compute. It is called the Poisson function.

The Poisson Function Applied to Hashing  The Poisson function, which we also denote by p(x), is given by

    p(x) = ((r/N)^x e^(-r/N)) / x!,

where N, r, x, and p(x) have exactly the same meaning they have in the previous section. That is, if

    N = the number of available addresses;
    r = the number of records to be stored; and
    x = the number of records assigned to a given address,

then p(x) gives the probability that a given address will have had x records assigned to it after the hashing function has been applied to all r records.

Suppose, for example, that there are 1,000 addresses (N = 1,000) and 1,000 records whose keys are to be hashed to the addresses (r = 1,000). Since r/N = 1, the probability that a given address will have no keys hashed to it (x = 0) becomes

    p(0) = (1^0 e^-1) / 0! = 0.368.

The probabilities that a given address will have exactly one, two, or three keys, respectively, hashed to it are

    p(1) = (1^1 e^-1) / 1! = 0.368
    p(2) = (1^2 e^-1) / 2! = 0.184
    p(3) = (1^3 e^-1) / 3! = 0.061.

If we can use the Poisson function to estimate the probability that a given address will have a certain number of records, we can also use it to predict the number of addresses that will have a certain number of records assigned.

For example, suppose there are 1,000 addresses (N = 1,000) and 1,000 records (r = 1,000). Multiplying 1,000 by the probability that a given address will have x records assigned to it gives the expected total number of addresses with x records assigned to them. That is, 1,000 p(x) gives the number of addresses with x records assigned to them.

In general, if there are N addresses, then the expected number of addresses with x records assigned to them is N p(x).

This suggests another way of thinking about p(x). Rather than thinking about p(x) as a measure of probability, we can think of p(x) as giving the proportion of addresses having x logical records assigned by hashing.

Now that we have a tool for predicting the expected proportion of addresses that will have zero, one, two, etc. records assigned to them by a random hashing function, we can apply this tool to predicting numbers of collisions.
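Because the Poisson function comes up repeatedly in the rest of the chapter, it is worth having it in executable form. The following C function is a sketch of the formula above; the name poisson() is our own choice.

    #include <stdio.h>
    #include <math.h>

    /* p(x) = ((r/N)^x * e^(-r/N)) / x!  -- the expected proportion of
       addresses that have x records assigned when r records are hashed
       randomly to N addresses.                                          */
    double poisson(double N, double r, int x)
    {
        double lambda = r / N;
        double p = exp(-lambda);      /* start with p(0)                 */
        int i;
        for (i = 1; i <= x; i++)
            p *= lambda / i;          /* build (lambda^x)/x! one step at a time */
        return p;
    }

    int main(void)
    {
        int x;
        for (x = 0; x <= 3; x++)      /* N = r = 1,000 reproduces        */
            printf("p(%d) = %.3f\n",  /* 0.368, 0.368, 0.184, 0.061      */
                   x, poisson(1000.0, 1000.0, x));
        return 0;
    }

Multiplying each value by N gives the expected number of addresses with x records assigned, as in the discussion above.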

10.3.4 Predicting Collisions for a Full File

Suppose you have a hashing function that you believe will distribute records randomly, and you want to store 10,000 records in 10,000 addresses. How many addresses do you expect to have no records assigned to them?

Since r = 10,000 and N = 10,000, r/N = 1. Hence the proportion of addresses with 0 records assigned should be

    p(0) = (1^0 e^-1) / 0! = 0.3679.

The number of addresses with no records assigned is

    10,000 x p(0) = 3,679.

How many addresses should have one, two, and three records assigned, respectively?

    10,000 x p(1) = 0.3679 x 10,000 = 3,679
    10,000 x p(2) = 0.1839 x 10,000 = 1,839
    10,000 x p(3) = 0.0613 x 10,000 = 613.

Since the 3,679 addresses corresponding to x = 1 have exactly one record assigned to them, their records have no synonyms. The 1,839 addresses with two records apiece, however, represent potential trouble. If each such address has space only for one record, and two records are assigned to them, there is a collision. This means that 1,839 records will fit into the addresses, but another 1,839 will not fit. There will be 1,839 overflow records.

Each of the 613 addresses with three records apiece has an even bigger problem. If each address has space for only one record, there will be two overflow records per address. Corresponding to these addresses will be a total of 2 x 613 = 1,226 overflow records. This is a bad situation. We have thousands of records that do not fit into the addresses assigned by the hashing function. We need to develop a method for handling these overflow records. But first, let's try to reduce the number of overflow records.

10.4 How Much Extra Memory Should Be Used?

We have seen the importance of choosing a good hashing algorithm to reduce collisions. A second way to decrease the number of collisions (and thereby decrease the average search length) is to use extra memory. The tools developed in the previous section can be used to help us determine the effect of the use of extra memory on performance.

10.4.1 Packing Density

The term packing density refers to the ratio of the number of records to be stored (r) to the number of available spaces (N):†

    packing density = Number of records / Number of spaces = r/N.

For example, if there are 75 records (r = 75) and 100 addresses (N = 100), the packing density is

    75/100 = 0.75 = 75%.

The packing density gives a measure of the amount of space in a file that is actually used, and it is the only such value needed to assess performance in a hashing environment, assuming that the hash method used gives a reasonably random distribution of records. The raw size of a file and its address space do not matter; what is important is the relative sizes of the two, which are given by the packing density.

Think of packing density in terms of tin cans lined up on a 10-foot length of fence. If there are 10 tin cans and you throw a rock, there is a certain likelihood that you will hit a can. If there are 20 cans on the same length of fence, the fence has a higher packing density, and your rock is more likely to hit a can. So it is with records in a file. The more records there are packed into a given file space, the more likely it is that a collision will occur when a new record is added.

†We assume here that only one record can be stored at each address. In fact, that is not necessarily the case, as we see later.

We need to decide how much space we are willing to waste to reduce the number of collisions. The answer depends in large measure on particular circumstances. We want to have as few collisions as possible, but not, for example, at the expense of requiring the file to use two disks instead of one.
the expense of requiring the

10.4.2 Predicting Collisions for Different Packing Densities

We need a quantitative description of the effects of changing the packing density. In particular, we need to be able to predict the number of collisions that are likely to occur for a given packing density. Fortunately, the Poisson function provides us with just the tool to do this.

You may have noted already that the formula for packing density (r/N) occurs twice in the Poisson formula

    p(x) = ((r/N)^x e^(-r/N)) / x!.

Indeed, the numbers of records (r) and addresses (N) always occur together as the ratio r/N. They never occur independently. An obvious implication of this is that the way records are distributed depends partly on the ratio of the number of records to the number of available addresses, and not on the absolute numbers of records or addresses. The same behavior is exhibited by 500 records distributed among 1,000 addresses as by 500,000 records distributed among 1,000,000 addresses.

Suppose that 1,000 addresses are allocated to hold 500 records in a randomly hashed file, and that each address can hold one record. The packing density for the file is

    r/N = 500/1,000 = 0.5.

Let us answer the following questions about the distribution of records among the available addresses in the file:

- How many addresses should have no records assigned to them?
- How many addresses should have exactly one record assigned (no synonyms)?
- How many addresses should have one record plus one or more synonyms?
- Assuming that only one record can be assigned to each home address, how many overflow records can be expected?
- What percentage of records should be overflow records?

HASHING

1.

How many

addresses should have no records assigned to them? Since p(0)

gives the proportion of addresses with no records assigned, the

ber of such addresses

Np(0)

2.

How many

num-

is

1,000 x

=
-

607.

5 )" g

(-

1,000 x 0.607

addresses should have exactly one record assigned (no syn-

onyms)?
Np(\)

3.

How many

1,000 x

=
=

303.

^2L

1,000 x 0.303

addresses should have one record plus one or

more synonyms?

The

values o p(2), p(3), p(4), and so on give the proportions of addresses with one, two, three, and so on synonyms assigned to them.

Hence

the

sum
p(2)

gives the proportion of

all

may

appear to require

p(3)

addresses with at least one

a great deal

since the values of p(x)

p(4)

grow

synonym. This

of computation, but

it

doesn't

quite small for x larger than 3. This

should make intuitive sense. Since the

file is

50%

only

loaded, one

would not expect very many keys to hash to any one address.
Therefore, the number of addresses with more than about three keys
hashed to them should be quite small. We need only compute the results up to p(5) before they become insignificantly small:
p(2)

p(2>)

p(4)

p(5)

=
=

N and this

Assuming

many

0.0002

or

more synonyms

is

just the

result:

N\p{2)

4.

0.0016

0.0902.

The number of addresses with one


product of

+ 0.0126 +

0.0758

p{3)

=
-

1,000

x 0.0902

90.

that only one record can be assigned to each

home

address,

how

overflow records could be expected? For each of the addresses rep-

resented by p(2), one record can be stored at the address and one
must be an overflow record. For each address represented by p{2>),

one record can be stored

at

the address, two are overflow records,

HOW MUCH EXTRA MEMORY SHOULD

465

BE USED?

and so on. Hence, the expected number of overflow records

is

given by
1

N
=
=

5.

+ 2 X

x p (2)

NX

[1

1,000 x

N X p(3)

[1

+ 3 x iV x p(4) + 4 X
X p (5)
x p (3) + 3 x p (4) + 4 x p(5)]
x 0.0758 + 2 x 0.0126 + 3x 0.0016 + 4 x 0.0002]

p(2)

107.

Wliat percentage of records should be overflow records? If there are 107

overflow records and 500 records


flow records is

jjgConclusion:

only one record,

0.214

in

all,

then the proportion of over-

= 21.4%.

If the

packing density

we

can expect about

50% and each address can hold


21% of all records to be stored

is

somewhere other than at their home addresses.


Table 10.2 shows the proportion of records that are not stored in their home addresses for several different packing densities. The table shows that if the packing density is 10%, then about 5% of the time we try to access a record, there is already another record there. If the density is 100%, then about 37% of all records collide with other records at their home addresses. The 4.8% collision rate that results when the packing density is 10% looks very good until you realize that for every record in your file there will be nine unused spaces!

TABLE 10.2 Effect of packing density on the proportion of records not stored at their home addresses

    Packing Density (%)    Synonyms as % of Records
            10                      4.8
            20                      9.4
            30                     13.6
            40                     17.6
            50                     21.4
            60                     24.8
            70                     28.1
            80                     31.2
            90                     34.1
           100                     36.8

The 36.8% collision rate that results from 100% usage looks good when viewed in terms of 0% unused space. Unfortunately, 36.8% doesn't tell the whole story. If 36.8% of the records are not at their home addresses, then they are somewhere else, probably in many cases using addresses that are home addresses for other records. The more homeless records there are, the more contention there is for space with other homeless records. After a while, clusters of overflow records can form, leading in some cases to extremely long searches for some of the records. Clearly, the placement of records that collide is an important matter. Let us now look at one simple approach to placing overflow records.
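The second column of Table 10.2 can be generated from the Poisson function. The short C program below is a sketch of that calculation and reproduces the table to within rounding; the function name overflow_percent() and the cutoff at 20 terms are our own choices.

    #include <stdio.h>
    #include <math.h>

    /* Percentage of records not stored at their home addresses when each
       address holds one record and the packing density is r/N.           */
    double overflow_percent(double density)
    {
        double term = exp(-density);          /* term starts as p(0)       */
        double overflow = 0.0;
        int x;
        for (x = 1; x <= 20; x++) {
            term *= density / x;              /* term is now p(x)          */
            if (x >= 2)
                overflow += (x - 1) * term;   /* x records, x - 1 overflow */
        }
        return 100.0 * overflow / density;    /* overflow records / records */
    }

    int main(void)
    {
        int pd;
        for (pd = 10; pd <= 100; pd += 10)
            printf("%3d%%   %4.1f\n", pd, overflow_percent(pd / 100.0));
        return 0;
    }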

10.5 Collision Resolution by Progressive Overflow


Even

if a

hashing algorithm

is

occur. Therefore, any hashing

very good,

is

likely that collisions will

program must incorporate some method

dealing with records that cannot

number of techniques

it

fit

into their

home

for

addresses. There are a

for handling overflow records, and the search for

FIGURE 10.4 Collision resolution with progressive overflow.

Novak
Rosen

York's

home

address (busy)
Jaspei

2nd

Moreley

-3rd try (busy)

try (busy)

-4th try (open)

York's actual

address

467

COLLISION RESOLUTION BY PROGRESSIVE OVERFLOW

Key
Blue

"1
98

Hash

Address

routine

"U

99

99

Jello

Wrapping around

FIGURE 10.5 Searching

for

an address beyond the end of

ever-better techniques continues to be


several approaches, but

works
and

well.

we

The technique

a file.

a lively area

concentrate on

of research.

We examine

very simple one that often

has various names, including progressive overflow

linear probing.

How

10.5.1

An example

Progressive Overflow Works

occurs is shown in Fig. 10.4.


whose key is York in the file.
Unfortunately, the name York hashes to the same address as the name
Rosen, whose record is already stored there. Since York cannot fit in its

of a situation

In the example,

home

address,

we want

it is

in

which

a collision

to store the record

an overflow record.

If

progressive overflow

used, the

is

next several addresses are searched in sequence until an empty one

The

first free

address 9
is

is

the

Eventually
6,

found.

record found empty, so the record pertaining to

first

stored in address

hashes to

is

address becomes the address of the record. In the example,

York

9.

we need

to find

York's record in the

file.

the search for the record begins at address

York's record there, so

it

proceeds to look

where it finds York.


An interesting problem occurs when

at

Since

6. It

York

still

does not find

successive records until

it

gets

to address 9,

or for

record

at the

end of the

file.

This

there
is

is

search for an open space

illustrated in Fig. 10.5, in

which

468

HASHING

assumed that the file can hold 100 records in addresses 0-99. Blue is
hashed to record number 99, which is already occupied by Jello. Since the
file holds only 100 records, it is not possible to use 100 as the next address.
The way this is handled in progressive overflow is to wrap around the
address space of the file by choosing address
as the next address. Since, in
this case, address
is not occupied, Blue gets stored in address 0.
What happens if there is a search for a record, but the record was never placed in the file? The search begins, as before, at the record's home address and then proceeds to look for it in successive locations. Two things can happen:

If an open address is encountered, the searching routine might assume that this means the record is not in the file; or

If the file is full, the search comes back to where it began. Only then is it clear that the record is not in the file.

When this occurs, as happens when we approach filling our file, searching can become slow, or even intolerably slow, whether or not the record being sought is in the file.

The greatest strength of progressive overflow is its simplicity. In many cases, it is a perfectly adequate method. There are, however, many collision-handling techniques that perform better than progressive overflow, and we examine some of them later in this chapter. Now let us look at the effect of progressive overflow on performance.
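The two ways a search can end, finding the record or concluding that it is absent, can be sketched the same way as the store operation. The slot array, the empty-slot convention, and the search() name below are again assumptions made for the example, not a fixed interface.

    #include <string.h>

    #define FILE_SIZE 100

    static char slot[FILE_SIZE][16];    /* stored keys; "" marks an open address */

    /* Search for a key using progressive overflow.  Returns the address of the
       record, or -1 when the search either hits an open address or comes all
       the way back to where it began: the two ways of concluding that the
       record is not in the file.                                              */
    int search(const char *key, int home)
    {
        for (int i = 0; i < FILE_SIZE; i++) {
            int addr = (home + i) % FILE_SIZE;
            if (slot[addr][0] == '\0')
                return -1;              /* open address: record was never stored */
            if (strcmp(slot[addr], key) == 0)
                return addr;            /* found it */
        }
        return -1;                      /* file is full and we wrapped around */
    }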

10.5.2 Search Length

The reason to avoid overflow is, of course, that extra searches (hence, extra disk accesses) have to occur when a record is not found in its home address. If there are a lot of collisions, there are going to be a lot of overflow records taking up spaces where they ought not to be. Clusters of records can form, resulting in the placement of records a long way from home, so many disk accesses are required to retrieve them.

Consider the following set of keys and the corresponding addresses produced by some hash function.

    Key        Home Address
    Adams      20
    Bates      21
    Cole       21
    Dean       22
    Evans      20

If these records are loaded into an empty file, and progressive overflow is used to resolve collisions, only two of the records will be at their home addresses. All the others require extra accesses to retrieve. Figure 10.6 shows where each key is stored, together with information on how many accesses are required to retrieve it.

               Home       Actual     Number of accesses
    Key        address    address    needed to retrieve
    Adams      20         20         1
    Bates      21         21         1
    Cole       21         22         2
    Dean       22         23         2
    Evans      20         24         5

FIGURE 10.6 Illustration of the effects of clustering of records. As keys are clustered, the number of accesses required to access later keys can become large.

The term search length refers to the number of accesses required to retrieve a record from secondary memory. In the context of hashing, the search length for a record increases every time there is a collision. If a record is a long way from its home address, the search length may be unacceptable. A good measure of the extent of the overflow problem is average search length. The average search length is just the average number of times you can expect to have to access the disk to retrieve a record. A rough estimate of average search length may be computed by finding the total search length (the sum of the search lengths of the individual records) and dividing this by the number of records:

    Average search length = total search length / total number of records

In the example, the average search length for the five records is

    (1 + 1 + 2 + 2 + 5) / 5 = 2.2
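The same figure is easy to reproduce by loading the five example keys in order and counting probes. The following sketch assumes the home addresses given above and a small in-memory table standing in for the file.

    #include <stdio.h>

    #define TABLE_SIZE 30            /* more than enough room for the example */

    int main(void)
    {
        const char *key[] = { "Adams", "Bates", "Cole", "Dean", "Evans" };
        int home[]        = { 20, 21, 21, 22, 20 };   /* from the hash function */
        int occupied[TABLE_SIZE] = { 0 };
        int total = 0;

        for (int i = 0; i < 5; i++) {
            int probes = 1;
            int addr = home[i];
            while (occupied[addr]) {             /* progressive overflow */
                addr = (addr + 1) % TABLE_SIZE;
                probes++;
            }
            occupied[addr] = 1;
            total += probes;
            printf("%-6s home %2d  actual %2d  accesses %d\n",
                   key[i], home[i], addr, probes);
        }
        printf("Average search length = %.1f\n", total / 5.0);
        return 0;
    }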

With no collisions at all, the average search length is 1, since only one access is needed to retrieve any record. (We indicated earlier that an algorithm that distributes records so evenly that no collisions occur is appropriately called a perfect hashing algorithm and that, unfortunately, such an algorithm is almost impossible to construct.) On the other hand, if a large number of the records in a file result in collisions, the average search length becomes quite long. There are ways to estimate the expected average search length, given various file specifications, and we discuss them in a later section.

It turns out that, using progressive overflow, the average search length goes up very rapidly as the packing density increases. The curve in Fig. 10.7, adapted from Peterson (1957), illustrates the problem. If the packing density is kept as low as 60%, the average record takes fewer than two tries to access, but for a much more desirable packing density of 80% or more, it increases very rapidly.

FIGURE 10.7 Average search length versus packing density in a hashed file in which one record can be stored per address, progressive overflow is used to resolve collisions, and the file has just been loaded.

Average search lengths of greater than 2.0 are generally considered unacceptable, so it appears that it is usually necessary to use less than 60% of your storage space to get tolerable performance. Fortunately, we can improve on this situation substantially by making one small change to our hashing program. The change involves putting more than one record at a single address.

10.6 Storing More Than One Record per Address: Buckets

Recall that when a computer receives information from a disk, it is just about as easy for the I/O system to transfer several records as it is to transfer a single record. Recall too that sometimes it might be advantageous to think of records as being grouped together in blocks rather than stored individually. Therefore, why not extend the idea of a record address in a file to an address of a group of records? The word bucket is sometimes used to describe a block of records that is retrieved in one disk access, especially when those records are seen as sharing the same address. On sector-addressing disks, a bucket typically consists of one or more sectors; on block-addressing disks, a bucket might be a block.

Consider the following set of keys, which is to be loaded into a hash file.

    Key        Home Address
    Green      30
    Hall       30
    Jenks      32
    King       33
    Land       33
    Marx       33
    Nutt       33

Figure 10.8 illustrates part of a file into which the records with these keys are loaded. Each address in the file identifies a bucket capable of holding the records corresponding to three synonyms. Only the record corresponding to Nutt cannot be accommodated in a home address.

When a record is to be stored or retrieved, its home bucket address is determined by hashing. The entire bucket is loaded into primary memory. An in-RAM search through successive records in the bucket can then be used to find the desired record. When a bucket is filled, we still have to worry about the record overflow problem (as in the case of Nutt), but this occurs much less often when buckets are used than when each address can hold only one record.

    Bucket address    Bucket contents
    30                Green   Hall    . . .
    31
    32                Jenks   . . .
    33                King    Land    Marx    (Nutt is an overflow record)

FIGURE 10.8 An illustration of buckets. Each bucket can hold up to three records. Only one synonym (Nutt) results in overflow.

10.6.1 Effects of Buckets on Performance

When buckets are used, the formula used to compute packing density is changed slightly since each bucket address can hold more than one record. To compute how densely packed a file is, we need to consider both the number of addresses (buckets) and the number of records we can put at each address (bucket size). If N is the number of addresses and b is the number of records that fit in a bucket, then bN is the number of available locations for records. If r is still the number of records in the file, then

    Packing density = r / bN

Suppose we have a file in which 750 records are to be stored. Consider the following two ways we might organize the file.

We can store the 750 data records among 1,000 locations, where each location can hold one record. The packing density in this case is

    750 / 1,000 = 75%.

We can store the 750 records among 500 locations, where each location has a bucket size of 2. There are still 1,000 places (2 x 500) to store the 750 records, so the packing density is still

    r / bN = 0.75 = 75%.

Since the packing density is not changed, we might at first not expect the use of buckets in this way to improve performance, but in fact it does improve performance dramatically. The key to the improvement is that, although there are fewer addresses, each individual address has more room for variation in the number of records assigned to it.

Let's calculate the difference in performance for these two ways of storing the same number of records in the same amount of space. The starting point for our calculations is the fundamental description of each file structure.

                                      File without        File with
                                      Buckets             Buckets

    Number of records                 r = 750             r = 750
    Number of addresses               N = 1,000           N = 500
    Bucket size                       b = 1               b = 2
    Packing density                   0.75                0.75
    Ratio of records to addresses     r/N = 0.75          r/N = 1.5

To determine the number of overflow records that are expected in the case of each file, recall that when a random hashing function is used, the Poisson function

    p(x) = ( (r/N)^x e^(-r/N) ) / x!

gives the expected proportion of addresses assigned x records. Evaluating the function for the two different file organizations, we find that records are assigned to addresses according to the distributions shown in Table 10.3.

We see from the table that when buckets are not used, 47.2% of the addresses have no records assigned, whereas when two-record buckets are used, only 22.3% of the addresses have no records assigned. This should make intuitive sense since in the two-record case there are only half as many addresses to choose from, so it stands to reason that a greater proportion of the addresses are chosen to contain at least one record.

Note that the bucket column in Table 10.3 is longer than the nonbucket column. Does this mean that there are more synonyms in the bucket case than in the nonbucket case? Indeed it does, but half of those synonyms do not result in overflow records because each bucket can hold two records. Let us examine this further by computing the exact number of overflow records likely to occur in the two cases.

TABLE 10.3 Poisson distributions for two different file organizations

            File without Buckets      File with Buckets
            (r/N = 0.75)              (r/N = 1.5)

    p(0)    0.472                     0.223
    p(1)    0.354                     0.335
    p(2)    0.133                     0.251
    p(3)    0.033                     0.126
    p(4)    0.006                     0.047
    p(5)    0.001                     0.014
    p(6)                              0.004
    p(7)                              0.001

In the case of the file with bucket size one, any address that is assigned exactly one record does not have any overflow. Any address with more than one record does have overflow. Recall that the expected number of overflow records is given by

    N x [1 x p(2) + 2 x p(3) + 3 x p(4) + 4 x p(5) + . . .]

which, for r/N = 0.75 and N = 1,000, is approximately

    1,000 x [1 x 0.1328 + 2 x 0.0332 + 3 x 0.0062 + 4 x 0.0009 + 5 x 0.0001] = 222.

The 222 overflow records represent 29.6% overflow.

In the case of the bucket file, any address that is assigned either one or two records does not have overflow. The value of p(1) (with r/N = 1.5) gives the proportion of addresses assigned exactly one record, and p(2) (with r/N = 1.5) gives the proportion of addresses assigned exactly two records. It is not until we get to p(3) that we encounter addresses for which there are overflow records. For each address represented by p(3), two records can be stored at the address, and one must be an overflow record. Similarly, for each address represented by p(4), there are two overflow records, and so forth. Hence, the expected number of overflow records in the bucket file is

    N x [1 x p(3) + 2 x p(4) + 3 x p(5) + 4 x p(6) + . . .],

which, for r/N = 1.5 and N = 500, is approximately

    500 x [1 x 0.1255 + 2 x 0.0471 + 3 x 0.0141 + 4 x 0.0035 + 5 x 0.0008] = 140.

The 140 overflow records represent 18.7% overflow.
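Both expected overflow counts, and the proportions in Table 10.3, can be recomputed directly from the Poisson function. The short program below is a sketch under the assumptions of this example (750 records; either 1,000 one-record addresses or 500 two-record buckets); the function names are ours, and small rounding differences from the hand calculation above are to be expected.

    #include <stdio.h>
    #include <math.h>

    /* Poisson function: expected proportion of addresses assigned x records */
    double p(double ratio, int x)
    {
        double result = exp(-ratio);
        for (int i = 1; i <= x; i++)
            result *= ratio / i;          /* builds (ratio^x / x!) * e^(-ratio) */
        return result;
    }

    /* Expected number of overflow records for N addresses of bucket size b:
       an address assigned x records overflows by (x - b) records when x > b. */
    double overflow(double ratio, int n_addresses, int b)
    {
        double sum = 0.0;
        for (int x = b + 1; x <= 20; x++)     /* terms beyond 20 are negligible */
            sum += (x - b) * p(ratio, x);
        return n_addresses * sum;
    }

    int main(void)
    {
        double r = 750.0;
        printf("No buckets:    %.0f overflow records (%.1f%%)\n",
               overflow(r / 1000.0, 1000, 1),
               100.0 * overflow(r / 1000.0, 1000, 1) / r);
        printf("Bucket size 2: %.0f overflow records (%.1f%%)\n",
               overflow(r / 500.0, 500, 2),
               100.0 * overflow(r / 500.0, 500, 2) / r);
        return 0;
    }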


We have shown that with one record per address and a packing density
of 75%, the expected number of overflow records is 29.6%. When 500
buckets are used, each capable of holding two records, the packing density
remains 75%, but the expected number of overflow records drops to
18.7%. That is about a 37% decrease in the number of times the program
is going to have to look elsewhere for a record. As the bucket size gets
performance continues to improve.
Table 10.4 shows the proportions of collisions that occur for different
packing densities and for different bucket sizes. We see from the table, for
example, that if we keep the packing density at 75% and increase the bucket
size to 10, record accesses result in overflow only 4% of the time.
It should be clear that the use of buckets can improve hashing
performance substantially. One might ask, "How big should buckets be?"
Unfortunately, there is no simple answer to this question because it depends
very much on a number of different characteristics of the system, including
larger,

TABLE 10.4

Synonyms causing
densities

collisions as a percent of records


nd different bucket sizes

for different

packing

Bucket Size

Packing
Density
<%)

10

100

10

4.8

0.6

0.0

0.0

0.0

20

9.4

2.2

0.1

0.0

0.0

30

13.6

4.5

0.4

0.0

0.0

40

17.6

7.3

1.1

0.1

0.0

50

21.3

10.4

2.5

0.4

0.0

60

24.8

13.7

4.5

1.3

0.0

70

28.1

17.0

7.1

2.9

0.0

75

29.6

18.7

8.6

4.0

0.0

80

31.2

20.4

10.3

5.3

0.1

90

34.1

23.8

13.8

8.6

0.8

100

36.8

27.1

17.6

12.5

4.0

476

HASHING

the sizes of buffers the operating system can manage, sector and track

on

capacities

and access times of the hardware

disks,

(seek, rotation,

and

data transfer times).

As a rule, it is probably not a good idea to use buckets larger than a track (unless records are very large). Even a track, however, can sometimes be too large when one considers the amount of time it takes to transmit an entire track, as compared to the amount of time it takes to transmit a few sectors. Since hashing almost always involves retrieving only one record per search, any extra transmission time resulting from the use of extra-large buckets is essentially wasted.

In many cases a single cluster is the best bucket size. For example, suppose that a file with 200-byte records is to be stored on a disk system that uses 1,024-byte clusters. One could consider each cluster as a bucket, store five records per cluster, and let the remaining 24 bytes go unused. Since it is no more expensive, in terms of seek time, to access a five-record cluster than it is to access a single record, the only losses from the use of buckets are the extra transmission time and the 24 unused bytes.

The obvious question now is, "How do improvements in the number of collisions affect the average search time?" The answer depends in large measure on characteristics of the drive on which the file is loaded. If there are a large number of tracks in each cylinder, there will be very little seek time because overflow records will be unlikely to spill over from one cylinder to another. If, on the other hand, there is only one track per cylinder, seek time could be a major consumer of search time.

A less exact measure of the amount of time required to retrieve a record is average search length, which we introduced earlier. In the case of buckets, average search length represents the average number of buckets that must be accessed to retrieve a record. Table 10.5 shows the expected average search lengths for files with different packing densities and bucket sizes, given that progressive overflow is used to handle collisions. Clearly, the use of buckets seems to help a great deal in decreasing the average search length. The bigger the bucket, the shorter the search length.

TABLE 10.5 Average number of accesses required in a successful search by progressive overflow

    Packing                       Bucket Sizes
    Density
    (%)           1        2        5       10       50

     10          1.06     1.01     1.00     1.00     1.00
     30          1.21     1.06     1.00     1.00     1.00
     40          1.33     1.10     1.01     1.00     1.00
     50          1.50     1.18     1.03     1.00     1.00
     60          1.75     1.29     1.07     1.01     1.00
     70          2.17     1.49     1.14     1.04     1.00
     80          3.00     1.90     1.29     1.11     1.01
     90          5.50     3.15     1.78     1.35     1.04
     95         10.50     5.6      2.7      1.8      1.1

Adapted from Donald Knuth, The Art of Computer Programming, Vol. 3, © 1973, Addison-Wesley, Reading, Mass. Page 536. Reprinted with permission.

10.6.2 Implementation Issues

In the early chapters of this text, we paid quite a bit of attention to issues involved in producing, using, and maintaining random-access files with fixed-length records that are accessed by relative record number (RRN). Since a hashed file is a fixed-length record file whose records are accessed by RRN, you should already know much about implementing hashed files. Hashed files differ from the files we discussed earlier in two important respects, however:

1. Since a hash function depends on there being a fixed number of available addresses, the logical size of a hashed file must be fixed before the file can be populated with records, and it must remain fixed as long as the same hash function is used. (We use the phrase logical size to leave open the possibility that physical space be allocated as needed.)

2. Since the RRN of a record in a hashed file is uniquely related to its key, any procedures that add, delete, or change a record must do so without breaking the bond between a record and its home address. If this bond is broken, the record is no longer accessible by hashing.

We must keep these special needs in mind when we write programs to work with hashed files.

Bucket Structure  The only difference between a file with buckets and one in which each address can hold only one key is that with a bucket file each address has enough space to hold more than one logical record. All records that are housed in the same bucket share the same address. Suppose, for example, that we want to store as many as five names in one bucket. Here are three such buckets with different numbers of records.

    An empty bucket:    0   /////   /////       /////      /////   /////

    Two entries:        2   JONES   ARNSWORTH   /////      /////   /////

    A full bucket:      5   JONES   ARNSWORTH   STOCKTON   BRICE   THROOP

Each bucket contains a counter that keeps track of how many records it has stored in it. Collisions can occur only when the addition of a new record causes the counter to exceed the number of records a bucket can hold.

The counter tells us how many data records are stored in a bucket, but it does not tell us which slots are used and which are not. We need a way to tell whether or not a record slot is empty. One simple way to do this is to use a special marker to indicate an empty record, just as we did with deleted records earlier. We use the key value ///// to mark empty records in the preceding illustration.

Initializing a File for Hashing  Since the logical size of a hashed file must remain fixed, it makes sense in most cases to allocate physical space for the file before we begin storing data records in it. This is generally done by creating a file of empty spaces for all records, and then filling the slots as they are needed with the data records. (It is not necessary to construct a file of empty records before putting data in it, but doing so increases the likelihood that records will be stored close to one another on the disk, avoids the error that occurs when an attempt is made to read a missing record, and makes it easy to process the file sequentially, without having to treat the empty records in any special way.)
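As a concrete, hedged illustration of these last two ideas (a bucket record with a counter and ///// markers, and pre-allocating the whole file), here is one way the structures and the initialization step might look in C. The struct layout, key length, file name, and bucket count are assumptions chosen for the sketch rather than a prescribed format.

    #include <stdio.h>
    #include <string.h>

    #define BUCKET_SIZE  5          /* records per bucket, as in the illustration */
    #define KEY_LEN      12
    #define NUM_BUCKETS  1000       /* logical size of the hashed file, fixed     */

    typedef struct {
        int  count;                          /* how many records the bucket holds */
        char key[BUCKET_SIZE][KEY_LEN];      /* "/////" marks an empty slot       */
    } BUCKET;

    /* Create the file and fill it with empty buckets so that every address
       exists before any data record is stored.                               */
    int init_hash_file(const char *name)
    {
        BUCKET empty;
        empty.count = 0;
        for (int i = 0; i < BUCKET_SIZE; i++)
            strcpy(empty.key[i], "/////");

        FILE *fp = fopen(name, "wb");        /* file name is illustrative only */
        if (fp == NULL) return -1;
        for (long addr = 0; addr < NUM_BUCKETS; addr++)
            fwrite(&empty, sizeof(BUCKET), 1, fp);
        fclose(fp);
        return 0;
    }

    int main(void)
    {
        return init_hash_file("inventory.hash");
    }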

Loading a Hash File  A program that loads a hash file is similar in many ways to the earlier programs we used for populating fixed-length record files, with two differences. First, the program uses the function hash() to produce a home address for each key. Second, the program looks for a free space for the record by starting with the bucket stored at its home address and then, if the home bucket is full, continuing to look at successive buckets until one is found that is not full. The new record is inserted in this bucket, which is rewritten to the file at the location from which it was loaded.

If, as it searches for an empty bucket, a loading program passes the maximum allowable address, it must wrap around to the beginning address. A potential problem occurs in loading a hash file when so many records have been loaded into the file that there are no empty spaces left. A naive search for an open slot can easily result in an infinite loop. Obviously, we want to prevent this from occurring by having the program make sure that there is space available somewhere in the file for each new record.

Another problem that often arises when adding records to files occurs when an attempt is made to add a record that is already stored in the file. If there is a danger of duplicate keys occurring, and duplicate keys are not allowed in the file, some mechanism must be found for dealing with this problem.
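A sketch of the loading step itself, including the wrap-around and the guard against probing forever, might look like the following. It assumes the same kind of fixed-length BUCKET layout as the initialization sketch above and a hash() function supplied elsewhere; the file is assumed to be open for update ("r+b"). None of this is a fixed interface, only an illustration.

    #include <stdio.h>
    #include <string.h>

    #define BUCKET_SIZE  5
    #define KEY_LEN      12
    #define NUM_BUCKETS  1000

    typedef struct {
        int  count;
        char key[BUCKET_SIZE][KEY_LEN];
    } BUCKET;

    /* Add a key to an already initialized hashed file.  Probing starts at the
       home bucket and wraps past the last address back to address 0.  After
       examining every bucket once we give up rather than loop forever.        */
    int add_key(FILE *fp, const char *key, long home)
    {
        BUCKET b;
        for (long i = 0; i < NUM_BUCKETS; i++) {
            long addr = (home + i) % NUM_BUCKETS;            /* wrap-around */
            fseek(fp, addr * (long) sizeof(BUCKET), SEEK_SET);
            fread(&b, sizeof(BUCKET), 1, fp);
            if (b.count < BUCKET_SIZE) {                     /* room in this bucket */
                strncpy(b.key[b.count], key, KEY_LEN - 1);
                b.key[b.count][KEY_LEN - 1] = '\0';
                b.count++;
                fseek(fp, addr * (long) sizeof(BUCKET), SEEK_SET);
                fwrite(&b, sizeof(BUCKET), 1, fp);           /* rewrite in place */
                return 0;
            }
        }
        return -1;                                           /* file is completely full */
    }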

10.7 Making Deletions

Deleting a record from a hashed file is more complicated than adding a record for two reasons:

The slot freed by the deletion must not be allowed to hinder later searches; and

It should be possible to reuse the freed slot for later additions.

When progressive overflow is used, a search for a record terminates if an open address is encountered. Because of this, we do not want to leave open addresses that break overflow searches improperly. The following example illustrates the problem.

Adams, Jones, Morris, and Smith are stored in a hash file in which each address can hold one record. Adams and Smith both are hashed to address 5, and Jones and Morris are hashed to address 6. If they are loaded in alphabetical order using progressive overflow for collisions, they are stored in the locations shown in Fig. 10.9.

               Home       Actual
    Record     address    address
    Adams      5          5
    Jones      6          6
    Morris     6          7
    Smith      5          8

FIGURE 10.9 File organization before deletions.

A search for Smith starts at address 5 (Smith's home address). It successively looks for Smith at addresses 6, 7, and 8, then finds Smith at 8. Now suppose Morris is deleted, leaving an empty space, as illustrated in Fig. 10.10. A search for Smith again starts at address 5, and then looks at addresses 6 and 7. Since address 7 is now empty, it is reasonable for the search program to conclude that Smith's record is not in the file.

FIGURE 10.10 The same organization as in Fig. 10.9, with Morris deleted.

10.7.1 Tombstones for Handling Deletions

In Chapter 5 we discussed techniques for dealing with the deletion problem. One simple technique we use for identifying deleted records involves replacing the deleted record (or just its key) with a marker indicating that a record once lived there but no longer does. Such a marker is sometimes referred to as a tombstone (Wiederhold, 1983). The nice thing about the use of tombstones is that it solves both of the problems described previously:

The freed space does not break a sequence of searches for a record; and

The freed space is obviously available and may be reclaimed for later additions.

Figure 10.11 illustrates how the sample file might look after the tombstone ###### is inserted for the deleted record. Now a search for Smith does not halt at the empty record number 7. Instead, it uses the ###### as an indication that it should continue the search.

FIGURE 10.11 The same file as in Fig. 10.9 after the insertion of a tombstone for Morris.

It is not necessary to insert tombstones every time a deletion occurs. For example, suppose in the preceding example that the record for Smith is to be deleted. Since the slot following the Smith record is empty, nothing is lost by marking Smith's slot as empty rather than inserting a tombstone. Indeed, it is actually unwise to insert a tombstone where it is not needed. (If, after putting an unnecessary tombstone in Smith's slot, a new record is added at address 9, how would a subsequent unsuccessful search for Smith be affected?)
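In code, deletion with a tombstone is only a matter of writing the special marker instead of clearing the slot. The fragment below is a single-record-per-address sketch using the same in-memory conventions as the earlier examples; the marker strings are the ones used in this chapter's illustrations, and the delete_key() name is an invention for the example.

    #include <string.h>

    #define FILE_SIZE 100

    static char slot[FILE_SIZE][16];     /* "" = never used, "######" = tombstone */

    /* Delete a key by overwriting it with a tombstone.  Later searches treat
       "######" as "keep looking", while "" still means "stop: not in the file". */
    int delete_key(const char *key, int home)
    {
        for (int i = 0; i < FILE_SIZE; i++) {
            int addr = (home + i) % FILE_SIZE;
            if (slot[addr][0] == '\0')
                return -1;                          /* open slot: key not present */
            if (strcmp(slot[addr], key) == 0) {
                strcpy(slot[addr], "######");       /* leave the tombstone behind */
                return addr;
            }
        }
        return -1;
    }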

10.7.2 Implications of Tombstones for Insertions

With the introduction of the use of tombstones, the insertion of records becomes slightly more difficult than our earlier discussions imply. Whereas programs that perform initial loading simply search for the first occurrence of an empty record slot (signified by the presence of the key /////), it is now permissible to insert a record where either ///// or ###### occurs as the key. This new feature, which is desirable because it yields a shorter average search length, brings with it a certain danger. Consider, for example, the earlier example shown in Fig. 10.11, in which Morris is deleted, giving the file organization shown. Now suppose you want a program to insert Smith into the file. If the program simply searches until it encounters a ######, it never notices that Smith is already in the file. We almost certainly don't want to put a second Smith record into the file, since doing so means that later searches would never find the older Smith record. To prevent this from occurring, the program must examine the entire cluster of contiguous keys and tombstones to ensure that no duplicate key exists, and then go back and insert the record in the first available tombstone, if there is one.
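One hedged way to code this "scan the whole cluster first" rule is shown below. It reuses the in-memory slot conventions from the earlier sketches; the insert_key() name and its return values are inventions for the example.

    #include <string.h>

    #define FILE_SIZE 100

    static char slot[FILE_SIZE][16];     /* "" = open, "######" = tombstone */

    /* Insert a key, reusing tombstone slots, without ever creating a duplicate.
       The whole cluster of contiguous keys and tombstones is examined first;
       only after reaching an open slot (or wrapping all the way around) do we
       know the key is absent and can go back to the first reusable slot.      */
    int insert_key(const char *key, int home)
    {
        int first_free = -1;
        for (int i = 0; i < FILE_SIZE; i++) {
            int addr = (home + i) % FILE_SIZE;
            if (slot[addr][0] == '\0') {                 /* end of the cluster   */
                if (first_free < 0) first_free = addr;
                break;
            }
            if (strcmp(slot[addr], "######") == 0) {
                if (first_free < 0) first_free = addr;   /* remember, keep going */
            } else if (strcmp(slot[addr], key) == 0) {
                return -1;                               /* duplicate: refuse    */
            }
        }
        if (first_free < 0) return -1;                   /* file completely full */
        strcpy(slot[first_free], key);
        return first_free;
    }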

10.7.3 Effects of Deletions and Additions on Performance

The use of tombstones enables our search algorithms to work and helps in storage recovery, but one can still expect some deterioration in performance after a number of deletions and additions occur within a file.

Consider, for example, our little four-record file of Adams, Jones, Smith, and Morris. After deleting Morris, Smith is one slot further from its home address than it needs to be. If the tombstone is never to be used to store another record, every retrieval of Smith requires one more access than is absolutely necessary. More generally, after a large number of additions and deletions, one can expect to find many tombstones occupying places that could be occupied by records whose home addresses precede them but that are stored after them. In effect, each tombstone represents an unexploited opportunity to reduce by one the number of locations that must be scanned while searching for these records.

Some experimental studies show that after a 50% to 150% turnover of records, a hashed file reaches a point of equilibrium, so average search length is as likely to get better as it is to get worse (Bradley, 1982; Peterson, 1957). By this time, however, search performance has deteriorated to the point that the average record is three times as far (in terms of accesses) from its home address as it would be after initial loading. This means, for example, that if after original loading the average search length is 1.2, it will be about 1.6 after the point of equilibrium is reached.

There are three types of solutions to the problem of deteriorating average search lengths. One involves doing a bit of local reorganizing every time a deletion occurs. For example, the deletion algorithm might examine the records that follow a tombstone to see if the search length can be shortened by moving the record backward toward its home address. Another solution involves completely reorganizing the file after the average search length reaches an unacceptable value. A third type of solution involves using an altogether different collision resolution algorithm.

10.8 Other Collision Resolution Techniques

Despite its simplicity, randomized hashing using progressive overflow with reasonably sized buckets generally performs well. If it does not perform well enough, however, there are a number of variations that may perform even better. In this section we discuss some refinements that can often improve hashing performance when using external storage.

10.8.1 Double Hashing

One of the problems with progressive overflow is that if many records hash to buckets in the same vicinity, clusters of records can form. As the packing density approaches one, this clustering tends to lead to extremely long searches for some records. One method for avoiding clustering is to store overflow records a long way from their home addresses by double hashing. With double hashing, when a collision occurs, a second hash function is applied to the key to produce a number c that is relatively prime to the number of addresses.* The value c is added to the home address to produce the overflow address. If the overflow address is already occupied, c is added to it to produce another overflow address. This procedure continues until a free overflow address is found.

Double hashing does tend to spread out the records in a file, but it suffers from a potential problem that is encountered in several improved overflow methods: It violates locality by deliberately moving overflow records some distance from their home addresses, increasing the likelihood that the disk will need extra time to get to the new overflow address. If the file covers more than one cylinder, this could require an expensive extra head movement. Double hashing programs can solve this problem if they are able to generate overflow addresses in such a way that overflow records are kept on the same cylinder as home records.

*If N is the number of addresses, then c and N are relatively prime if they have no common divisors.
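A minimal sketch of the double-hashing probe sequence, assuming a prime number of addresses so that any step size is automatically relatively prime to N, is given below. The hash2() function shown is an invented placeholder; any second hash that yields a step between 1 and N - 1 would serve.

    #define N 101      /* number of addresses; choosing a prime means every
                          step size in 1..N-1 is automatically relatively prime to N */

    /* An assumed second hash function: any function of the key that yields a
       step size between 1 and N-1 will do for this sketch.                     */
    int hash2(const char *key)
    {
        unsigned sum = 0;
        while (*key) sum = sum * 31 + (unsigned char) *key++;
        return 1 + (int)(sum % (N - 1));       /* never 0, always less than N */
    }

    /* Address to examine on the given attempt (attempt 0 is the home address).
       Each collision adds c to the previous address, modulo N.                 */
    int double_hash_probe(const char *key, int home, int attempt)
    {
        int c = hash2(key);
        return (int)(((long) home + (long) attempt * c) % N);
    }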

10.8.2 Chained Progressive Overflow

Chained progressive overflow is another technique designed to avoid the problems caused by clustering. It works in the same manner as progressive overflow, except that synonyms are linked together with pointers. That is, each home address contains a number indicating the location of the next record with the same home address. The next record in turn contains a pointer to the following record with the same home address, and so forth. The net effect of this is that for each set of synonyms there is a linked list connecting their records, and it is this list that is searched when a record is sought.

The advantage of chained progressive overflow over simple progressive overflow is that only records with keys that are synonyms need to be accessed in any given search. Suppose, for example, that the set of keys shown in Fig. 10.12 is to be loaded in the order shown into a hash file with bucket size one, and progressive overflow is used. A search for Cole involves an access to Adams (a synonym) and Bates (not a synonym). Flint, the worst case, requires six accesses, only two of which involve synonyms.

               Home       Actual      Search
    Key        address    address     length
    Adams      20         20          1
    Bates      21         21          1
    Cole       20         22          3
    Dean       21         23          3
    Evans      24         24          1
    Flint      20         25          6

    Average search length = (1 + 1 + 3 + 3 + 1 + 6)/6 = 2.5

FIGURE 10.12 Hashing with progressive overflow.

Since Adams, Cole, and Flint are synonyms, a chaining algorithm forms a linked list connecting these three names, with Adams at the head of the list. Since Bates and Dean are also synonyms, they form a second list. This arrangement is illustrated in Fig. 10.13. The average search length decreases from 2.5 to

    (1 + 1 + 2 + 2 + 1 + 3)/6 = 1.7

    Home       Actual                 Address of        Search
    address    address    Data        next synonym      length
    20         20         Adams       22                1
    21         21         Bates       23                1
    20         22         Cole        25                2
    21         23         Dean        -1                2
    24         24         Evans       -1                1
    20         25         Flint       -1                3

FIGURE 10.13 Hashing with chained progressive overflow. Adams, Cole, and Flint are synonyms; Bates and Dean are synonyms.
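A sketch of the chained search might look like this; the SLOT layout, the NIL value, and the assumption that every home address holds a home record (the guarantee discussed next) are choices made for the illustration, not a prescribed structure.

    #include <string.h>

    #define FILE_SIZE 100
    #define NIL (-1)

    typedef struct {
        char key[16];     /* "" if the slot is unused                   */
        int  next;        /* address of the next synonym, or NIL at end */
    } SLOT;

    static SLOT table[FILE_SIZE];

    /* Search with chained progressive overflow: only the synonyms on the
       linked list that starts at the home address are ever examined.      */
    int chained_search(const char *key, int home)
    {
        int addr = home;
        while (addr != NIL && table[addr].key[0] != '\0') {
            if (strcmp(table[addr].key, key) == 0)
                return addr;              /* found on the synonym chain */
            addr = table[addr].next;      /* follow the link field      */
        }
        return NIL;                       /* chain exhausted: not in the file */
    }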

The use of chained progressive overflow requires that we attend to some details that are not required for simple progressive overflow. First, a link field must be added to each record, requiring the use of a little more storage. Second, a chaining algorithm must guarantee that it is possible to get to any synonym by starting at its home address. This second requirement is not a trivial one, as the following example shows.

Suppose that in the example Dean's home address is 22 instead of 21. Since, by the time Dean is loaded, address 22 is already occupied by Cole, Dean still ends up at address 23. Does this mean that Cole's pointer should point to 23 (Dean's actual address) or to 25 (the address of Cole's synonym Flint)? If the pointer is 25, the linked list joining Adams, Cole, and Flint is kept intact, but Dean is lost. If the pointer is 23, Flint is lost.

The problem here is that a certain address (22) that should be occupied by a home record (Dean) is occupied by a different record. One solution to the problem is to require that every address that qualifies as a home address for some record in the file actually hold a home record. The problem can be handled easily when a file is first loaded by using a technique called two-pass loading.

Two-pass loading, as the name implies, involves loading a hash file in two passes. On the first pass, only home records are loaded. All records that are not home records are kept in a separate file. This guarantees that no potential home addresses are occupied by overflow records. On the second pass, each overflow record is loaded and stored in one of the free addresses according to whatever collision resolution technique is being used.

Two-pass loading guarantees that every potential home address actually is a home address, so it solves the problem in the example. It does not guarantee that later deletions and additions will not re-create the same problem, however. As long as the file is used to store both home records and overflow records, there remains the problem of overflow records displacing new records that hash to an address occupied by an overflow record.

The methods used for handling these problems after initial loading are somewhat complicated and can, in a very volatile file, require many extra disk accesses. (For more information on techniques for maintaining pointers, see Knuth, 1973b and Bradley, 1982.) It would be nice if we could somehow altogether avoid this problem of overflow lists bumping into one another, and that is what the next method does.

10.8.3 Chaining with a Separate Overflow Area

One way to keep overflow records from occupying home addresses where they should not be is to move them all to a separate overflow area. Many hashing schemes are variations of this basic approach. The set of home addresses is called the prime data area, and the set of overflow addresses is called the overflow area. The advantage of this approach is that it keeps all unused but potential home addresses free for later additions.

In terms of the file we examined in the preceding section, the records for Cole, Dean, and Flint could have been stored in a separate overflow area rather than in potential home addresses for later-arriving records (Fig. 10.14). Now no problem occurs when a new record is added. If its home address has room, it is stored there. If not, it is moved to the overflow file, where it is added to the linked list that starts at the home address.

FIGURE 10.14 Chaining to a separate overflow area. Adams, Cole, and Flint are synonyms; Bates and Dean are synonyms.

If the bucket size for the primary file is large enough to prevent excessive numbers of overflow records, the overflow file can be a simple entry-sequenced file with a bucket size of one. Space can be allocated for overflow records only when it is needed.

The use of a separate overflow area simplifies processing somewhat and would seem to improve performance, especially when many additions and deletions occur. However, this is not always the case. If the separate overflow area is on a different cylinder than is the home address, every search for an overflow record will involve a very costly head movement. Studies show that actual access time is generally worse when overflow records are stored in a separate overflow area than when they are stored in the prime data area (Lum, 1971).

One situation in which a separate overflow area is required occurs when the packing density is greater than one, that is, when there are more records than home addresses. If, for example, it is anticipated that a file will grow beyond the capacity of the initial set of home addresses and that rehashing the file with a larger address space is not reasonable, then a separate overflow area must be used.

10.8.4 Scatter Tables: Indexing Revisited

Suppose you have a hash file that contains no records, only pointers to records. The file is obviously just an index that is searched by hashing rather than by some other method. The term scatter table (Severance, 1974) is often applied to this approach to file organization. Figure 10.15 illustrates the organization of a file using a scatter table.

The scatter table organization provides many of the same advantages simple indexing generally provides, with the additional advantage that the search of the index itself requires only one access. (Of course, that one access is one more than other forms of hashing require, unless the scatter table can be kept in primary memory.) The data file can be implemented in many different ways. For example, it can be a set of linked lists of synonyms (as shown in Fig. 10.15), a sorted file, or an entry-sequenced file. Also, scatter table organizations conveniently support the use of variable-length records. For more information on scatter tables, see Severance (1974) and Teorey and Fry (1982).

FIGURE 10.15 Example of a scatter table structure. Because the hashed part is an index, the data file may be organized in any way that is appropriate.
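A small in-memory sketch of a scatter table, with the data area kept as an entry-sequenced array of linked synonym lists, is shown below. The structure names, the sizes, and the choice of linked lists for the data area are assumptions for the example; as noted above, the data file could just as well be sorted or entry-sequenced without links.

    #include <string.h>

    #define TABLE_SIZE 100
    #define NIL (-1)

    /* The hashed part is only an index of pointers; the records themselves
       live in a separate, entry-sequenced data area.                        */
    typedef struct {
        char key[16];
        int  next;            /* next synonym in the data area, or NIL */
    } DATA_REC;

    static int      head[TABLE_SIZE];   /* the scatter table: one pointer per address */
    static DATA_REC data[1000];         /* entry-sequenced data area                  */
    static int      data_count;

    void scatter_init(void)
    {
        for (int i = 0; i < TABLE_SIZE; i++) head[i] = NIL;
        data_count = 0;
    }

    /* Add a record: append it to the data area and link it onto the front of
       the synonym list for its home address.                                  */
    void scatter_add(const char *key, int home)
    {
        strncpy(data[data_count].key, key, sizeof data[data_count].key - 1);
        data[data_count].key[sizeof data[data_count].key - 1] = '\0';
        data[data_count].next = head[home];
        head[home] = data_count++;
    }

    /* Find a record: one access to the scatter table, then only synonyms. */
    int scatter_find(const char *key, int home)
    {
        for (int i = head[home]; i != NIL; i = data[i].next)
            if (strcmp(data[i].key, key) == 0)
                return i;
        return NIL;
    }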

10.9 Patterns of Record Access

    Twenty percent of the fishermen catch 80 percent of the fish.
    Twenty percent of the burglars steal 80 percent of the loot.
                                                      L. M. Boyd

The use of different collision resolution techniques is not the only nor necessarily the best way of improving performance in a hashed file. If we know something about the patterns of record access, for example, then it is often possible to use simple progressive overflow techniques and still achieve very good performance.

Suppose you have a grocery store with 10,000 different categories of grocery items, and you have on your computer a hashed inventory file with a record for each of the 10,000 items that your company handles. Every time an item is purchased, the record that corresponds to that item must be accessed. Since the file is hashed, it is reasonable to assume that the 10,000 records are distributed randomly among the available addresses that make up the file. Is it equally reasonable to assume that the distribution of accesses to the records in the inventory is randomly distributed? Probably not. Milk, for example, will be retrieved very frequently, brie seldom.

There is a principle used by economists called the Pareto Principle, or The Concept of the Vital Few and the Trivial Many, which in file terms says that a small percentage of the records in a file account for a large percentage of the accesses. A popular version of the Pareto Principle is the 80/20 Rule of Thumb: 80% of the accesses are performed on 20% of the records. In our groceries file, milk would be among the 20% high-activity items, brie among the rest.

We cannot take advantage of the 80/20 principle in a file structure unless we know something about the probable distribution of record accesses. Once we have this information, we need to find a way to place the high-activity items where they can be found with as few accesses as possible. If, when items are loaded into a file, they can be loaded in such a way that the 20% (more or less) that are most likely to be accessed are loaded at or near their home addresses, then most of the transactions will access records that have short search lengths, so the effective average search length will be shorter than the nominal average search length that we defined earlier.

For example, suppose our grocery store's file handling program keeps track of the number of times each item is accessed during a one-month period. It might do this by storing with each record a counter that starts at zero and is incremented every time the item is accessed. At the end of the month the records for all the items in the inventory are dumped onto a file that is sorted in descending order according to the number of times they have been accessed. When the sorted file is rehashed and reloaded, the first records to be loaded are the ones that, according to the previous month's experience, are most likely to be accessed. Since they are the first ones loaded, they are also the ones most likely to be loaded into their home addresses. If reasonably sized buckets are used, there will be very few, if any, high-activity items that are not in their home addresses and therefore not retrievable in one access.
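The sort step in this scheme is ordinary: keep an access counter with each record and order the dump in descending count order before reloading. The sketch below shows just that step; the item names and counts are made up for the example, and the struct is an assumption rather than a prescribed record layout.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    typedef struct {
        char key[16];
        long access_count;    /* incremented every time the item is retrieved */
    } ITEM;

    /* qsort comparator: most frequently accessed items first */
    int by_activity(const void *a, const void *b)
    {
        long ca = ((const ITEM *) a)->access_count;
        long cb = ((const ITEM *) b)->access_count;
        return (cb > ca) - (cb < ca);
    }

    int main(void)
    {
        ITEM items[] = { { "brie", 12 }, { "milk", 4810 }, { "bread", 960 } };
        int  n = sizeof items / sizeof items[0];

        qsort(items, n, sizeof(ITEM), by_activity);
        /* Reloading the hashed file in this order gives the high-activity
           items first claim on their home addresses.                      */
        for (int i = 0; i < n; i++)
            printf("%s\n", items[i].key);
        return 0;
    }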

SUMMARY

There are three major modes for accessing files: sequentially, which provides O(N) performance; through tree structures, which can produce O(log_k N) performance; and directly. Direct access provides O(1) performance, which means that the number of accesses required to retrieve a record is constant and independent of the size of the file. Hashing is the primary form of organization used to provide direct access.

Hashing can provide faster access than most of the other organizations we study, usually with very little storage overhead, and it is adaptable to most types of primary keys. Ideally, hashing makes it possible to find any record with only one disk access, but this ideal is rarely achieved. The primary disadvantage of hashing is that hashed files may not be sorted by key.

Hashing involves the application of a hash function h(K) to a record key K to produce an address. The address is taken to be the home address of the record whose key is K, and it forms the basis for searching for the record. The addresses produced by hash functions generally appear to be random.

When two or more keys hash to the same address, they are called synonyms. If an address cannot accommodate all of its synonyms, collisions result. When collisions occur, some of the synonyms cannot be stored in the home address and must be stored elsewhere. Since searches for records begin with home addresses, searches for records that are not stored at their home addresses generally involve extra disk accesses. The term average search length is used to describe the average number of disk accesses that are required to retrieve a record. An average search length of 1 is ideal.

Much of the study of hashing deals with techniques for decreasing the number and effects of collisions. In this chapter we look at three general approaches to reducing the number of collisions:

Spreading out the records;

Using extra memory; and

Using buckets.

Spreading out the records involves choosing a hashing function that distributes the records at least randomly over the address space. A uniform distribution spreads out records evenly, resulting in no collisions. A random or nearly random distribution is much easier to achieve and is usually considered acceptable.

In this chapter a simple hashing algorithm is developed to demonstrate the kinds of operations that take place in a hashing algorithm. The three steps in the algorithm are:

1. Represent the key in numerical form;

2. Fold and add; and

3. Divide by the size of the address space, producing a valid address.

When we examine several different types of hashing algorithms, we see that sometimes algorithms can be found that produce better-than-random distributions. Failing this, we suggest some algorithms that generally produce distributions which are approximately random.

The Poisson distribution provides a mathematical tool for examining in detail the effects of a random distribution. Poisson functions can be used to predict the numbers of addresses likely to be assigned 0, 1, 2, and so on, records, given the number of records to be hashed and the number of available addresses. This allows us to predict the number of collisions likely to occur when a file is hashed, the number of overflow records likely to occur, and sometimes the average search length.

Using extra memory is another way to avoid collisions. When a fixed number of keys is hashed, the likelihood of synonyms occurring decreases as the number of possible addresses increases. Hence, a file organization that allocates many more addresses than are likely to be used has fewer synonyms than one that allocates few extra addresses. The term packing density describes the proportion of available address space that actually holds records. The Poisson function is used to determine how differences in packing density influence the percentage of records that are likely to be synonyms.

Using buckets is the third method for avoiding collisions. File addresses can hold one or more records, depending on how the file is organized by the file designer. The number of records that can be stored at a given address, called bucket size, determines the point at which records assigned to the address will overflow. The Poisson function can be used to explore the effects of variations in bucket sizes and packing densities. Large buckets, combined with a low packing density, can result in very small average search lengths.

Although we can reduce the number of collisions, we need some means to deal with collisions when they do occur. We examined one simple collision resolution technique in detail: progressive overflow. If an attempt to store a new record results in a collision, progressive overflow involves searching through the addresses that follow the record's home address in order until one is found to hold the new record. If a record is sought and is not found in its home address, successive addresses are searched until either the record is found or an empty address is encountered.

Progressive overflow is simple and sometimes works very well. Progressive overflow creates long search lengths, however, when the packing density is high and the bucket size is low. It also sometimes produces clusters of records, creating very long search lengths for new records whose home addresses are in the clusters.

Three problems associated with record deletion in hashed files are:

1. The possibility that empty slots created by deletions will hinder later searches for overflow records;

2. The need to recover space made available when records are deleted; and

3. The deterioration of average search lengths caused by empty spaces keeping records further from home than they need be.

The first two problems can be solved by using tombstones to mark spaces that are empty (and can be reused for new records) but should not halt a search for a record. Solutions to the deterioration problem include local reorganization, complete file reorganization, and the choice of a collision-resolving algorithm that does not cause deterioration to occur.

Because overflow records have a major influence on performance, many different overflow handling techniques have been proposed. Four such techniques that are appropriate for file applications are discussed briefly:

1. Double hashing reduces local clustering but may place some overflow records so far from home that they require extra seeks.

2. Chained progressive overflow reduces search lengths by requiring that only synonyms be examined when a record is being sought. For chained overflow to work, every address that qualifies as a home address for some record in the file must hold a home record. Mechanisms for making sure that this occurs are discussed.

3. Chaining with a separate overflow area simplifies chaining substantially and has the advantage that the overflow area may be organized in ways more appropriate to handling overflow records. A danger of this approach is that it might lose locality.

4. Scatter tables combine indexing with hashing. This approach provides much more flexibility in organizing the data file. A disadvantage of using scatter tables is that, unless the index can be held in RAM, it requires one extra disk access for every search.

Since in many cases certain records are accessed more frequently than others (the 80/20 rule of thumb), it is often worthwhile to take access patterns into account. If we can identify those records that are most likely to be accessed, we can take measures to make sure that they are stored closer to home than less frequently accessed records, thus decreasing the effective average search length. One such measure is to load the most frequently accessed records before the others.

KEY TERMS

Average search length. We define average search length as the sum of the number of accesses required for each record in the file divided by the number of records in the file. This definition does not take into account the number of accesses required for unsuccessful searches, nor does it account for the fact that some records are likely to be accessed more often than others. See 80/20 rule of thumb.

Better-than-random. This term is applied to distributions in which the records are spread out more uniformly than they would be if the hash function distributed them randomly. Normally, the distribution produced by a hash function is a little bit better than random.

Bucket. An area of space on the file that is treated as a physical record for storage and retrieval purposes but that is capable of storing several logical records. By storing and retrieving logical records in buckets rather than individually, access times can, in many cases, be improved substantially.

Collision. Situation in which a record is hashed to an address that does not have sufficient room to store the record. When a collision occurs, some means has to be found to resolve the collision.

Double hashing. A collision resolution scheme in which collisions are handled by applying a second hash function to the key to produce a number c, which is added to the original address (modulo the number of addresses) as many times as necessary until either the desired record is located or an empty space is found. Double hashing helps avoid some of the clustering that occurs with progressive overflow.

80/20 rule of thumb. An assumption that a large percentage (e.g., 80%) of the accesses are performed on a small percentage (e.g., 20%) of the records in a file. When the 80/20 rule applies, the effective average search length is determined largely by the search lengths of the more active records, so attempts to make these search lengths short can result in substantially improved performance.

Fold and add. A method of hashing in which the encodings of fixed-sized parts of a key are extracted (e.g., every two bytes) and are added. The resulting sum can be used to produce an address.

Hashing. A technique for generating a unique home address for a given key. Hashing is used when rapid access to a key (or its corresponding record) is required. In this chapter applications of hashing involve direct access to records in a file, but hashing is also often used to access items in arrays in RAM. In indexing, for example, an index might be organized for hashing rather than for binary search if extremely fast searching of the index is desired.

Home address. The address generated by a hash function for a given key. If a record is stored at its home address, then the search length for the record is one because only one access is required to retrieve the record. A record not at its home address requires more than one access to retrieve or store.

Indexed hash. Instead of using the results of a hash to produce the address of a record, the hash can be used to identify a location in an index that in turn points to the address of the record. Although this approach requires one extra access for every search, it makes it possible to organize the actual data records in a way that facilitates other types of processing, such as sequential processing.

Mid-square method. A hashing method in which a numeric representation of the key is squared and some digits from the middle of the result are used to produce the address.

Minimum hashing. Hashing scheme in which the number of addresses is exactly equal to the number of records. No storage space is wasted.

Open addressing. See progressive overflow.

Overflow. The situation that occurs when a record cannot be stored in its home address.

Packing density. The proportion of allocated file space that actually holds records. (Sometimes referred to as load factor.) If a file is half full, its packing density is 50%. The packing density and bucket size are the two most important measures in determining the likelihood of a collision occurring when searching for a record in a file.

Perfect hashing function. A hashing function that distributes records uniformly, minimizing the number of collisions. Perfect hashing functions are very desirable, but they are extremely difficult to find for large sets of keys.

Poisson distribution. Distribution generated by the Poisson function, which can be used to approximate the distribution of records among addresses if the distribution is random. A particular Poisson distribution depends on the ratio of the number of records to the number of available addresses. A particular instance of the Poisson function, p(x), gives the proportion of addresses that will have x keys assigned to them. See better-than-random.

Prime division. Division of a number by a prime number and use of the remainder as an address. If the address size is taken to be a prime number p, a large number can be transformed into a valid address by dividing it by p. In hashing, division by primes is often preferred to division by nonprimes because primes tend to produce more random remainders.

Progressive overflow. An overflow handling technique in which collisions are resolved by storing a record in the next available address after its home address. Progressive overflow is not the most efficient overflow handling technique, but it is one of the simplest and is adequate for many applications.

Randomize. To produce a number (e.g., by hashing) that appears to be random.

Synonyms. Two or more different keys that hash to the same address. When each file address can hold only one record, synonyms always result in collisions. If buckets are used, several records whose keys are synonyms may be stored without collisions.

Tombstone. A special marker placed in the key field of a record to mark it as no longer valid. The use of tombstones solves two problems associated with the deletion of records: The freed space does not break a sequential search for a record, and the freed space is easily recognized as available and may be reclaimed for later additions.

Uniform. Term applied to a distribution in which records are spread out evenly among addresses. Algorithms that produce uniform distributions are better than randomizing algorithms in that they tend to avoid the numbers of collisions that would occur by chance from a randomizing algorithm.

EXERCISES
1. Use the function hash(KEY, MAXAD) described in the text to answer the following questions.
a. What is the value of hash("Browns", 101)?
b. Find two different words of more than four characters that are synonyms.
c. It is assumed in the text that the function hash() does not need to generate an integer greater than 19,937. This could present a problem if we have a file with addresses larger than 19,937. Suggest some ways to get around this problem.

2. In understanding hashing, it is important to understand the relationships between the size of the available memory, the number of keys to be hashed, the range of possible keys, and the nature of the keys. Let us give names to these quantities, as follows:

M = the number of memory spaces available (each capable of holding one record);
r = the number of records to be stored in the memory spaces;
n = the number of unique home addresses produced by hashing the record keys; and
K = a key, which may be any combination of exactly five uppercase characters.

Suppose h(K) is a hash function that generates addresses between 0 and M - 1.
a. How many unique keys are possible? (Hint: If K were one uppercase letter, rather than five, there would be 26 possible unique keys.)
b. How are n and r related?
c. How are n, r, and M related?
d. If the function h were a minimum perfect hashing function, how would n, r, and M be related?

3. The following table shows distributions of keys resulting from three different hash functions on a file with 6,000 records and 6,000 addresses.

         Function A    Function B    Function C
d(0)        0.71          0.25          0.40
d(1)        0.05          0.50          0.36
d(2)        0.05          0.25          0.15
d(3)        0.05          0.00          0.05
d(4)        0.05          0.00          0.02
d(5)        0.04          0.00          0.01
d(6)        0.05          0.00          0.01
d(7)        0.00          0.00          0.00

a. Which of the three functions (if any) generates a distribution of records that is approximately random?
b. Which (if any) generates a distribution that is nearest to uniform?
c. Which (if any) generates a distribution that is worse than random?
d. Which function should be chosen?

4. There is a surprising mathematical result called the birthday paradox that says that if there are more than 23 people in a room, then there is a better than 50-50 chance that two of them have the same birthday. How is the birthday paradox illustrative of a major problem associated with hashing?

5. Suppose that 10,000 addresses are allocated to hold 8,000 records in a randomly hashed file and that each address can hold one record. Compute the following values:
a. The packing density for the file;
b. The expected number of addresses with no records assigned to them by the hash function;
c. The expected number of addresses with one record assigned (no synonyms);
d. The expected number of addresses with one record plus one or more synonyms;
e. The expected number of overflow records; and
f. The expected percentage of overflow records.

6. Consider the file described in the preceding exercise. What is the expected number of overflow records if the 10,000 locations are reorganized as
a. 5,000 two-record buckets; and
b. 1,000 10-record buckets?

7. Make a table showing Poisson function values for r/N = 0.1, 0.5, 0.8, 1, 2, 5, and 10. Examine the table and discuss any features and patterns that provide useful information about hashing.


8. There is an overflow handling technique called count-key progressive overflow (Bradley, 1982) that works on block-addressable disks as follows. Instead of generating a relative record number from a key, the hash function generates an address consisting of three values: a cylinder, a track, and a block number. The corresponding three numbers constitute the home address of the record.
Since block-organized drives (see Chapter 3) can often scan a track to find a record with a given key, there is no need to load a block into memory to find out whether or not it contains a particular record. The I/O processor can direct the disk drive to search a track for the desired record. It can even direct the disk to search for an empty record slot if a record is not found in its home position, effectively implementing progressive overflow.
a. What is it about this technique that makes it superior to progressive overflow techniques that might be implemented on sector-organized drives?
b. The main disadvantage of this technique is that it can be used only with a bucket size of 1. Why is this the case, and why is it a disadvantage?

9. In discussing implementation issues, we suggest initializing the data file by creating real records that are marked empty before loading the file with actual data. There are some good reasons for doing this. However, there might be some reasons not to do it this way. For example, suppose you want a hash file with a very low packing density and cannot afford to have the unused space allocated. How might a file management system be designed to work with a very large logical file, but allocate space only for those blocks in the file that actually contain data?

10. This exercise (inspired by an example in Wiederhold, 1983, p. 136) concerns the problem of deterioration. A number of additions and deletions are to be made to a file. Tombstones are to be used where necessary to preserve search paths to overflow records.
a. Show what the file looks like after the following operations, and compute the average search length. How has the use of tombstones caused the file to deteriorate? What would be the effect of reloading the remaining items in the file in the order Dean, Evans, Finch, Gates, Hart?

Operation      Home Address
Add Alan
Add Bates
Add Cole
Add Dean
Add Evans
Del Bates
Del Cole
Add Finch
Del Alan
Add Gates
Add Hart

b. What would be the effect of reloading the remaining items using two-pass loading?

11. Suppose you have a file in which 20% of the records account for 80% of the accesses, and that you want to store the file with a packing density of ___ and a bucket size of 5. When the file is loaded, you load the active 20% of the records first. After the active 20% of the records are loaded, and before the other records are loaded, what is the packing density of the partially filled file? Using this packing density, compute the percentage of the active 20% which would be overflow records. Comment on the results.

12. In our computations of average search lengths, we consider only the time it takes for successful searches. If our hashed file were to be used in such a way that searches were often made for items that are not in the file, it would be useful to have statistics on average search length for an unsuccessful search. If a large percentage of searches to a hashed file are unsuccessful, how do you expect this to affect overall performance if overflow is handled by
a. Progressive overflow; or
b. Chaining to a separate overflow area?
(See Knuth, 1973b, pp. 535-539, for a treatment of these differences.)

13. Although hashed files are not generally designed to support access to records in any sorted order, there may be times when batches of transactions need to be performed on a hashed data file. If the data file is sorted (rather than hashed), these transactions are normally carried out by some sort of cosequential process, which means that the transaction file also has to be sorted. If the data file is hashed, the transaction file might also be presorted, but on the basis of the home addresses of its records rather than some more "natural" criterion.
Suppose you have a file whose records are usually accessed directly, but that is periodically updated from a transaction file. List the factors you would have to consider in deciding between using an indexed sequential organization and hashing. (See Hanson, 1982, pp. 280-285, for a discussion of these issues.)

14. We assume throughout this chapter that a hashing program should be able to tell correctly whether a given key is located at a certain address. If this were not so, there would be times when we would assume that a record exists when in fact it does not, a seemingly disastrous result. But consider what Doug McIlroy did in 1978 when he was designing a spelling checker program. He found that by letting his program allow one out of every 4,000 misspelled words to sneak by as valid (and using a few other tricks), he could fit a 75,000-word spelling dictionary into 64 K of RAM, thereby improving performance enormously.
McIlroy was willing to tolerate one undetected misspelled word out of every 4,000 because he observed that drafts of papers rarely contained more than 20 errors, so one could expect at most one out of every 200 runs of the program to fail to detect a misspelled word. Can you think of some other cases where it might be reasonable to report that a key exists when in fact it does not?
Jon Bentley (1985) provides an excellent account of McIlroy's program, plus several insights on the process of solving problems of this nature. D. J. Dodds (1982) discusses this general approach to hashing, called check-hashing. Read Bentley's and Dodds's articles, and report on them to your class. Perhaps they will inspire you to write a spelling checker.

Programming Exercises
15. Implement and test a version of the function hash().

16. Create a hashed file with one record for every city in California. The key in each record is to be the name of the corresponding city. (For the purposes of this exercise, there need be no fields other than the key field.)
Begin by creating a sorted list of the names of all of the cities and towns in California. (If time or space is limited, just make a list of names starting with the letter 'S'.)
a. Examine the sorted list. What patterns do you notice that might affect your choice of a hash function?
b. Implement the function hash() in such a way that you can alter the number of characters that are folded. Assuming a packing density of 1, hash the entire file several times, each time folding a different number of characters, and producing the following statistics for each run:
- The number of collisions; and
- The number of addresses assigned 0, 1, 2, . . . , 10, and 10-or-more records.
Discuss the results of your experiment in terms of the effects of folding different numbers of characters, and how they compare to the results you might expect from a random distribution.
c. Implement and test one or more of the other hashing methods described in the text, or use a method of your own invention.

17. Using some set of keys, such as the names of California towns, do the following:
a. Write and test a program for loading the keys into three different hash files using bucket sizes of 1, 2, and 5, respectively, and a packing density of 0.8. Use progressive overflow for handling collisions.
b. Have your program maintain statistics on the average search length, the maximum search length, and the percentage of records that are overflow records.
c. Assuming a Poisson distribution, compare your results with the expected values for average search length and the percentage of records that are overflow records.

18. Repeat exercise 17, but use double hashing to handle overflow.

19. Repeat exercise 17, but handle overflow using chained overflow into a separate overflow area. Assume that the packing density is the ratio of the number of keys to available home addresses.

20. Write a program that can perform insertions and deletions in the file created in the previous problem using a bucket size of 5. Have the program keep running statistics on average search length. (You might also implement a mechanism to indicate when search length has deteriorated to a point where the file should be reorganized.) Discuss in detail the issues you have to confront in deciding how to handle insertions and deletions.

FURTHER READINGS

There are a number of good surveys of hashing and issues related to hashing generally, including Knuth (1973b), Severance (1974), Maurer (1975), and Sorenson, Tremblay, and Deutscher (1978). Textbooks concerned with file design generally contain substantial amounts of material on hashing, and they often provide extensive references for further study. Each of the following can be useful:
Hanson (1982) is filled with analytical and experimental results exploring all of the issues we introduce, and many more, and also contains a good chapter on comparing different file organizations.
Bradley (1982) covers hashing generally but also includes much information on programming for hashed files using IBM PL/I. Loomis (1983) also covers hashing generally, with additional emphasis on programming for hashed files in COBOL.
Teorey and Fry (1982) and Wiederhold (1983) will be useful to practitioners interested in analyses of trade-offs among the basic hashing methods.
One of the applications of hashing that has stimulated a great deal of interest recently is the development of spelling checkers. Because of special characteristics of spelling checkers, the types of hashing involved are quite different from the approaches we describe in this text. Papers by Bentley (1985) and Dodds (1982) provide entry into the literature on this topic. (See also exercise 14.)

Extendible Hashing

11

CHAPTER OBJECTIVES

Describe the problem solved by extendible hashing and related approaches.
Explain how extendible hashing works; show how it combines tries with conventional, static hashing.
Show how to implement extendible hashing, including deletion.
Review studies of extendible hashing performance.
Examine alternative approaches to the same problem, including dynamic hashing, linear hashing, and hashing schemes that control splitting by allowing for overflow buckets.

CHAPTER OUTLINE

11.1 Introduction
11.2 How Extendible Hashing Works
     11.2.1 Tries
     11.2.2 Turning the Trie into a Directory
     11.2.3 Splitting to Handle Overflow
11.3 Implementation
     11.3.1 Creating the Addresses
     11.3.2 Implementing the Top-level Operations
     11.3.3 Bucket and Directory Operations
     11.3.4 Implementation Summary
11.4 Deletion
     11.4.1 Overview of the Deletion Process
     11.4.2 A Procedure for Finding Buddy Buckets
     11.4.3 Collapsing the Directory
     11.4.4 Implementing the Deletion Operations
     11.4.5 Summary of the Deletion Operation
11.5 Extendible Hashing Performance
     11.5.1 Space Utilization for Buckets
     11.5.2 Space Utilization for the Directory
11.6 Alternative Approaches
     11.6.1 Dynamic Hashing
     11.6.2 Linear Hashing
     11.6.3 Approaches to Controlling Splitting

11.1 Introduction

In Chapter 8 we began with a historical review of the work that led up to B-trees. B-trees are such an effective solution to the problems that stimulated their development that it is easy to wonder if there is any more important thinking to be done about file structures. Work on extendible hashing during the late 1970s and early 1980s shows that the answer to that question is yes. This chapter tells the story of that work and describes some of the file structures that emerge from it.
B-trees do for secondary storage what AVL trees do for storage in memory: They provide a way of using tree structures that works well with dynamic data. By dynamic we mean that we add and delete records from the data set. The key feature of both AVL trees and B-trees is that they are self-adjusting structures that include mechanisms to maintain themselves. As we add and delete records, the tree structures use limited, local restructuring to ensure that the additions and deletions do not degrade performance beyond some predetermined level.
Robust, self-adjusting data and file structures are critically important to dynamic data storage and retrieval. Judging from the historical record, they are also hard to develop. It was not until 1963 that Adel'son-Vel'skii and Landis developed a self-adjusting structure for tree storage in memory, and it took another decade of work before computer scientists found, in B-trees, a tree structure that works well on secondary storage.
B-trees provide O(log_k N) access to the keys in a file. Hashing, when there is no overflow, provides access to a record with a single seek. But as a file grows larger, the need to look for records that overflow their buckets degrades performance. For dynamic files that undergo a lot of growth, the performance of a static hashing system such as we described in Chapter 10 is typically worse than the performance of a B-tree. So, by the late 1970s, after the initial burst of new research and design work revolving around B-trees was over, a number of researchers began to work on finding ways to modify hashing so that it, too, could be self-adjusting as files grow and shrink. As often happens when a number of groups are working on the same problem, several different, yet essentially similar, approaches emerged to extend hashing to dynamic files. We begin our discussion of the problem by looking closely at the approach called "extendible hashing" described by Fagin, Nievergelt, Pippenger, and Strong (1979). Later in this chapter we compare this approach with others that emerged over the last decade.

11.2 How Extendible Hashing Works

11.2.1 Tries

The key idea behind extendible hashing is to combine conventional hashing with another retrieval approach called the trie. (The word trie is pronounced so that it rhymes with sky.) Tries are also sometimes referred to as radix searching because the branching factor of the search tree is equal to the number of alternative symbols (the radix of the alphabet) that can occur in each position of the key. A few examples will illustrate how this works.
Suppose we want to build a trie that stores the keys able, abrahms, adams, anderson, andrews, and baird. A schematic form of the trie is shown in Fig. 11.1. As you can see, the searching proceeds letter by letter through the key. Since there are 26 symbols in the alphabet, the potential branching factor at every node of the search is 26. If we used the digits 0-9 as our search alphabet, rather than the letters a-z, the radix of the search would be reduced to 10. A search tree using digits might look like the one shown in Fig. 11.2.

FIGURE 11.1 Radix 26 trie that indexes names according to the letters of the alphabet.

FIGURE 11.2 Radix 10 trie that indexes numbers according to the digits they contain.

Notice that in searching a trie we sometimes use only a portion of the key. We use more of the key as we need more information to complete the search. This use-more-as-we-need-more capability is fundamental to the structure of extendible hashing.
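To make the letter-by-letter search concrete, here is one minimal way a radix-26 trie node might be declared and searched in C. This sketch is ours, not part of the text's design (the field names and the convention that only leaves carry a key are assumptions), and the chapter deliberately avoids storing tries this way, for the reasons given in the next section.

    #include <stddef.h>

    #define RADIX 26                        /* branching factor for a-z        */

    /* One node of a radix-26 trie: an internal node branches on the
    ** next letter of the key; a leaf carries the key it indexes.      */
    struct trie_node {
        struct trie_node *child[RADIX];     /* child[c - 'a'] follows letter c */
        char             *key;              /* non-NULL only at a leaf         */
    };

    /* Descend letter by letter, using only as much of the key as is
    ** needed to reach a leaf -- the "use more as we need more" idea.  */
    struct trie_node *trie_search(struct trie_node *node, const char *key)
    {
        while (node != NULL && node->key == NULL && *key != '\0') {
            node = node->child[*key - 'a'];
            key++;
        }
        return node;        /* a leaf, or NULL if the path dies out    */
    }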

11.2.2 Turning the Trie into a Directory

We use tries with a radix of two in our approach to extendible hashing: Search decisions are made on a bit-by-bit basis. Furthermore, since we are retrieving from secondary storage, we will not work in terms of individual keys, but in terms of buckets containing keys, just as in conventional hashing. Suppose we have bucket A containing keys that, when hashed, have hash addresses that begin with the bits 01. Bucket B contains keys with hash addresses beginning with 10, and bucket C contains keys with addresses that start with 11. Figure 11.3 shows a trie that allows us to retrieve these buckets.
How should we represent the trie? If we represent it as a tree structure, we are forced to do a number of comparisons as we descend the tree. Even worse, if the trie becomes so large that it, too, is stored on disk, we are faced once again with all of the problems associated with storing trees on disk. We might as well go back to B-trees and forget about extendible hashing.
So, rather than representing the trie as a tree, we flatten it into an array of contiguous records, forming a directory of hash addresses and pointers to the corresponding buckets. The first step in turning a tree into an array involves extending it so it is a complete binary tree with all of its leaves at the same level, as shown in Fig. 11.4(a). Even though the initial 0 is enough to select bucket A, the new form of the tree also uses the second address bit so both alternatives lead to the same bucket. Once we have extended the tree this way, we can collapse it into the directory structure shown in Fig. 11.4(b). Now we have a structure that provides the kind of direct access associated with hashing: Given an address beginning with the bits 10, the entry in directory position 10 (binary) gives us a pointer to the associated bucket.

FIGURE 11.3 Radix 2 trie that provides an index to buckets.

FIGURE 11.4 The trie from Fig. 11.3 transformed first into a complete binary tree, and then flattened into a directory to the buckets.
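A minimal C sketch may help make the flattened directory concrete. It assumes the directory has already been read into memory as an array of bucket references and that make_address() (developed later, in section 11.3.1) returns the first DIR_DEPTH bits of the hashed key; the names below are our own, not a fixed interface.

    /* A directory cell holds nothing but a reference (here, a relative
    ** record number) to a bucket stored in the bucket file.            */
    typedef struct {
        long bucket_ref;            /* RRN of the bucket record on disk */
    } dir_cell;

    /* Hypothetical globals, loaded by an ex_init()-style routine.      */
    extern dir_cell *directory;     /* array of 2^dir_depth cells       */
    extern int       dir_depth;     /* bits of address currently in use */

    extern int make_address(const char *key, int depth);   /* Fig. 11.9 */

    /* Direct lookup: the leading dir_depth bits of the hashed address
    ** are themselves the subscript of the directory cell.              */
    long find_bucket_ref(const char *key)
    {
        int address = make_address(key, dir_depth);
        return directory[address].bucket_ref;
    }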

11.2.3 Splitting to Handle Overflow

A key issue in any hashing system is what happens when a bucket overflows. The goal in an extendible hashing system is to find a way to increase the address space in response to overflow, rather than responding by creating long sequences of overflow records and buckets that have to be searched linearly.
Suppose we insert records that cause bucket A in Fig. 11.4(b) to overflow. In this case the solution is simple: Since addresses beginning with 00 and 01 are mixed together in bucket A, we can split bucket A by putting all the 01 addresses in a new bucket D, while keeping only the 00 addresses in A. Put another way, we already have two bits of address information but are throwing one away as we access bucket A. So, now that bucket A is overflowing, we must use the full two bits to divide the addresses between two buckets. We do not need to extend the address space; we simply make full use of the address information that we already have. Figure 11.5 shows the directory and buckets after the split.
Let's consider a more complex case. Starting once again with the directory and buckets in Fig. 11.4(b), suppose that bucket B overflows. How do we split bucket B and where do we attach the new bucket after the split? Unlike our previous example, we do not have additional, unused bits of address space that we can press into duty as we split the bucket. We now need to use three bits of the hash address in order to divide up the records that hash to bucket B. The trie illustrated in Fig. 11.6(a) makes the distinctions required to complete the split. Figure 11.6(b) shows what this trie looks like once it is extended into a completely full binary tree with all leaves at the same level, and Fig. 11.6(c) shows the collapsed, directory form of the trie.

FIGURE 11.5 The directory from Fig. 11.4(b) after bucket A overflows.

FIGURE 11.6 The results of an overflow of bucket B in Fig. 11.4(b), represented first as a trie, then as a complete binary tree, and finally as a directory.

By building on the trie's ability to extend the amount of information used in a search, we have doubled the size of our address space (and, therefore, of our directory), extending it from 2^2 to 2^3 cells. This ability to grow (or shrink) the address space gracefully is what extendible hashing is all about.
We have been concentrating on the contribution that tries make to extendible hashing; one might well ask where the actual hashing comes into play. Why not just use the tries on the bits in the key itself, splitting buckets and extending the address space as necessary? The answer to this question grows out of hashing's most fundamental characteristic: A good hash function produces a nearly uniform distribution of keys across an address space. Notice that the trie shown in Fig. 11.6 is poorly balanced, resulting in a directory that is twice as big as it actually needs to be. If we had an uneven distribution of addresses that placed even more records in buckets B and D without using other parts of the address space, the situation would get even worse. By using a good hash function to create addresses with a nearly uniform distribution, we avoid this problem.

11.3 Implementation

11.3.1 Creating the Addresses

Now that we have a high-level overview of how extendible hashing works, let's look at the pseudocode that describes the algorithms in more detail. The place to start is with the functions that create the addresses, since the notion of an extendible address underlies all other extendible hashing operations.
The hash function itself, described in Fig. 11.7, is a simple variation on the fold-and-add hashing algorithm we used in Chapter 10. The only difference is that we do not conclude the operation by returning the remainder of the folded address divided by the address space. We don't need to do that, since in extendible hashing we don't have a fixed address space; instead we use as much of the address as we need. The division that we do perform in this function, when we take the sum of the folded character values modulo 19,937, is to make sure that the character summation stays within the range of a signed 16-bit integer. For machines that use 32-bit integers, we could divide by a larger number and create an even larger initial address.
Since extendible hashing uses more bits of the hashed address as they are needed to distinguish between buckets, we need a make_address function that extracts just a portion of the full hashed address. We also use the make_address function to reverse the order of the bits in the hashed address, making the lowest-order bit of the hash address the highest-order bit of the value used in extendible hashing.

FUNCTION hash(KEY)
    set SUM to 0
    set J to 0
    set LEN to the length of the key
    if LEN is odd, concatenate a blank to the key
        to make the length even
    while (J < LEN)
        SUM := (SUM + 100*KEY[J] + KEY[J+1]) mod 19937
        increment J by 2
    endwhile
    return SUM
end FUNCTION

FIGURE 11.7 Function hash(KEY) returns an integer hash value for KEY for a 15-bit address space.

To see why this reversal of bit order is desirable, look at Fig. 11.8, which is a set of keys and binary hash addresses produced by our hash function. Even a quick scan of these addresses reveals that the distribution of the least significant bits of these integer values tends to have more variation than the high-order bits. This is because many of the addresses do not make use of the upper reaches of our address space; the high-order bits often turn out to be zero. By reversing the bit order, working from right to left, we take advantage of the greater variability of low-order bit values. For example, given a four-bit address space, we want to avoid having the addresses of bill, lee, and pauline turn out to be 0000, 0000, and 0000. If we work from right to left, starting with the low-order bit in each address, we get 0011 for bill, 0001 for lee, and 1010 for pauline, which is a much more useful result.
The make_address function, described in Fig. 11.9, accomplishes this bit extraction and reversal. The DEPTH argument tells the function the number of address bits to return.

FIGURE 11.8 Output from the hash function for a number of keys (bill, lee, pauline, alan, julie, mike, elizabeth, and mark), shown in binary form.

FUNCTION make_address(KEY, DEPTH)
    set RETVAL to 0        /* accumulates the reversed bit string           */
    set MASK to 1          /* 0...001 mask to extract the low bit from N    */
    HASH_VAL := hash(KEY)

    /* Summary of loop logic:
    ** Shift RETVAL one position left, to make room for a new low bit.
    ** Then move HASH_VAL's low bit to RETVAL's low bit. Then shift
    ** HASH_VAL in the opposite direction, to the right, so we can look
    ** at the next lowest bit. Keep doing this until we have moved as
    ** many bits as we need from HASH_VAL to RETVAL in reverse order.
    */
    for J := 1 to DEPTH
        RETVAL   := RETVAL left shifted one position
        LOWBIT   := HASH_VAL bitwise ANDed with MASK
        RETVAL   := RETVAL bitwise ORed with LOWBIT
        HASH_VAL := HASH_VAL right shifted one position
    next J

    return RETVAL
end FUNCTION

FIGURE 11.9 Function make_address(KEY, DEPTH) gets a hashed address, reverses the order of the bits, and returns an address of DEPTH bits.
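For readers who prefer working code to pseudocode, here is one possible C rendering of hash() and make_address(). It is only a sketch that follows Figs. 11.7 and 11.9; nothing about it (the int types, the treatment of an odd-length key) should be read as the only correct implementation.

    #include <string.h>

    /* Fold-and-add hash of Fig. 11.7: add the characters in pairs,
    ** keeping the running sum below 19,937 so it stays well within
    ** the range of a signed 16-bit integer.                          */
    int hash(const char *key)
    {
        int sum = 0;
        size_t len = strlen(key);

        for (size_t j = 0; j < len; j += 2) {
            int first  = key[j];
            int second = (j + 1 < len) ? key[j + 1] : ' ';  /* pad odd key with a blank */
            sum = (sum + 100 * first + second) % 19937;
        }
        return sum;
    }

    /* Bit extraction and reversal of Fig. 11.9: peel DEPTH bits off the
    ** low end of the hash value and push them onto the low end of the
    ** result, so the lowest-order hash bit becomes the highest-order
    ** bit of the extendible hashing address.                          */
    int make_address(const char *key, int depth)
    {
        int retval   = 0;
        int hash_val = hash(key);

        for (int j = 0; j < depth; j++) {
            retval   = retval << 1;         /* make room for a new low bit */
            retval  |= (hash_val & 1);      /* copy the current lowest bit */
            hash_val = hash_val >> 1;       /* move on to the next lowest  */
        }
        return retval;
    }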

FIGURE 11.10 BUCKET and DIRECTORY_CELL record structures.

Record Type: BUCKET
    DEPTH      integer count of the number of bits used
               "in common" by the keys in this bucket
    COUNT      integer count of the number of keys in
               the bucket
    KEY[1 .. MAX_BUCKET_SIZE]
               array of strings to hold keys

Record Type: DIRECTORY_CELL
    BUCKET_REF relative record number or other reference
               to a specific BUCKET record on disk

11.3.2 Implementing the Top-level Operations

Our extendible hashing scheme consists of a set of buckets and a directory that references them. Each bucket is a record that contains the information shown in Fig. 11.10. These bucket records are stored in a file; we retrieve them as necessary.
Each cell in the directory consists of a reference to a BUCKET record. Because we use direct access to find directory records, we implement the directory as an array of these cells in RAM. The address values returned by make_address are treated as subscripts for this array, ranging from 0 to one less than the directory size.
From the high-level view of the driver function, use of the system consists of an initialization step that reads the directory into RAM from disk, a set of calls from the user to find or add keys, and a final, closing step that writes the possibly modified directory back to disk. Pseudocode for the driver, initialization, and close functions is shown in Fig. 11.11.
FIGURE 11.11 The driver, ex_init, and ex_close functions provide a high-level view of the extendible hashing program operation.

FUNCTION driver()
    ex_init()
    call op_add() and op_find() as directed by the user
    ex_close()
end FUNCTION

FUNCTION ex_init()
    open (or create, as necessary) the directory
        and bucket files
    if the hash file already exists
        read directory records into the array DIRECTORY
        DIR_DEPTH := log2(size of DIRECTORY)
    else
        allocate an initial directory consisting of a
            single cell
        set DIR_DEPTH to 0
        allocate an initial bucket and assign its address
            to the directory cell
    endif
end FUNCTION

FUNCTION ex_close()
    write the directory back to disk
    close files
end FUNCTION

Note that DIR_DEPTH is directly related to the size of the directory, since

    2^DIR_DEPTH = the number of cells in the DIRECTORY.

If we are starting a new hash file, the DIR_DEPTH is zero, which means that we are using no bits to distinguish between addresses; all the keys go into the same bucket, no matter what their address. We get the address of the initial, everything-goes-here bucket and assign it to the single directory cell.

Given a way to open and close the file, we are ready to add records. The op_add and op_find functions are outlined in Fig. 11.12.
The op_find function turns the key into a directory address. Given this address, we do a direct lookup of the bucket location, retrieve the bucket and assign it to FOUND_BUCKET, and then search for the key. If FOUND_BUCKET contains the key, we return success; otherwise, we return FAILURE.
The op_add function begins by calling op_find. If the key already exists in the hash file, op_add returns immediately; if the key is not found, op_add calls bk_add_key to insert it.

11.3.3 Bucket and Directory Operations

When op_add calls bk_add_key, it passes a bucket and a key. If the bucket is not full, bk_add_key (Fig. 11.13) simply inserts the key into the bucket. If the bucket is full, however, it requires a split, which is where things start to get interesting.

What we do when we split a bucket depends on the relationship between the number of address bits used in the bucket and the number used in the directory as a whole. The two numbers are often not the same. To see this, look at Fig. 11.6(a). The directory uses three bits to define its address space (8 cells). The keys in bucket A are distinguished from keys in other buckets by having an initial 0 bit. All the other bits in the hashed key values in bucket A can be any value; it is only the first bit that matters. Bucket A is using only one bit. The keys in bucket C all share a common first two bits; they all begin with 11. The keys in buckets B and D use three bits to establish their identities and, therefore, their bucket locations. If you look at Fig. 11.6(c), you can see how using more or fewer address bits changes the relationship between the directory and the bucket. Buckets that do not use as many address bits as the directory have more than one directory cell pointing to them.
If we split one of the buckets that is using fewer address bits than the directory, and which therefore is referenced from more than one directory cell, we can use half of the directory cells to point to the new bucket after the split.

FUNCTION op_add(KEY)
    /* if we find the key, we do not add a second copy */
    if op_find(KEY, FOUND_BUCKET)
        return FAILURE

    /* otherwise, add the key to the bucket that op_find
    ** retrieved
    */
    bk_add_key(FOUND_BUCKET, KEY)
    return SUCCESS
end FUNCTION

FUNCTION op_find(KEY, FOUND_BUCKET)
    /* uses DIR_DEPTH, the number of bits used to create
    ** the addresses in the directory
    */

    /* create an address based on directory depth */
    ADDRESS := make_address(KEY, DIR_DEPTH)

    /* get the bucket that will contain the key, if the
    ** key exists in the file
    */
    FOUND_BUCKET := bucket referenced by
                    DIRECTORY[ADDRESS].BUCKET_REF

    if FOUND_BUCKET contains the KEY
        return SUCCESS
    else
        return FAILURE
end FUNCTION

FIGURE 11.12 op_add and op_find functions.

FUNCTION bk_add_key(BUCKET, KEY)
    if (BUCKET.COUNT < MAX_BUCKET_SIZE)
        add the key
    else
        bk_split(BUCKET)
        op_add(KEY)
    endif
end FUNCTION

FIGURE 11.13 The bk_add_key function adds the key to the existing bucket if there is room. If the bucket is full, it splits the bucket and then adds the key.

FUNCTION bk_split(BUCKET)
    /* if the depth used for the BUCKET addresses is
    ** already the same as the address depth in the
    ** directory, we must first split the directory
    ** to double the directory address space
    */
    if (BUCKET.DEPTH == DIR_DEPTH)
        dir_double()

    allocate NEW_BUCKET

    /* find the range of directory entries for the new
    ** bucket, given the depth and keys in the old bucket
    */
    find_new_range(BUCKET, NEW_START, NEW_END)

    /* insert the new bucket over this range */
    dir_ins_bucket(NEW_BUCKET, NEW_START, NEW_END)

    /* change the address depths in the buckets to
    ** reflect the split
    */
    increment BUCKET.DEPTH
    NEW_BUCKET.DEPTH := BUCKET.DEPTH

    redistribute the keys between the two buckets
end FUNCTION

FIGURE 11.14 bk_split function divides keys between an existing bucket and a new bucket. If necessary, it doubles the size of the directory to accommodate the new bucket.

Suppose, for example, that we split bucket A in Fig. 11.6(c). Before the split only one bit, the initial zero, is used to identify keys that belong in bucket A. After the split, we use two bits. Keys starting with 00 (directory cells 000 and 001) go in bucket A; keys starting with 01 (directory cells 010 and 011) go in the new bucket. We do not have to expand the directory because the directory already has the capacity to keep track of the additional address information required for the split.
If, on the other hand, we split a bucket that has the same address depth as the directory, such as buckets B or D in Fig. 11.6(c), then there are no additional directory entries that we can use to reference the new bucket. Before we can split the bucket, we have to double the size of the directory, creating a new directory entry for every one that is currently there, so we can accommodate the new address information.

Figure 11.14 shows the bucket-splitting logic in pseudocode. First we compare the number of bits used for the directory with the number used for the bucket to determine whether we need to double the directory. If the depths are the same, we double the directory before proceeding. Next we allocate the new bucket that we need for the split. Then we find the range of directory addresses that we will use for the new bucket. For instance, when we split bucket A in Fig. 11.6(c), the range of directory addresses for the new bucket is from 010 to 011. We attach the new bucket to the directory over this range, adjust the bucket address depth information in both buckets to reflect the use of an additional address bit, and then redistribute the keys from the original bucket across the two buckets.
The most complicated operation supporting the bk_split function is find_new_range, which finds the range of directory cells that should point to the new bucket instead of the old one after the split. It is described in pseudocode in Fig. 11.15. To see how it works, return, once again, to Fig. 11.6(c).
FIGURE 11.15 The find_new_range function finds the start and end directory addresses for the new bucket by using information from the old bucket.

FUNCTION find_new_range(OLD_BUCKET, NEW_START, NEW_END)
    /* find the shared address for the OLD bucket */
    SHARED_ADDRESS := make_address(any KEY from
                      OLD_BUCKET, OLD_BUCKET.DEPTH)

    /* shift everything one bit to the left, then put a 1
    ** in the lowest bit. This is the shared address
    ** for the new bucket. Fill the new shared address on
    ** the right with zero bits until we have reached the
    ** directory depth. This is the start of the range.
    ** Fill it with 1 bits -- this is the range's end.
    */
    NEW_SHARED := SHARED_ADDRESS left shifted one place
    NEW_SHARED := NEW_SHARED bitwise ORed with 1
    BITS_TO_FILL := DIR_DEPTH - (OLD_BUCKET.DEPTH + 1)
    set NEW_START and NEW_END to the NEW_SHARED value
    for J := 1 to BITS_TO_FILL
        NEW_START := NEW_START left shifted one place
        NEW_END   := NEW_END left shifted one place
        NEW_END   := NEW_END bitwise ORed with 1
    next J
end FUNCTION

Assume that we need to split bucket A, putting some of the keys into a new bucket E. Before the split, any address beginning with a 0 leads to A. In other words, the shared address of the keys in bucket A is 0. When we split bucket A we add another address bit to the path leading to the keys; addresses leading to bucket A now share an initial 00 while those leading to E share an 01. So, the range of addresses for the new bucket is all directory addresses beginning with 01.

FIGURE 11.16 Directory operations to support bk_split: the dir_double and dir_ins_bucket functions.

FUNCTION dir_double()
    /* calculate the current size and new size */
    CURRENT_SIZE := 2^DIR_DEPTH
    NEW_SIZE := 2 * CURRENT_SIZE

    allocate memory for the new, larger directory --
        temporarily call it NEW_DIR

    /* Transfer the bucket addresses from the old
    ** directory to the new one. Each cell in the
    ** original is copied into two cells of the
    ** expanded directory.
    */
    for I := 0 to CURRENT_SIZE - 1
        NEW_DIR[2*I].BUCKET_REF   := DIRECTORY[I].BUCKET_REF
        NEW_DIR[2*I+1].BUCKET_REF := DIRECTORY[I].BUCKET_REF
    next I

    free memory for old DIRECTORY
    rename NEW_DIR to DIRECTORY
    increment DIR_DEPTH
end FUNCTION

FUNCTION dir_ins_bucket(BUCKET_ADDRESS, START, LAST)
    for J := START to LAST
        DIRECTORY[J].BUCKET_REF := BUCKET_ADDRESS
    next J
end FUNCTION

Since the directory addresses use three bits, the new bucket is attached to the directory cells starting with 010 and ending with 011.
Suppose that the directory used a five-bit address instead of a three-bit address. Then the range for the new bucket would start with 01000 and would end with 01111. This range covers all five-bit addresses that share 01 as the first two bits. The logic for finding the range of directory addresses for the new bucket, then, starts by finding the shared address bits for the new bucket. It then fills the address out with zeroes until we have the number of bits used in the directory. This is the start of the range. Filling the address out with ones produces the end of the range.
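As a cross-check on that description, here is a small C sketch of the same range computation. It assumes the old bucket's shared address and the two depth values are already known (for the three-bit example above, shared_address = 0, bucket_depth = 1, dir_depth = 3); the function and variable names are ours.

    /* Compute the span of directory cells that should point to the new
    ** bucket created by a split. Following Fig. 11.15: append a 1 bit to
    ** the old bucket's shared address, then pad with 0s for the start of
    ** the range and with 1s for the end.                                 */
    void find_new_range(int shared_address, int bucket_depth, int dir_depth,
                        int *new_start, int *new_end)
    {
        int new_shared   = (shared_address << 1) | 1;  /* new bucket's shared bits */
        int bits_to_fill = dir_depth - (bucket_depth + 1);
        int start = new_shared;
        int end   = new_shared;

        for (int j = 0; j < bits_to_fill; j++) {
            start = start << 1;             /* pad with 0 bits */
            end   = (end << 1) | 1;         /* pad with 1 bits */
        }
        *new_start = start;
        *new_end   = end;
    }

    /* Splitting bucket A of Fig. 11.6(c): shared address 0, bucket depth 1,
    ** directory depth 3 yields the range 010 (2) through 011 (3). With a
    ** five-bit directory the same call yields 01000 (8) through 01111 (15). */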

The directory operations required to support bk_split are easy to implement. They are outlined in pseudocode in Fig. 11.16. The first, dir_double, simply calculates the new directory size, allocates the required memory, and writes the information from each old directory cell into two successive cells in the new directory. It finishes by freeing the old space associated with the name DIRECTORY, renaming the new space as the DIRECTORY, and increasing the DIR_DEPTH value to reflect the fact that the directory is now using an additional address bit.
The dir_ins_bucket function, used to attach a bucket address across a range of directory cells, is simply a loop that works through the cells to make the change.
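A C sketch of the doubling step, under the same assumptions as the earlier lookup sketch (an in-memory array of dir_cell records and a global dir_depth), might look like the following; the error handling and disk I/O a real system would need are omitted.

    #include <stdlib.h>

    typedef struct { long bucket_ref; } dir_cell;

    extern dir_cell *directory;     /* 2^dir_depth cells, in memory */
    extern int       dir_depth;

    /* Double the directory: each old cell is copied into two adjacent
    ** cells of the new array, so both new addresses still reach the
    ** same bucket until some bucket actually splits.                  */
    int dir_double(void)
    {
        long current_size = 1L << dir_depth;
        dir_cell *new_dir = malloc(2 * current_size * sizeof(dir_cell));

        if (new_dir == NULL)
            return -1;                      /* allocation failed */

        for (long i = 0; i < current_size; i++) {
            new_dir[2 * i]     = directory[i];
            new_dir[2 * i + 1] = directory[i];
        }

        free(directory);
        directory = new_dir;
        dir_depth++;
        return 0;
    }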

11.3.4 Implementation Summary

Now that we have assembled all of the pieces necessary to add records to an extendible hashing system, let's see how the pieces work together.
The op_add function manages record addition. If the key already exists, op_add returns immediately. If the key does not exist, op_add calls bk_add_key, passing it the bucket into which the key is to be added. If bk_add_key finds that there is still room in the bucket, it adds the key and the operation is complete. If the bucket is full, bk_add_key calls bk_split to handle the task of splitting the bucket.
The bk_split function starts by determining whether the directory is large enough to accommodate the new bucket. If the directory needs to be larger, bk_split calls a function that doubles the directory size. The bk_split function then allocates a new bucket, attaches it to the appropriate directory cells, and divides the keys between the two buckets.
When bk_add_key regains control after bk_split has allocated a new bucket, it calls op_add to try to place the key into the new, revised directory structure. The op_add function, of course, calls bk_add_key again, recursively. This cycle continues until there is a bucket that can accommodate the new key.

11.4 Deletion
11.4.1 Overview of the Deletion Process

If extendible hashing is to be a truly dynamic system, like B-trees or AVL trees, it must be able to shrink files gracefully as well as grow them. When we delete a key, we need a way to see if we can decrease the size of the file system by combining buckets and, if possible, decreasing the size of the directory.
As with any dynamic system, the important question during deletion concerns the definition of the triggering condition: When do we combine buckets? This question, in turn, leads us to ask, Which buckets can be combined? For B-trees the answer involves determining whether buckets are siblings and whether they are at the leaf level of the tree. In extendible hashing we use a similar concept: buckets that are buddy buckets.
Look again at the trie in Fig. 11.6(b). Which buckets could be combined? Trying to combine anything with bucket A would mean collapsing everything else in the trie first. Similarly, there is no single bucket that could be combined with bucket C. But buckets B and D are in the same configuration as buckets that have just split. They are ready to be combined; they are buddy buckets. We will take a closer look at the question of finding buddy buckets as we consider implementation of the deletion procedure; for now let's assume that we combine buckets B and D.
After combining buckets, we examine the directory to see if we can make changes there. Looking at the directory form of the trie in Fig. 11.6(c), we see that once we combine buckets B and D, directory entries 100 and 101 both point to the same bucket. In fact, each of the buckets has at least a pair of directory entries pointing to it. In other words, none of the buckets requires the depth of address information that is currently available in the directory. That means that we can shrink the directory and reduce the address space to half its size.
Reducing the size of the address space restores the directory and bucket structure to the arrangement shown in Fig. 11.4, before the additions and splits that produced the structure in Fig. 11.6(c). Reduction consists of collapsing each adjacent pair of directory cells into a single cell. This is easy, since both cells in each pair point to the same bucket. Note that this is nothing more than a reversal of the directory splitting procedure that we use when we need to add new directory cells.

11.4.2 A Procedure for Finding Buddy Buckets

Given this overview of how deletion works, we begin by focusing on buddy buckets. Given a bucket, how do we find its buddy? Figure 11.17 describes the procedure in pseudocode.

FUNCTION bk_find_buddy(BUCKET)
    /* NOTE: this function uses DIR_DEPTH -- we
    ** assume this value is available globally or
    ** through a function call
    */

    /* There is no buddy if the DIR_DEPTH is 0 (there
    ** is just a single bucket)
    */
    if (DIR_DEPTH == 0)
        return NO_BUDDY

    /* unless the bucket has the same depth as the
    ** directory, there is no single bucket to pair with
    */
    if (BUCKET.DEPTH < DIR_DEPTH)
        return NO_BUDDY

    /* find the shared address for this bucket */
    SHARED_ADDRESS := make_address(any KEY from
                      BUCKET, BUCKET.DEPTH)

    /* flip the last bit -- that is the address of the
    ** buddy bucket
    */
    BUDDY_ADDRESS := SHARED_ADDRESS exclusive ORed with 1

    return BUDDY_BUCKET found at BUDDY_ADDRESS
end FUNCTION

FIGURE 11.17 The bk_find_buddy function returns a buddy bucket or the special signal NO_BUDDY if none is found.

The procedure works by checking to see whether it is possible for there to be a buddy bucket. Clearly, if the directory depth is zero, meaning that there is only a single bucket, there cannot be a buddy.
The next test compares the number of bits used by the bucket with the number of bits used in the directory address space. A pair of buddy buckets is a set of buckets that are immediate descendents of the same node in the trie. They are, in fact, pairwise siblings resulting from a split. Going back to Fig. 11.6(b), we see that asking whether the bucket uses all the address bits used in the directory is another way of asking whether the bucket is at the lowest level of the trie. It is only when a bucket is at the outer edge of the trie that it can have a single parent and a single buddy.
Once we determine that there is a buddy bucket, we need to find its address. First we find the address used to find the bucket we have at hand; this is the shared address of the keys in the bucket. Since we know that the buddy bucket is the other bucket that was formed from a split, we know that the buddy has the same address in all regards except for the last bit. Once again, this relationship is illustrated in Fig. 11.6(b) by buckets B and D. So, to get the buddy address, we flip the last bit. We return the buddy bucket.
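The heart of the procedure is a single bit operation. Under the same naming assumptions as the earlier sketches, the address arithmetic looks like this in C:

    extern int make_address(const char *key, int depth);   /* see Fig. 11.9 */

    /* The buddy of a bucket is the bucket whose shared address differs
    ** only in the last bit, so flipping that bit with an exclusive OR
    ** yields the buddy's address. Valid only when the bucket uses as
    ** many address bits as the directory (bucket_depth == dir_depth).  */
    int buddy_address(const char *any_key_in_bucket, int bucket_depth)
    {
        int shared_address = make_address(any_key_in_bucket, bucket_depth);
        return shared_address ^ 1;          /* flip the low-order bit */
    }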

11.4.3 Collapsing the Directory

The other important support function used to implement deletion is the function that handles collapsing the directory. Downsizing the directory is one of the principal potential benefits of deleting records. In our implementation we use one function to check to see whether downsizing is possible and, if it is, to actually collapse the directory. Figure 11.18 shows pseudocode for this function, called dir_try_collapse().
The function begins by making sure that we are not at the lower limit of directory size. By treating the special case of a directory with a single cell here, at the start of the function, we simplify subsequent processing: With the exception of this case, all directory sizes are evenly divisible by two.
The actual test for the COLLAPSE_CONDITION consists of examining each pair of directory entries. We assume at the outset that we can collapse the directory and then look for a pair of directory cells that do not both point to the same bucket. As soon as we find such a pair, we know that we cannot collapse the directory. We set the value of the COLLAPSE_CONDITION to false and break out of the test loop. If we get all the way through the directory without encountering such a pair, then we can collapse the directory.
The actual collapsing operation consists of allocating space for a new directory that is half the size of the original and then copying the bucket references shared by each cell pair to a single cell in the new directory.

11.4.4 Implementing the Deletion Operations

Now that we have an approach to the two critical support operations for deletion, finding buddy buckets and collapsing the directory, we are ready to construct the higher levels of the deletion operation.
The highest-level deletion operation, op_del, is very simple. We first try to find the key to be deleted. If we cannot find it, we return failure; if we do find it, we call a service function to remove the key from the bucket.

FUNCTION dir_try_collapse()
    /* the directory is already at minimum size when
    ** the depth is zero
    */
    if (DIR_DEPTH == 0)
        return FAILURE

    /* check each pair of directory cells to see whether
    ** each member references the same bucket -- if so,
    ** we can collapse the directory.
    */
    DIR_SIZE := 2^DIR_DEPTH
    COLLAPSE_CONDITION := TRUE   /* assume the best, then try to disprove it */
    for J := 0 to DIR_SIZE - 1 by 2
        if (DIRECTORY[J].BUCKET_REF != DIRECTORY[J+1].BUCKET_REF)
            COLLAPSE_CONDITION := FALSE
            break out of the loop
        endif
    next J

    /* if we have a collapse condition, create a new
    ** directory that is half the size of the original,
    ** and transfer the bucket references
    */
    if (COLLAPSE_CONDITION)
        NEW_DIR_SIZE := DIR_SIZE / 2
        allocate memory for NEW_DIR
        for J := 0 to NEW_DIR_SIZE - 1
            NEW_DIR[J].BUCKET_REF := DIRECTORY[2*J].BUCKET_REF
        next J
        free memory for old DIRECTORY
        rename NEW_DIR to DIRECTORY
        decrement DIR_DEPTH
    endif

    return COLLAPSE_CONDITION
end FUNCTION

FIGURE 11.18 The dir_try_collapse function first tests to see whether the directory can be collapsed. If the test succeeds, the directory is collapsed.

FUNCTION op_del(KEY)
    if (op_find(KEY, FOUND_BUCKET) == FAILURE)
        return FAILURE

    /* found it -- now delete it */
    return (bk_del_key(FOUND_BUCKET, KEY))
end FUNCTION

FUNCTION bk_del_key(BUCKET, KEY)
    set KEY_REMOVED to FALSE

    look for KEY in BUCKET -- if found
        remove the KEY
        set KEY_REMOVED to TRUE
        decrement BUCKET.COUNT

    /* if a key was removed, see whether we can combine
    ** this bucket with its buddy bucket
    */
    if (KEY_REMOVED)
        bk_try_combine(BUCKET)
        return SUCCESS
    else
        return FAILURE
    endif
end FUNCTION

FIGURE 11.19 The op_del and bk_del_key functions.

We return the value reported back from the service function. Figure 11.19 describes op_del and the service function, bk_del_key, in pseudocode.
The bk_del_key function does its work in two steps. The first step consists of finding the key and physically removing it from the bucket. The second step, which takes place only if a key is removed, consists of calling bk_try_combine to see if deleting the key has decreased the size of the bucket enough to allow us to combine it with its buddy.
Figure 11.20 shows the pseudocode for bk_try_combine and its service function, bk_combine. Note that when we combine buckets, we reduce the address depth associated with the bucket: Combining buckets means that we use one less address bit to differentiate keys.
After combining the buckets, we call dir_try_collapse to see if the decrease in the number of buckets enables us to decrease the size of the directory. If we do, in fact, collapse the directory (dir_try_collapse succeeds), bk_try_combine calls itself recursively. Collapsing the directory may have created a new buddy for the BUCKET; it may be possible to do even more combination and collapsing. Typically, this recursive combining and collapsing happens only when the directory has a number of empty buckets that are awaiting changes in the directory structure that finally produce a buddy to combine with.

FIGURE 11.20 The bk_try_combine function tests to see whether a bucket can be combined with its buddy. If the test succeeds, bk_try_combine calls bk_combine to do the actual combination.

FUNCTION bk_try_combine(BUCKET)
    /* if there is no buddy, return right away */
    BUDDY := bk_find_buddy(BUCKET)
    if (BUDDY == NO_BUDDY)
        return

    /* see if we can combine buckets */
    if (BUDDY.COUNT + BUCKET.COUNT <= MAX_BUCKET_SIZE)
        bk_combine(BUCKET, BUDDY)
        free memory used by the BUDDY bucket
        reassign the DIRECTORY value for the BUDDY so
            that it now references the BUCKET

        /* see if we can collapse the directory -- if so,
        ** there may be a new buddy to combine with
        */
        if (dir_try_collapse())
            bk_try_combine(BUCKET)
    endif
end FUNCTION

FUNCTION bk_combine(BUCKET, BUDDY)
    for J := 1 to BUDDY.COUNT
        increment BUCKET.COUNT
        BUCKET[BUCKET.COUNT].KEY := BUDDY[J].KEY
    next J
    decrement BUCKET.DEPTH
end FUNCTION

11.4.5 Summary of the Deletion Operation

Deletion begins with a call to op_del that passes the key that is to be deleted. If the key cannot be found, there is nothing to delete. If the key is found, the bucket containing the key is passed to bk_del_key.
The bk_del_key function deletes the key and then passes the bucket on to bk_try_combine to see if the smaller size of the bucket will now permit combination with a buddy bucket. The bk_try_combine function first checks to see if there is a buddy bucket. If not, we are done. If there is a buddy, and if the sum of the keys in the bucket and its buddy is less than or equal to the size of a single bucket, we combine the buckets.
The elimination of a bucket through combination might enable collapsing of the directory to half its size. We investigate this possibility by calling dir_try_collapse. If collapsing succeeds, we may have a new buddy bucket, and so bk_try_combine calls itself again, recursively.

If the

11.5 Extendible Hashing Performance


Extendible hashing is an elegant solution to the problem of extending and contracting the address space for a hash file as the file itself grows and shrinks. How well does it work? As always, the answer to this question must consider the trade-off between time and space.
The time dimension is easy to handle: If the directory for extendible hashing can be kept in RAM, a single access is all that is ever required to retrieve a record. If the directory is so large that it must be paged in and out of RAM, two accesses may be necessary. The important point is that extendible hashing provides O(1) performance: Since there is no overflow, these access time values are truly independent of the size of the file.
Questions about space utilization for extendible hashing are more complicated than questions about access time. We need to be concerned about two uses of space: the space for the buckets and the space for the directory.

11.5.1 Space Utilization for Buckets

In their original paper describing extendible hashing, Fagin, Nievergelt, Pippenger, and Strong include analysis and simulation of extendible hashing performance. Both the analysis and simulation show that the space utilization is strongly periodic, fluctuating between values of 0.53 and 0.94. The analysis portion of their paper suggests that for a given number of

records r and a block size of b, the average number of blocks N is approximated by the formula

    N = r / (b ln 2).

Space utilization, or packing density, is defined as the ratio of the actual number of records to the total number of records that could be stored in the allocated space:

    Utilization = r / (bN).

Substituting the approximation for N gives us

    Utilization = ln 2 = 0.69.
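As a quick check on these formulas (the numbers here are ours, not from the original paper): with r = 10,000 records and a block size of b = 20, the approximation gives N = 10,000 / (20 x 0.693), or about 722 blocks, and the resulting utilization is 10,000 / (20 x 722), or about 0.69.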

expect average utilization of 69%. In Chapter

space utilization for B-trees,

we found

8,

where we looked

that simple B-trees tend to

have

at
a

of about 67%, but this can be increased to over 85% by


redistributing keys during insertion, rather than just splitting when a page
utilization

is

full.

So, B-trees tend to use less space than simple extendible hashing,

typically at a cost of requiring a

The average

few extra

seeks.

space utilization for extendible hashing

is

only part of the

story; the other part relates to the periodic nature of the variations in space
utilization.

It

turns out that if

we

have keys with randomly distributed


fill up at about

addresses, the buckets in the extendible hashing table tend to

the

same time and therefore tend

to split at the

same

time. This explains the

As the buckets
up, space utilization
can reach past 90%. This is followed by a concentrated series of splits that
reduce the utilization to below 50% As these now nearly half-full buckets
fill up again, the cycle repeats itself.

large fluctuations in space utilization.

fill
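As a quick check of this estimate, here is a small C fragment (our own illustration; the values of r and b are arbitrary examples) that computes the expected number of buckets and the resulting packing density:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double r = 1000000.0;               /* number of records (example)  */
        double b = 100.0;                   /* records per bucket (example) */
        double N = r / (b * log(2.0));      /* expected number of buckets   */
        double utilization = r / (b * N);   /* packing density              */
        printf("buckets = %.0f, utilization = %.2f\n", N, utilization);
        return 0;                           /* prints utilization = 0.69    */
    }

Whatever values we choose for r and b, the utilization works out to ln 2, or about 69%.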

11.5.2 Space Utilization for the Directory

The directory used in extendible hashing grows by doubling its size. A prudent designer setting out to implement an extendible hashing system will want assurance that this doubling levels off for reasonable bucket sizes, even when the number of keys is quite large. Just how large a directory should we expect to have, given an expected number of keys?

Flajolet (1983) addressed this question in a lengthy, carefully developed paper that produces a number of different ways to estimate the directory size. Table 11.1, which is taken from Flajolet's paper, shows the expected value for the directory size for different numbers of keys and different bucket sizes.

TABLE 11.1 Expected directory size for a given bucket size b and total number of records r. From Flajolet, 1983. [The table gives expected directory sizes for bucket sizes b = 10, 20, 50, 100, and 200 and for record counts r = 10^3 through 10^7; the expected size ranges from well under a thousand directory entries for small files with large buckets up to on the order of 10^8 entries for 10^7 records stored in buckets of size 10.]

Flajolet also provides the following formula for making rough estimates of the directory size for values that are not in this table. He notes that this formula tends to overestimate directory size by a factor of 2 to 4.

    Estimated directory size = (3.92 / b) r^(1 + 1/b)
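For a back-of-the-envelope bound that does not depend on the coefficients in Flajolet's formula, remember that the directory always has at least one cell per bucket and that its size is always a power of two. The C sketch below (our own illustration) computes that minimum figure from the expected bucket count r/(b ln 2); Flajolet's estimates are larger because a single bucket that splits repeatedly forces the entire directory to double:

    #include <math.h>

    /* Smallest possible directory for r records in buckets holding b records:
       the next power of two at or above the expected number of buckets.     */
    long min_directory_size(double r, double b)
    {
        double buckets = r / (b * log(2.0));   /* expected bucket count */
        long   dir = 1;
        while (dir < (long) ceil(buckets))
            dir *= 2;
        return dir;
    }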

11.6 Alternative Approaches
11.6.1 Dynamic Hashing
In 1978, before Fagin, Nievergelt, Pippenger, and Strong produced their paper on extendible hashing, Larson published a paper describing a scheme called dynamic hashing. Functionally, dynamic hashing and extendible hashing are very similar. Both use a directory to track the addresses of the buckets, and both extend the directory through the use of tries.

The key difference between the approaches is that dynamic hashing, like conventional, static hashing, starts with a hash function that covers an address space of a fixed size. As buckets within that fixed address space overflow, they split, forming the leaves of a trie that grows down from the original address node. Eventually, after enough additions and splitting, the buckets are addressed through a forest of tries that have been seeded out of the original static address space.

Let's look at an example. Figure 11.21(a) shows an initial address space of four, and four buckets descending from the four addresses in the directory.

FIGURE 11.21 The growth of the index in dynamic hashing.

In Fig. 11.21(b) we have split the bucket at address 4. We address the two buckets resulting from the split as 40 and 41. We change the shape of the directory node at address 4 from a square to a circle because it has changed from an external node, referencing a bucket, to an internal node that points to two child nodes.

In Fig. 11.21(c) we split the bucket addressed by node 2, creating the new external nodes 20 and 21. We also split the bucket addressed by 41, extending the trie downward to include 410 and 411. Because the directory node 41 is now an internal node rather than an external one, it changes from a square to a circle. As we continue to add keys and split buckets, these directory tries continue to grow.

Finding a key in a dynamic hashing scheme can involve the use of two hash functions, rather than just one. First, there is the hash function that covers the original address space. If you find that the directory node is an external node, and therefore points to a bucket, the search is complete. However, if the directory node is an internal node, then you need additional address information to guide you through the ones and zeroes that form the trie. Larson suggests using a second hash function on the key and using the result of this hashing as the seed for a random-number generator that produces a sequence of ones and zeroes for the key. This sequence describes the path through the trie.
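To make the two-step search concrete, here is a minimal sketch in C (our own illustration; the node layout is a stand-in, and the two hashed values are assumed to have been computed already by whatever hash functions an implementation chooses):

    #include <stdlib.h>

    struct node {                      /* a node in one of the directory tries */
        int          is_external;      /* external nodes reference a bucket    */
        long         bucket_ref;       /* where the bucket lives on disk       */
        struct node *child[2];         /* used only when the node is internal  */
    };

    /* h1 is the key's address in the original, fixed address space; h2 is the
       result of the second hash function, used to seed the bit sequence that
       steers the search down the trie.                                        */
    long find_bucket(long h1, long h2, struct node **original_space)
    {
        struct node *n = original_space[h1];
        srand((unsigned) h2);                 /* per-key sequence of 1s and 0s */
        while (!n->is_external)
            n = n->child[rand() & 1];         /* follow the next bit of the path */
        return n->bucket_ref;
    }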

It is interesting to compare dynamic hashing and extendible hashing. A brief, but illuminating, characterization of the similarities and differences is that, while both schemes extend the hash function locally, as a binary search trie, in order to handle overflow, dynamic hashing expresses the extended directory as a linked structure while extendible hashing expresses it as a perfect tree, which is in turn expressible as an array. Because of this fundamental similarity, it is not surprising that the space utilization within the buckets is the same (69%) for both approaches. Moreover, since the directories are essentially equivalent, and are just expressed differently, it follows that the estimates of directory depth developed by Flajolet (1983) apply equally well to dynamic hashing and extendible hashing. (In section 11.5.2 we talk about estimates for the directory size for extendible hashing, but we know that in extendible hashing directory depth = log2 directory size.)

The primary difference between the two approaches is that dynamic hashing allows for slower, more gradual growth of the directory, whereas extendible hashing extends the directory by doubling it. However, because the directory nodes in dynamic hashing must be capable of holding pointers to children, the actual size of a node in dynamic hashing is larger than a directory cell in extendible hashing, probably by at least a factor of two. So, the directory for dynamic hashing will usually require more space in memory. Moreover, if the directory becomes so large that it requires use of virtual memory, extendible hashing offers the advantage of being able to access the directory with no more than a single page fault. Since dynamic hashing uses a linked structure for the directory, it may be necessary to incur more than one page fault to move through the directory.

11.6.2 Linear Hashing

The key feature of both extendible hashing and dynamic hashing is that they use a directory to direct access to the actual buckets containing the key records. This directory makes it possible to expand and modify the hashed address space without expanding the number of buckets: After expanding the directory, more than one directory node can point to the same bucket. However, the directory adds an additional layer of indirection which, if the directory must be stored on disk, can result in an additional seek.

Linear hashing, introduced by Litwin in 1980, does away with the directory. An example, developed in Fig. 11.22, shows how linear hashing works.

FIGURE 11.22 The growth of address space in linear hashing. Adapted from Enbody and Du (1988).

This example is adapted from a description of linear hashing developed by Enbody and Du (1988). Linear hashing, like extendible hashing, uses more bits of the hashed value as the address space grows. The example begins (Fig. 11.22a) with an address space of four, which means that we are using an address function that produces addresses with two bits of depth. In terms of the pseudocode that we developed earlier in this chapter, we are calling make_address with a key and a second argument of 2. For this example we will refer to this as the h2(k) address function. Note that the address space consists of four buckets, rather than four directory nodes that can point to buckets.

As we add records, bucket b overflows. The overflow forces a split. However, as Fig. 11.22(b) shows, it is not bucket b that splits, but bucket a. The reason for this is that we are extending the address space linearly, and bucket a is the next bucket that must split to create the next linear extension, which we call bucket A. A three-bit hash function, h3(k), is applied to buckets a and A to divide the records between them. Since bucket b was not the bucket that we split, the overflowing record is placed into an overflow bucket w.

We add more records, and bucket d overflows. Bucket b is the next one to split and extend the address space, so we use the h3(k) address function to divide the records from bucket b and its overflow bucket w between b and the new bucket B. The record overflowing bucket d is placed in an overflow bucket x. The resulting arrangement is illustrated in Fig. 11.22(c).

Figure 11.22(d) shows what happens when, as we add more records, bucket d overflows beyond the capacity of its overflow bucket. Bucket c is the next in the extension sequence, so we use the h3(k) address function to divide the records between c and the new bucket C. The record overflowing bucket d is placed in a second overflow bucket, y.

Finally, in Fig. 11.22(e), assume that another bucket overflows; the overflow record is placed in the overflow bucket z. The overflow also triggers the extension to bucket D, dividing the contents of d, x, and y between buckets d and D. At this point all of the buckets use the h3(k) address function, and we have finished the expansion cycle. The pointer for the next bucket to be split returns to bucket a to get ready for a new cycle that will use an h4(k) address function to reach new buckets.

Since linear hashing uses two hash functions to reach the buckets during an expansion cycle, an hd(k) function for the buckets at the current address depth and an hd+1(k) function for the expansion buckets, finding a record requires knowing which function to use. If p is the pointer to the address of the next bucket to be split and extended, then the procedure for finding the address of the bucket containing key k is as follows:

    if hd(k) >= p
        address := hd(k)
    else
        address := hd+1(k)
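As a concrete illustration (our own sketch, not the chapter's pseudocode), the address computation can be expressed in C as follows. Here we realize hd(k) as the low-order d bits of a hashed key; the chapter's make_address function would serve equally well, as long as the same function is used consistently for both depths:

    /* Given the hashed key, the current address depth d, and the split
       pointer p, return the bucket address, following the procedure above. */
    long linear_hash_address(unsigned long hashed_key, int d, long p)
    {
        long addr = (long)(hashed_key % (1UL << d));       /* h_d(k)        */
        if (addr >= p)               /* this bucket has not yet been split  */
            return addr;
        return (long)(hashed_key % (1UL << (d + 1)));      /* h_d+1(k)      */
    }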

Litwin (1980) shows that the access time performance of linear hashing is quite good. There is no directory to access or maintain, and since we extend the address space through splitting every time there is an overflow, the overflow chains do not become very large. Given a bucket size of 50, the average number of disk accesses per search approaches very close to one. Space utilization, on the other hand, is lower than it is for extendible hashing or dynamic hashing, averaging around only 60%.

11.6.3 Approaches to Controlling Splitting

We know from Chapter 8 that we can increase the storage capacity of B-trees by implementing measures that tend to postpone splitting, redistributing keys between pages rather than splitting pages. We can apply similar logic to the hashing schemes introduced in this chapter, placing records in chains of overflow buckets to postpone splitting.

Since linear hashing has the lowest storage utilization of the schemes introduced here, and since it already includes logic to handle overflow buckets, it is an attractive candidate for use of controlled splitting logic. In its uncontrolled-splitting form, linear hashing splits a bucket and extends the address space every time any bucket overflows. This choice of a triggering event for splitting is arbitrary, particularly when we consider that the bucket that splits is typically not the bucket that overflows. Litwin (1980) suggests using the overall load factor of the file as an alternative triggering event. Suppose we let the buckets overflow until the space utilization reaches some desired figure, such as 75%. Every time the utilization exceeds that figure, we split a bucket and extend the address space. Litwin simulated this kind of system and found that for load factors of 75% and even 85%, the average number of accesses for successful and unsuccessful searches still stays below 2.
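The test that replaces "split on every overflow" can be stated in a few lines of C. In this sketch (our own illustration) the load factor counts only the primary buckets; whether overflow buckets are included in the denominator is a design choice the implementer must make:

    /* Return nonzero when the bucket at the split pointer should be split,
       using the overall load factor rather than individual overflows.      */
    int should_split(long records, long primary_buckets, int bucket_size,
                     double max_load)
    {
        double load = (double) records /
                      ((double) primary_buckets * bucket_size);
        return load > max_load;          /* e.g., max_load = 0.75 or 0.85 */
    }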

We can also use overflow buckets to defer splitting and increase space utilization for dynamic hashing and extendible hashing. For these methods, which use directories to the buckets, deferring splitting has the additional attraction of keeping the directory size down. For extendible hashing it is particularly advantageous to chain to an overflow bucket, and therefore avoid a split, when the split would cause the directory to double in size. Consider the example that we used early in this chapter, where we split bucket B in Fig. 11.4(b), producing the expanded directory and bucket structure shown in Fig. 11.6(c). If we had allowed bucket B to overflow instead, we could have retained the smaller directory. Depending on how much space we allocated for the overflow buckets, we might also have improved space utilization among the buckets. The cost of these improvements, of course, is a potentially greater search length due to the overflow chains.

Studies of the effects of different overflow bucket sizes and chaining mechanisms have supported a small industry of academic research during the early and mid-1980s. Larson (1978) suggested the use of deferred splitting in his original paper on dynamic hashing but found the results of some preliminary simulations of the idea to be disappointing. Scholl (1981) developed a refinement of this idea in which overflow buckets are shared. Master's thesis research by Chang (1985) tested Scholl's suggestions empirically and found that it was possible to achieve storage utilization of about 81% while maintaining search performance in the range of 1.1 seeks per search. Veklerov (1985) suggested using buddy buckets for overflow rather than allocating chains of new buckets. This is an attractive suggestion, since splitting buckets without buddies can never cause a doubling of the directory in extendible hashing. Veklerov obtained storage utilization of about 76% with a bucket size of 8.

SUMMARY

Conventional, static hashing does not adapt well to file structures that are dynamic, that grow and shrink over time. Extendible hashing is one of several hashing systems that allow the address space for hashing to grow and shrink along with the file. Because the size of the address space can grow as the file grows, it is possible for extendible hashing to provide hashed access without the need for overflow handling, even as files grow many times beyond their original expected size.

The key to extendible hashing is the idea of using more bits of the hashed value as we need to cover more address space. The model for extending the use of the hashed value is the trie: Every time we use another bit of the hashed value, we have added another level to the depth of a trie with a radix of two.

In extendible hashing we fill out all the leaves of the trie until we have a perfect tree, and then we collapse that tree into a one-dimensional array. The array forms a directory to the buckets, kept on disk, that actually hold the keys and records. The directory is managed in RAM, if possible.

If we add a record and there is no room for it in a bucket, we split the bucket. We use one additional bit from the hash values for the keys in the bucket to divide the keys between the old bucket and the new one. If the address space represented in the directory can cover the use of this new bit, no more changes are necessary. If, however, the address space is using fewer bits than are needed by our splitting buckets, then we double the address space to accommodate the use of the new bit.

Deletion reverses the addition process, recognizing that it is possible to combine the records for two buckets only if they are buddy buckets, which is to say that they are the pair of buckets that resulted from a split.

Access performance for extendible hashing is a single seek if the directory can be kept in RAM. If the directory must be paged off to disk, worst-case performance is two seeks. Space utilization for the buckets is approximately 69%. Tables and an approximation formula developed by Flajolet (1983) permit estimation of the probable directory size, given a bucket size and total number of records.

There are a number of other approaches to the problem solved by extendible hashing. Dynamic hashing uses a very similar approach but expresses the directory as a linked structure rather than as an array. The linked structure is more cumbersome but grows more smoothly. Space utilization and seek performance for dynamic hashing are the same as for extendible hashing.

Linear hashing does away with the directory entirely, extending the address space by adding new buckets in a linear sequence. Although the overflow of a bucket can be used to trigger extension of the address space in linear hashing, typically the bucket that overflows is not the one that is split and extended. Consequently, linear hashing implies maintaining overflow chains and a consequent degradation in seek performance. The degradation is slight, since the chains typically do not grow to be very long before they are pulled into a new bucket. Space utilization is about 60%.

Space utilization for extendible, dynamic, and linear hashing can be improved by postponing the splitting of buckets. This is easy to implement for linear hashing, since there are already overflow buckets. Using deferred splitting, it is possible to increase space utilization for any of the hashing schemes described here to 80% or better while still maintaining search performance averaging less than two seeks. Overflow handling for these approaches can use the sharing of overflow buckets.

KEY TERMS

Buddy bucket. Given a bucket with an address uvwxy, where u, v, w, x, and y have values of either 0 or 1, the buddy bucket, if it exists, has the value uvwxz, such that z = y XOR 1. Buddy buckets are important in deletion operations for extendible hashing since, if enough keys are deleted, the contents of buddy buckets can be combined into a single bucket.

Deferred splitting. It is possible to improve space utilization for dynamic hashing, extendible hashing, and linear hashing by postponing, or deferring, the splitting of buckets, placing records into overflow buckets instead. This is a classic space/time trade-off in which we accept diminished performance in return for more compact storage.

Directory. Conventional, static hashing schemes transform a key into a bucket address. Both extendible hashing and dynamic hashing introduce an additional layer of indirection, in which the key is hashed to a directory address. The directory, in turn, contains information about the location of the bucket. This additional indirection makes it possible to extend the address space by extending the directory, rather than having to work with an address space made up of buckets.

Dynamic hashing. Used in a generic sense, dynamic hashing can refer to any hashing system that provides for expansion and contraction of the address space for dynamic files where the number of records changes over time. In this chapter we use the term in a more specific sense to refer to a system initially described by Larson (1978). The system uses a directory to provide access to the buckets that actually contain the records. Entries in the directory can be used as root nodes of trie structures that accommodate greater numbers of buckets as buckets split.

Extendible hashing. Like dynamic hashing, extendible hashing is sometimes used to refer to any hashing scheme that allows the address space to grow and shrink so it can be used in dynamic file systems. Used more precisely, as it is used in this chapter, extendible hashing refers to an approach to hashed retrieval for dynamic files that was first proposed by Fagin, Nievergelt, Pippenger, and Strong (1979). Their proposal is for a system that uses a directory to represent the address space. Access to buckets containing the records is through the directory. The directory is handled as an array; the size of the array can be doubled or halved as the number of buckets changes.

Linear hashing. An approach to hashing for dynamic files that was first proposed by Litwin (1980). Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory. Instead, the actual address space is extended one bucket at a time as buckets overflow. Because the extension of the address space does not necessarily correspond to the bucket that is overflowing, linear hashing necessarily involves the use of overflow buckets, even as the address space expands.

Splitting. The hashing schemes described in this chapter make room for new records by splitting buckets to form new buckets, and then extending the address space to cover these buckets. Conventional, static hashing schemes rely strictly on overflow buckets without extending the address space.

Trie. A search tree structure in which each successive character of the key is used to determine the direction of the search at each successive level of the tree. The branching factor (the radix of the trie) at any level is potentially equal to the number of values that the character can take.

EXERCISES

1. Briefly describe the differences between extendible hashing, dynamic hashing, and linear hashing. What are the strengths and weaknesses of each approach?

2. The tries that are the basis for the extendible hashing procedure described in this chapter have a radix of two. How does performance change if we use a larger radix?

3. In the make_address function, what would happen if we did not reverse the order of the bits but just extracted the required number of low-order bits in the same left-to-right order that they occur in the address? Think about the way the directory location would change as we extend the implicit trie structure to use yet another bit.

4. If the language that you are using to implement the make_address function does not support bit shifting and masking operations, how could you achieve the same ends, even if less elegantly and clearly?

5. In the bk_split function, we redistribute keys between the original bucket and a new one. Outline a possible implementation for this redistribution. How do you decide whether a key belongs in the new bucket or the original bucket?

6. Suppose the redistribution of keys in bk_split does not result in moving any keys into the new bucket. Under what conditions could such an event happen? How will the program handle this?

7. The bk_try_combine function is potentially recursive. In section 11.4.4 we described a situation in which there are empty buckets that can be combined with other buckets through a series of recursive calls to bk_try_combine. Describe two situations that could produce empty buckets in the hash structure.

8. Deletion occasionally results in collapsing the directory. Describe the conditions that must be met before the directory can collapse.

9. Deletion depends on finding buddy buckets. Why does the address depth for a bucket have to be the same as the address depth for the directory in order for a bucket to have a buddy?

10. In the extendible hashing procedure described in this chapter, the directory can occasionally point to empty buckets. Describe two situations that can produce empty buckets. How could we modify the procedures to avoid empty buckets?

11. If buckets are large, a bucket containing only a few records is not much less wasteful than an empty bucket. How could we minimize nearly empty buckets?

12. Linear hashing makes use of overflow records. Assuming an uncontrolled-splitting implementation where we split and extend the address space as soon as we have an overflow, what is the effect of using different bucket sizes for the overflow buckets? For example, consider overflow buckets that are as large as the original buckets. Now consider overflow buckets that can only hold one record. How does this choice affect performance in terms of space utilization and access time?

13. In section 11.6.3 we described an approach to linear hashing that controls splitting. For a load factor of 85%, the average number of accesses for a successful search is 1.20 (Litwin, 1980). Unsuccessful searches require an average of 1.78 accesses. Why is the average search length greater for unsuccessful searches?

14. Because linear hashing splits one bucket at a time, in order, until it has reached the end of the sequence, the overflow chains for the last buckets in the sequence can become much longer than those for the earlier buckets. Read about Larson's approach to solving this problem through the use of "partial expansions," originally described in Larson (1980) and subsequently summarized in Enbody and Du (1988). Write a pseudocode description of linear hashing with partial expansions, paying particular attention to how addressing is handled.

15. In section 11.6.3 we discussed different mechanisms for deferring the splitting of buckets in extendible hashing in order to increase storage utilization. What is the effect of using smaller overflow buckets rather than larger ones? How does the use of smaller overflow buckets compare with the idea of sharing overflow buckets?


Programming Exercises
16. Write a version of the make_address function that prints out the input key, the hash value, and the extracted, reversed address. Build a driver that allows you to enter keys interactively for this function and see the results. Study the operation of the function on different keys.

17. Write a simplified version of the extendible hashing program described in pseudocode in this chapter. This simplified version should
    Keep the directory and buckets in RAM rather than on disk;
    Hold three keys per bucket;
    Find and add keys, but not delete them;
    Accept keys entered interactively; and
    Display the resulting directory structure and buckets so you can see how the directory references the buckets and can see which buckets contain which keys.
Once you build this program, play with it to see how the directory grows as buckets split. Use the program developed in exercise 16 to develop sequences of keys that all hash to the same bucket. Enter such sequences and watch what happens.

18. Extend exercise 17 to include deletion. Once again, experiment with the program to see how deletion works. Try deleting all the keys. Try to create situations where the directory will recursively collapse over more than one level.

19. Write an extendible hashing program that stores and retrieves buckets from disk rather than from RAM.

20. Using the information in Enbody and Du (1988) and Litwin (1980), implement a simple, RAM-based linear hashing program.

FURTHER READINGS
For information about hashing for dynamic files that goes beyond what we present here, you must turn to journal articles. The best summary of the different approaches is Enbody and Du's Computing Surveys article titled "Dynamic Hashing Schemes," which appeared in 1988.

The original paper on extendible hashing is "Extendible Hashing: A Fast Access Method for Dynamic Files" by Fagin, Nievergelt, Pippenger, and Strong (1979). Larson (1978) introduces dynamic hashing in an article titled "Dynamic Hashing." Litwin's initial paper on linear hashing is titled "Linear Hashing: A New Tool for File and Table Addressing" (1980). All three of these introductory articles are quite readable; Larson's paper and Fagin, Nievergelt, Pippenger, and Strong are especially recommended.

Michel Scholl's 1981 paper titled "New File Organizations Based on Dynamic Hashing" provides another readable introduction to dynamic hashing. It also investigates implementations that defer splitting by allowing buckets to overflow.

Papers analyzing the performance of dynamic or extendible hashing often derive results that apply to either of the two methods. Flajolet (1983) presents a careful analysis of directory depth and size. Mendelson (1982) arrives at similar results and goes on to discuss the costs of retrieval and deletion as different design parameters are changed. Veklerov (1985) analyzes the performance of dynamic hashing when splitting is deferred by allowing records to overflow into a buddy bucket. His results can be applied to extendible hashing as well.

After introducing dynamic hashing, Larson wrote a number of papers building on the ideas associated with linear hashing. His 1980 paper titled "Linear Hashing with Partial Expansions" introduces an approach to linear hashing that can avoid the uneven distribution of the lengths of overflow chains across the cells in the address space. He followed up with a performance analysis in a 1982 paper titled "Performance Analysis of Linear Hashing with Partial Expansions." A subsequent, 1985 paper titled "Linear Hashing with Overflow-Handling by Linear Probing" introduces a method of handling overflow that does not involve chaining.

Appendix A
File Structures on CD-ROM

OBJECTIVES

Introduce the commercially important characteristics of CD-ROM storage.

Examine a storage medium with performance characteristics that are very different from those of magnetic disks; show how to apply good file structure design principles to develop solutions that are appropriate to this new medium.

Describe the directory structure of the CD-ROM file system and show how it grows from the characteristics of the medium.

OUTLINE

A.1 Using this Appendix
A.2 Introduction to CD-ROM
    A.2.1 A Short History of CD-ROM
    A.2.2 CD-ROM as a File Structure Problem
A.3 Physical Organization of CD-ROM
    A.3.1 Reading Pits and Lands
    A.3.2 CLV Instead of CAV
    A.3.3 Addressing
    A.3.4 Structure of a Sector
A.4 CD-ROM Strengths and Weaknesses
    A.4.1 Seek Performance
    A.4.2 Data Transfer Rate
    A.4.3 Storage Capacity
    A.4.4 Read-Only Access
    A.4.5 Asymmetric Writing and Reading
A.5 Tree Structures on CD-ROM
    A.5.1 Design Exercises
    A.5.2 Block Size
    A.5.3 Special Loading Procedures and Other Considerations
    A.5.4 Virtual Trees and Buffering Blocks
    A.5.5 Trees as Secondary Indexes on CD-ROM
A.6 Hashed Files on CD-ROM
    A.6.1 Design Exercises
    A.6.2 Bucket Size
    A.6.3 How the Size of CD-ROM Helps
    A.6.4 Advantages of CD-ROM's Read-Only Status
A.7 The CD-ROM File System
    A.7.1 The Problem
    A.7.2 Design Exercise
    A.7.3 A Hybrid Design

A.1 Using this Appendix

This appendix has two purposes. The first is to tell you about the commercially important performance characteristics of CD-ROM, an information distribution medium. The second is to use the problem of designing file structures for CD-ROM to review many of the design issues and techniques presented in the text.

We begin by introducing CD-ROM. We explain how CD-ROM works and enumerate the features that make file structure design for CD-ROM a different problem than file structure design for magnetic media.

Once we have examined CD-ROM's performance, we provide a high-level look at how this performance affects the design of tree structures, hashed indexes, and directory structures for CD-ROM. These discussions of trees and hashing do not present new information; they review material that has already been developed in detail. Since you already have the tools required to think through these design problems, we introduce exercises and questions throughout this discussion, rather than holding them to the end. We encourage you to stop at these blocks of questions, think carefully about the answers, and then compare results with the discussion that follows.

A.2 Introduction to CD-ROM

CD-ROM is an acronym for Compact Disc Read-Only Memory.* CD-ROM is a CD audio disc that contains digital data rather than digital sound. CD-ROM is commercially interesting because it can hold a lot of data and can be reproduced cheaply. A single disc can hold over 600 megabytes of data. That is approximately 200,000 printed pages, enough storage to hold almost 400 books the size of this one. Replicates can be stamped from a master disc for about only a dollar a copy.

*Usually we spell disk with a k, but the convention among optical disc manufacturers is to spell it with a c.

CD-ROM is read-only in the same sense as a CD audio disc: You cannot record on it. It is a publishing medium, used for distributing information to many users, rather than a data storage and retrieval medium like magnetic disks. Currently, CD-ROMs are often used to publish database information such as telephone directories, zip codes, and demographic information. There are also many CD-ROM products that deliver textual data, such as bibliographic indexes, abstracts, dictionaries, and encyclopedias, often in association with digitized images stored on the disc. They are also used to publish video information and, of course, digital audio.

A.2.1 A Short History of CD-ROM

CD-ROM is the offspring of videodisc technology developed in the late 1960s and early 1970s, before the advent of the home VCR. The goal was to store movies on disc. Different companies developed a number of methods for storing video signals, including one that used a needle to respond mechanically to grooves in a disc, much like a vinyl LP record does. The consumer products industry spent a great deal of money developing the different technologies, including several approaches to optical storage, and then spent years fighting over the question of which approach should become standard. The surviving format is one called LaserVision. By the time LaserVision emerged as the winner of these standards battles, the competing developers had not only spent enormous sums of money, but had also lost important market opportunities. These hard lessons were put to use in the subsequent development of CD audio and CD-ROM.

From the outset, there was interest in using LaserVision discs to do more than just record movies. The LaserVision format supports recording in both a CLV (Constant Linear Velocity) format that maximizes storage capacity and a CAV (Constant Angular Velocity) format that enables fast seek performance. By using the CAV format to access individual video frames quickly, a number of organizations, including the MIT Media Lab, produced prototype interactive video discs that could be used to teach and entertain.

In the early 1980s, a number of firms began looking at the possibility of storing digital, textual information on LaserVision discs. LaserVision stores data in an analog form; it is, after all, storing an analog video signal. Different firms came up with different ways of encoding digital information in an analog form so it could be stored on the disc. The capabilities demonstrated in the prototypes and early, narrowly distributed products were impressive. The videodisc has a number of performance characteristics that make it a technically more desirable medium than the CD-ROM; in particular, one can build drives that seek quickly and deliver information from the disc at a high rate of speed. But, reminiscent of the earlier disputes over the physical format of the videodisc, each of these pioneers in the use of LaserVision discs as computer peripherals had incompatible encoding schemes and error correction techniques. There was no standard format, and none of the firms was large enough to impose their format over the others through sheer marketing muscle. Potential buyers were frightened by the lack of a standard; consequently, the market never grew.

During this same period the Philips and Sony companies began work on a way to store music on optical discs. Rather than storing the music in the kind of analog form used on videodiscs, they developed a digital data format. Philips and Sony had learned hard lessons from the expensive standards battles over videodiscs. This time they worked with other players in the consumer products industry to develop a licensing system that resulted in the emergence of CD audio discs as a broadly accepted, standard format as soon as the first discs and players were introduced. CD audio appeared in the United States in early 1984. CD-ROM, which is a digital data format built on top of the CD audio standard, emerged shortly thereafter. The first commercially available CD-ROM drives appeared in 1985.

Not surprisingly, the firms that were delivering digital data on LaserVision discs first saw CD-ROM as a threat to their existence. They also recognized, however, that CD-ROM promised to provide what had always eluded them in the past: a standard physical format. Anyone with a CD-ROM drive was guaranteed that they could find and read a sector off of any disc manufactured by any firm. For a storage medium to be used in publishing, standardization at such a fundamental level is essential.

What happened next is remarkable in the history of standards and cooperation within an industry. The firms that had been working on products to deliver computer data from videodiscs recognized that a standard physical format, such as that provided by CD-ROM, was not enough. A standard physical format meant that everyone was guaranteed to be able to read sectors off of any disc. But computer applications do not work in terms of sectors; they store data in files. Having an agreement about finding sectors, without further agreement about how to organize the sectors into files, is like everyone's agreeing on an alphabet without having settled on how letters are to be organized into words on a page. In late 1985 the firms emerging from the videodisc/digital data industry, together with many of the much larger firms moving into the CD-ROM industry, began work on a standard file system to be built on top of the CD-ROM format. In a rare display of cooperation, the different firms, large and small, worked out the main features of a file system standard by early summer of 1986; that work has now emerged as an official international standard for organizing files on CD-ROM.

The CD-ROM industry is still young, though in the past two years it has begun to show signs of maturity: It is moving away from concentration on matters such as disc formats to a concern with CD-ROM applications; rather than focusing on the new medium in isolation, vendors are seeing it as an enabling mechanism for new systems. As it finds more uses in a broader array of applications, CD-ROM looks like an optical publishing technology that will be with us over the long term.

A.2.2 CD-ROM as a File Structure Problem

CD-ROM presents interesting file structure problems because it is a medium with great strengths and great weaknesses. The strengths of CD-ROM include the fact that it has a lot of storage capacity, it is inexpensive, and it is durable. The key weakness is that seek performance on CD-ROM is very slow, often taking from a half second to a second per seek. In the introduction to this textbook we compared RAM access and magnetic disk access and showed that if RAM access is analogous to your taking 20 seconds to look up something in the index to this textbook, the equivalent disk access would take 58 days, or almost two months. With a CD-ROM the analogy stretches the disc access to over two and a half years! This kind of performance, or lack of it, makes intelligent file structure design a critical concern for CD-ROM applications. CD-ROM provides an excellent test of our ability to integrate and adapt the principles we have developed in the preceding chapters of this book.

A.3 Physical Organization of CD-ROM

CD-ROM is the child of CD audio. In this instance, the impact of heredity is strong, with both positive and negative aspects. Commercially, the CD audio parentage is probably wholly responsible for CD-ROM's viability in the market. It is because of the enormous size of the CD audio market that it is possible to make CD-ROM discs so inexpensively. Similarly, advances in the design and decreases in the costs of making CD audio players affect the performance and price of CD-ROM drives. Other optical disc media that have not enjoyed the benefits of this parentage have not experienced the commercial success of CD-ROM.

On the other hand, making use of the manufacturing capacity associated with CD audio means adhering to the fundamental physical organization of the CD audio disc. Audio discs are designed to play music, not to provide fast, random access to data. This difference in design objective biases the CD toward having high storage capacity and moderate data transfer rates, but against decent seek performance. If an application requires good random-access performance, that performance has to emerge from our file structure design efforts; it won't come from anything inherent in the medium itself.

A.3.1 Reading Pits and Lands

CD-ROM discs are stamped from a master disc. The master is formed by using the digital data that we want to encode to turn a powerful laser on and off very quickly. The master disc, which is made of glass, has a coating that is changed by the laser beam. When the coating is developed, the areas hit by the laser beam turn into pits along the track followed by the beam. The smooth, unchanged areas between the pits are called lands. The copies formed from the master retain this pattern of pits and lands.

When we read the stamped copy of the disc, we focus a beam of laser light on the track as it moves under the optical pickup. The pits scatter the light, but the lands reflect most of it back to the pickup. This alternating pattern of high- and low-intensity reflected light is the signal used to reconstruct the original digital information. The encoding scheme used for this signal is not simply a matter of calling a pit a 1 and a land a 0. Instead, the 1s are represented by the transitions from pit to land and back again. Every time the light intensity changes, we get a 1. The 0s are represented by the amount of time between transitions; the longer between transitions, the more 0s we have.

If you think about this encoding scheme, you realize that it is not possible to have two adjacent 1s: 1s are always separated by 0s. In fact, due to the limits of the resolution of the optical pickup, there must be at least two 0s between any pair of 1s. This means that the raw pattern of 1s and 0s has to be translated in order to get the 8-bit patterns of 1s and 0s that form the bytes of the original data. This translation scheme, which is done through a lookup table, turns the original 8 bits of data into 14 expanded bits that can be represented in the pits and lands on the disc; the reading process reverses this translation. Figure A.1 shows a portion of the lookup table values. Readers who have looked closely at the specifications for CD players may have encountered the term EFM encoding. EFM stands for "eight to fourteen modulation" and refers to this translation scheme.

It is important to realize that since we represent the 0s in the EFM-encoded data by the length of time between transitions, our ability to read the data is dependent on moving the pits and lands under the optical pickup at a precise and constant speed. As we will see, this affects the CD-ROM drive's ability to seek quickly.

FIGURE A.1 A portion of the EFM encoding table.

    Decimal value    Original bits    Translated bits
    0                00000000         01001000100000
    1                00000001         10000100000000
    2                00000010         10010000100000
    3                00000011         10001000100000
    4                00000100         01000100000000
    5                00000101         00000100010000
    6                00000110         00010000100000
    7                00000111         00100100000000
    8                00001000         01001001000000
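As a small illustration of the table-driven translation (our own sketch; only the nine entries shown in Fig. A.1 are included, where a real encoder would carry all 256), the lookup amounts to a simple array index:

    #include <stdio.h>

    /* The first nine entries of the eight-to-fourteen lookup table, taken
       from Fig. A.1; the full table has one entry for each 8-bit value.   */
    static const char *efm_table[9] = {
        "01001000100000", "10000100000000", "10010000100000",
        "10001000100000", "01000100000000", "00000100010000",
        "00010000100000", "00100100000000", "01001001000000"
    };

    int main(void)
    {
        int byte;
        for (byte = 0; byte < 9; byte++)     /* encode: index by the byte value */
            printf("%3d -> %s\n", byte, efm_table[byte]);
        return 0;
    }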

A.3.2 CLV Instead of CAV

FIGURE A.2 CLV and CAV recording. [The figure contrasts the constant linear velocity spiral with a constant angular velocity layout of concentric tracks and pie-shaped sectors.]

Data on a CD-ROM is stored in a single, spiral track that winds for almost three miles from the center to the outer edge of the disc. This spiral pattern is part of the CD-ROM's heritage from CD audio. For audio data, which requires a lot of storage space, we want to pack the data on the disc as tightly as possible. Since we "play" audio data, often from start to finish without interruption, seeking is not important. As Fig. A.2 shows, a spiral pattern serves these needs well. A sector toward the outer edge of the disc takes the same amount of space as a sector toward the center of the disc. This means that we can write all of the sectors at the maximum density permitted by the storage medium. Since reading the data requires that it pass under the optical pickup device at a constant rate, the constant data density implies that the disc has to spin more slowly when we are reading at the outer edges than when we are reading toward the center. This is why the spiral is a Constant Linear Velocity (CLV) format: As we seek from the center to the edge, we change the rate of rotation of the disc so the linear speed of the spiral past the pickup device stays the same.

By contrast, the familiar Constant Angular Velocity (CAV) arrangement shown in Fig. A.2, with its concentric tracks and pie-shaped sectors, writes data less densely in the outer tracks than in the tracks toward the center. We are wasting storage capacity in the outer tracks but have the advantage of being able to spin the disc at the same speed for all positions of the read head. Given the sector arrangement shown in the figure, one rotation reads eight sectors, no matter where we are on the disc. Furthermore, a timing mark placed on the disk makes it easy to find the start of a sector.

The CLV format is responsible, in large part, for the poor seeking performance of CD-ROM drives. The CAV format provides definite track boundaries and a timing mark to find the start of a sector. The CLV format, on the other hand, provides no straightforward way to jump to a specific location. Part of the problem is associated with the need to change rotational speed as we seek across the disc. To adjust the speed, we need to know where we are. But to know where we are, we need to be able to read the address information that is stored on the disc along with the user's data. To read the address information, we need to be moving the data under the optical pickup at the correct speed. How does the drive's control mechanism break out of this loop? In practice, the answer often involves making guesses, finding the correct speed through trial and error. This takes time and slows down seek performance.

On the positive side, the CLV sector arrangement contributes to the CD-ROM's large storage capacity. Given a CAV arrangement, the CD-ROM would have only a little better than half its present capacity.

A.3.3 Addressing

The use of the CLV organization means that the familiar cylinder, track, sector way of identifying a sector address will not work on a CD-ROM. Instead, we use a sector-addressing scheme that is related to the CD-ROM's roots as an audio playback device. Each second of playing time on a CD is divided into 75 sectors, each of which holds 2 Kbytes of data. According to the original Philips/Sony standard, a CD disc, whether used for audio or CD-ROM, contains at least one hour of playing time. That means that the disc is capable of holding at least 540,000 Kbytes of data:

    60 minutes x 60 seconds/minute x 75 sectors/second = 270,000 sectors.

In fact, since it is possible to put over 70 minutes of playing time on a CD, the capacity of the disc is over 600 Mbytes.

We address a given sector by referring to the minute, second, and sector of play. So, the 34th sector in the 22nd second in the 16th minute of play would be addressed with the three numbers 16:22:34.
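For illustration, here is the arithmetic that ties a running sector count to the minute:second:sector address, expressed in C (our own sketch, which simply assumes that sectors are numbered consecutively from 0 at 00:00:00):

    #include <stdio.h>

    /* Convert a consecutive sector number into minute:second:sector form,
       using 75 sectors per second and 60 seconds per minute.              */
    void sector_to_msf(long sector, int *minute, int *second, int *fraction)
    {
        *fraction = (int)(sector % 75);
        *second   = (int)((sector / 75) % 60);
        *minute   = (int)(sector / (75L * 60));
    }

    int main(void)
    {
        int  m, s, f;
        long sector = 16L * 60 * 75 + 22L * 75 + 34;   /* the example 16:22:34 */
        sector_to_msf(sector, &m, &s, &f);
        printf("sector %ld is %02d:%02d:%02d\n", sector, m, s, f);
        printf("one hour of play = %ld sectors\n", 60L * 60 * 75);  /* 270,000 */
        return 0;
    }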

A.3.4 Structure of a Sector

It is interesting to look at the way that the fundamental design of the CD disc, initially designed for delivering digital audio information, has been adapted for computer data storage. This investigation will also help answer the question, "If the disc is capable of storing a quarter of a million printed pages, why does it hold only an hour's worth of Roy Orbison?"

When we want to store sound, we need to convert a wave pattern into digital form. At any given point in time, the wave has a specific amplitude. We digitize the wave by measuring the amplitude at very frequent intervals and storing the measurements. So, the question of how much storage space we need to represent a wave digitally turns into two other questions: How much space does it take to store each amplitude sample, and how often do we take samples?

FIGURE A.3 Digital sampling of a wave. [The figure plots an actual wave and the wave reconstructed from sample data at a given sampling frequency; the amplitude scale runs from -32767 to 32767.]

CD audio uses 16 bits to store each amplitude measurement; that means that the "ruler" we use to measure the height of the wave has 65,536 different gradations. To accurately approximate a wave through digital sampling, we need to take the samples at a rate that is more than twice as frequent as the highest frequency that we want to capture. This makes sense if you look at the wave in Fig. A.4. You can see that if we sample at less than twice the frequency of the wave, we lose information about the variation in the wave pattern. The designers of CD audio selected a sampling frequency of 44.1 KHz, or 44,100 times per second, so they could record sounds with frequencies ranging up to 20 KHz (20,000 cycles per second), which is toward the upper bound of what people can hear.

FIGURE A.4 The effect of sampling at less than twice the frequency of the wave.

So, if we are taking a 16-bit, or 2-byte, sample 44,100 times per second, we need to store 88,200 bytes per second. Since we want to store stereo sound, we need double this, storing 176,400 bytes per second. You can see why storing an hour of Roy Orbison takes so much space.

If you divide the 176,400-byte-per-second storage capacity of the CD into 75 sectors per second, you have 2,352 bytes per sector. CD-ROM divides up this "raw" sector storage as shown in Fig. A.5 to provide 2 K of user data storage, along with addressing information, error detection, and error correction information. The error correction information is necessary because, although CD audio contains redundancy for error correction, it is not adequate to meet computer data storage needs. The audio error correction would result in an average of one incorrect byte for every two discs. The additional error correction information stored within the 2,352-byte sector decreases this error rate to one uncorrectable byte in every 20,000 discs.

FIGURE A.5 Structure of a CD-ROM sector: 12 bytes synch, 4 bytes sector ID, 2,048 bytes user data, 4 bytes error detection, 8 bytes null, 276 bytes error correction.
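A two-line computation confirms how the raw sector budget is spent (our own illustration; the field sizes are those shown in Fig. A.5):

    #include <stdio.h>

    int main(void)
    {
        int raw    = 176400 / 75;                  /* raw bytes per sector       */
        int fields = 12 + 4 + 2048 + 4 + 8 + 276;  /* synch + sector ID + user
                                                      data + error detection +
                                                      null + error correction    */
        printf("raw bytes per sector       = %d\n", raw);     /* 2,352 */
        printf("sum of the Fig. A.5 fields = %d\n", fields);  /* 2,352 */
        return 0;
    }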

A.4 CD-ROM Strengths and Weaknesses

As we say throughout this book, good file design is responsive to the nature of the medium, making use of strengths and minimizing weaknesses. We begin, then, by cataloging the strengths and weaknesses of CD-ROM.

A.4.1 Seek Performance

The chief weakness of CD-ROM is the random-access performance. Current magnetic disk technology is such that the average time for a random data access, combining seek time and rotational delay, is about 30 msec. On a CD-ROM, this average access takes 500 msec and can take up to a second or more. Clearly, our file design strategies must avoid seeks to an even greater extent than on magnetic disks.

A.4.2 Data Transfer Rate

A CD-ROM drive reads 75 sectors, or 150 Kbytes of data, per second. This data transfer rate is part of the fundamental definition of CD-ROM; it can't be changed without leaving behind the commercial advantages of adhering to the CD audio standard. It is a modest transfer rate, about five times faster than the transfer rate for floppy disks, and an order of magnitude slower than the rate for good Winchester disks. The inadequacy of the transfer rate makes itself felt when we are loading large files, such as those associated with digitized images. On the other hand, the transfer rate is fast enough relative to the CD-ROM's seek performance that we have a design incentive to organize data into blocks, reading more data with each seek with the hope that we can avoid as much seeking as possible.
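The numbers behind that incentive are easy to work out. The short C program below (our own illustration) computes the transfer time for a few block sizes at the 150-Kbyte-per-second rate; even a fairly large block costs far less than the half second or more that a seek can take:

    #include <stdio.h>

    int main(void)
    {
        double rate = 150.0 * 1024.0;            /* bytes per second */
        long   sizes[] = { 2048, 8192, 65536 };
        int    i;
        for (i = 0; i < 3; i++)
            printf("%6ld-byte block: %5.0f msec to transfer\n",
                   sizes[i], 1000.0 * sizes[i] / rate);
        return 0;   /* roughly 13, 53, and 427 msec */
    }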

A.4.3 Storage Capacity

A CD-ROM holds more than 600 Mbytes of data. Although it is possible to use up this storage area very quickly, particularly if you are storing raster images, 600 Mbytes is big when it comes to text applications. If you decide to download 600 Mbytes of text with a 2,400-baud modem, it will take about three days of constant data transmission, assuming errorless transmission conditions. Many typical text databases and document collections published on CD-ROM use only a fraction of the disc's capacity.

The design benefit arising from such large capacity is that it enables us to build indexes and other support structures that can help overcome some of the limitations associated with CD-ROM's poor seek performance.

A.4.4 Read-Only Access

From a design standpoint, the fact that CD-ROM is a publishing medium, a storage device that cannot be changed after manufacture, provides significant advantages. We never have to worry about updating. This not only simplifies some of the file structures but also means that it is worthwhile to optimize our index structures and other aspects of file organization. We know that our efforts to optimize access will not be lost through later additions or deletions.

A.4.5 Asymmetric Writing and Reading

For most media, files are written and read using the same computer system. Often, reading and writing are both interactive and are therefore constrained by the need to provide quick response to the user. CD-ROM is different. We create the files to be placed on the disc once; then we distribute the disc, and it is accessed thousands, even millions, of times. We are in a position to bring substantial computing power to the task of file organization and creation, even when the disc will be used on systems with much less capability. In fact, we can use extensive, batch-mode processing on large computers to try to provide systems that will perform well on small machines. We make the investment in intelligent, carefully designed file structures only once; users can enjoy the benefits of this investment again and again.

A.5 Tree Structures on CD-ROM

A.5.1 Design Exercises

Tree structures are a good way to organize indexes and data on CD-ROM. Chapters 8 and 9 took a close look at B-trees and B+ trees. Before we discuss the effective use of trees on CD-ROM, think through these design questions:

1. How big should the block size be for B-trees and B+ trees?

2. How far should you go in the direction of using virtual tree structures? How much memory should you set aside for buffering blocks?

3. How could you use special loading procedures to advantage in a B+ tree implementation? Are there similar procedures that will assist in the loading of B-trees?

4. Suppose we have a primary index and several secondary indexes to a set of records. How should you organize these access mechanisms for your CD-ROM? Address the issues of binding and pinned records in your reply.

5.2 Block Size

Avoiding seeks is the key strategy in CD-ROM file structure design. Consequently, B-tree and B+ tree structures are good choices for implementing index structures on CD-ROM. As we showed in Chapters 8 and 9, given a large enough block size, B-trees and B+ trees can provide access to a large number of records in only a few seeks.

How large should the block size be? The answer, of course, depends on the application, but it is possible to provide some general guidelines. First, since the sector size of the CD-ROM is 2 Kbytes, the block size should not be less than 2 Kbytes. The sector is the smallest addressable unit on the disc; consequently, it does not make sense to read in anything less than a sector. Since the CD-ROM's sequential reading performance is moderately fast, especially when viewed relative to its seeking performance, it is usually attractive to use a block composed of several sectors. Once you have spent the better part of a second seeking for the sector and reading it, reading an additional 6 Kbytes to make an 8-Kbyte block takes only an additional 40 msec. If this added fraction of a second can contribute to avoiding another seek, it is time well spent.

Table A.1 shows the maximum number of 32-byte records that can be contained in a B-tree as the tree changes in height and block size. The dramatic effect of block size on the record counts for two- and three-level trees suggests that large tree structures should usually use at least an 8-Kbyte block.

TABLE A.1 The maximum number of 32-byte records that can be stored in a B-tree of given height and block size

                              Tree Height
                      One Level    Two Levels    Three Levels

Block size = 2 K          64          4,224          274,624
Block size = 4 K         128         16,640        2,146,688
Block size = 8 K         256         66,048       16,974,592
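The record counts in Table A.1 follow a simple pattern: a block that holds r records can, in a completely full tree of height h, provide access to (r + 1)^h - 1 records. The short C fragment below is only an illustrative sketch we have added here, not one of the text's programs; it ignores the space taken by block headers and child pointers, which is why the neat figure of 64 records per 2-Kbyte block works out for 32-byte records.

    /* Reproduce the entries of Table A.1: a block holding r records
     * yields (r + 1)^height - 1 records in a completely full B-tree.
     * Block overhead is ignored to keep the arithmetic simple.       */
    #include <stdio.h>

    long max_records(long block_size, long record_size, int height)
    {
        long r = block_size / record_size;   /* records per block        */
        long total = 1;
        int  i;

        for (i = 0; i < height; i++)         /* compute (r + 1)^height   */
            total *= (r + 1);
        return total - 1;
    }

    int main(void)
    {
        long sizes[] = { 2048, 4096, 8192 };
        int  i, h;

        for (i = 0; i < 3; i++) {            /* one row per block size   */
            for (h = 1; h <= 3; h++)
                printf("%12ld", max_records(sizes[i], 32L, h));
            printf("\n");
        }
        return 0;
    }

Running the program prints the three rows of Table A.1, and it is easy to substitute other record or block sizes when sizing blocks for your own application.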


A.5.3 Special Loading Procedures and Other Considerations

B+ trees are commonly used in CD-ROM applications because they provide both indexed and sequential access to records. If, for example, you are building a telephone directory system for CD-ROM, you will need an index that can provide fast access to any one of the millions of names that appear on the disc. You will also want to provide sequential access so once users have found a name, they can browse through records with the same name, checking addresses, to make sure they have the right phone number.

B+ trees are also attractive in CD-ROM applications because they can provide very shallow, broad indexes to a set of sequenced records. As we showed in Chapter 9, the content of the index part of a B+ tree can consist of nothing more than the shortest separators required to provide access to lower levels of the tree and, ultimately, to the target records. If these shortest separators are only a few bytes long, as is frequently the case, it is often possible to provide access to millions of records with an index that is only two levels deep. An application can keep the root of this index in RAM, reducing the cost of searching the index part of the tree to a single seek. With one additional seek we are at the record in the sequence set.

Another attractive feature of B+ trees is that it is easy to build a two-level index above the sequence set with a separate loading procedure that builds the tree from the bottom up. We described this operation in Chapter 9; a sketch of the idea follows below. The great advantage of this kind of loading procedure, as opposed to building the tree through a series of top-down insertions, is that we can pack the nodes and leaves of the tree as fully as we wish. With CD-ROM, where the cost of additional seeks is so high, and where there is absolutely no possibility that anyone will make additional insertions to the tree, we will want to pack the nodes and leaves of the tree so they are completely full. This is an example of a design decision that recognizes that the CD-ROM is a publishing medium that, once constructed, is used only for retrieval, and never for additional storage.

This kind of special, 100%-full loading procedure can also be designed for B-tree applications. The procedure for B-trees is usually somewhat more complex because the index will often consist of more than just a root node and one level of children. The loading procedure for B-trees has to manage more levels of the tree at a time.
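As a rough sketch of the bottom-up idea for the simpler B+ tree case, suppose the sequence set has already been written as completely full blocks, and that we have collected the shortest separator and disc address for each block. The index level above the sequence set is then built by packing those pairs, in order, into index blocks that are also completely full. The routine below is an illustration we have added under those assumptions (fixed-length separators, hypothetical names); it is not one of the text's programs.

    /* Bottom-up loading of the index level of a B+ tree.  sep[i] and
     * block_addr[i] give the shortest separator and the disc address
     * of sequence set block i.  Each index block is packed completely
     * full before the next one is started.                            */
    #include <stdio.h>

    #define SEP_LENGTH      12     /* assumed fixed-length separators          */
    #define SEPS_PER_BLOCK 340     /* depends on block size and entry size     */

    void build_index_level(char *sep[], long block_addr[], long nblocks,
                           FILE *index_file)
    {
        long i;
        int  in_block = 0;

        for (i = 0; i < nblocks; i++) {
            /* one (separator, child address) entry per sequence set block */
            fwrite(sep[i], 1, SEP_LENGTH, index_file);
            fwrite(&block_addr[i], sizeof(long), 1, index_file);
            if (++in_block == SEPS_PER_BLOCK) {
                in_block = 0;
                /* a real loader would pad to the 2-Kbyte sector boundary
                   here and remember this index block's address so a root
                   block can be built above it in the same way            */
            }
        }
    }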


This discussion of indexes, and the importance of packing them as
tightly as possible, brings home one of the interesting paradoxes of

CD-ROM

The

CD-ROM

large

storage

capacity that usually gives us a great deal of freedom with regard to

how we

design.

disc

has

relatively

on the disc; a few bytes here or there usually doesn't matter much
when you have 600 Mbytes of capacity. But when we design the index
store data

556

APPENDIX

A: FILE

STRUCTURES ON CD-ROM

structures for

CD-ROM, we

even counting

bits as

reason for this

is

we pack

not, in

most

find ourselves counting bytes,

information into
cases, that

we

a single

are

sometimes

byte or integer.

The

running out of space on the

but because packing the index tightly can often save us from making
file design the cost of seeks adds up very
quickly; the designer needs to get as much information out of every seek as
disc,

CD-ROM

an additional seek. In
possible.

A.5.4 Virtual Trees and Buffering Blocks

Given the very high cost of seeking on CD-ROM, we will want to keep blocks in RAM for as long as they are likely to be useful. The tree's root node should always be buffered. As we indicated in our discussion of virtual trees in Chapter 8, buffering nodes below the root can sometimes contribute significantly to reducing seek time, particularly when the buffering is intelligent in selecting the node to replace in the buffer. Buffering is most useful when successive accesses to the tree tend to be clustered in one area.

Note that packing the tree as tightly as possible during loading, which we discussed earlier as a way to reduce tree height, also increases the likelihood that an index block in RAM will be useful on successive accesses to the data.
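A minimal sketch of this kind of buffering is shown below: the root is pinned in one slot, and other blocks are replaced on a least-recently-used basis. The buffer count, block size, and read_block() routine are assumptions made for illustration; this is not code from the text's programs.

    /* Minimal LRU buffer for tree blocks held in RAM.  Slot 0 holds
     * the root and is never replaced.  Assumes pool[].addr has been
     * initialized to -1 before the first call.                       */
    #define NBUFFERS 8

    typedef struct {
        long addr;          /* disc address of the block, -1 if empty  */
        long last_used;     /* logical clock value of last access      */
        char data[8192];    /* the block itself (8-Kbyte blocks)       */
    } Buffer;

    static Buffer pool[NBUFFERS];
    static long   clock_tick = 0;

    extern void read_block(long addr, char *dest);   /* assumed I/O routine */

    char *get_block(long addr)
    {
        int i, victim = 1;                 /* never replace slot 0 (the root) */

        for (i = 0; i < NBUFFERS; i++)
            if (pool[i].addr == addr) {    /* already buffered: no seek needed */
                pool[i].last_used = ++clock_tick;
                return pool[i].data;
            }
        for (i = 2; i < NBUFFERS; i++)     /* choose the least recently used slot */
            if (pool[i].last_used < pool[victim].last_used)
                victim = i;
        read_block(addr, pool[victim].data);   /* one seek on the CD-ROM */
        pool[victim].addr = addr;
        pool[victim].last_used = ++clock_tick;
        return pool[victim].data;
    }

When successive searches cluster in one part of the tree, the blocks they share stay in the pool and the number of seeks per search drops well below the height of the tree.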

A.5.5 Trees as Secondary Indexes on CD-ROM

Typically, CD-ROM applications provide more than one access route to the data on the disc. For example, document retrieval applications usually give direct access to the documents, so you can page through them in sequence or call them up by name, chapter, or section while also providing access through an index of keywords or included terms. Similarly, in a telephone directory application you would have access to the database by name, but also by location (state, city, zip code, street address). As we described in Chapter 6, secondary indexes provide these multiple views of the data.

Chapter 5 raised the design issue of whether the secondary indexes should be tightly bound to the records they point to, or whether the binding should take place at retrieval time, through the use of a common key accessed through yet another index. Viewed another way, the issue is whether the target records should be pinned to a specific location through references in secondary indexes, or whether they should be left unpinned so they can be reorganized.

Records will never be reorganized on a CD-ROM; since it is a read-only disc, there is no disadvantage to having pinned records. Further, minimizing the number of seeks is the overriding design consideration on
CD-ROM. Consequently, secondary index designs for CD-ROM should
usually bind the indexes to the target records as tightly as possible, ensuring
that once you have found the correct place in the index, you are ready to
retrieve the target with, at most, one additional seek.
One objection to this bind-tightly approach to CD-ROM index design is that, although it is true that the indexes cannot be reorganized once written to the CD-ROM, they are, in fact, quite frequently reorganized between successive "editions" of the disc. Many CD-ROM publications are reissued to keep them up to date. The period between successive versions may be years, or may be as short as a week. So, although pinned records cause no problem on the finished disc, they may cause a great deal of difficulty in the files used to prepare the disc.

There are a number of approaches to resolving this tension between what is best on the published disc and what is best for the files used to produce it. One solution is to maintain loosely bound records in the source database, transforming them to tightly bound records for publication on CD-ROM. CD-ROM product designers often fail to realize that the file structures placed on the disc can, and often should, be different from the file structures used to maintain the source data and produce the discs. Another solution, of course, is to trade off performance on the published disc for decreased costs in producing it. Production costs, time constraints, user acceptance, and competitive factors interact to determine which course is best. The key issue from the file designer's standpoint is to recognize that the alternatives exist, and then to be able to quantify the costs and benefits of each.

A.6 Hashed Files on CD-ROM

A.6.1 Design Exercises

Hashing, with its promise of single access retrieval, is an excellent way to organize indexes on CD-ROM. We begin with some design questions that intersect your knowledge of hashing with what you now know about CD-ROM. As you think through your answers, remember that your goal should be to avoid any additional seeking due to hash bucket overflow. As in any hashing design problem, the design parameters that you can manipulate are

    Bucket size;
    Packing density for the hashed index; and
    The hash function itself.


The following questions, which you should try to answer before you read on, encourage you to think about ways to use these parameters to build efficient CD-ROM applications.

1. What considerations go into choosing a bucket size?

2. How does the relatively large storage capacity of CD-ROM assist in developing efficient hashed retrieval?

3. Since a CD-ROM is read-only, you have a complete list of the keys to be hashed before you create the disc. How can this assist in reducing retrieval costs?

A.6.2 Bucket Size

In Chapter 10 we showed how to reduce overflow, and therefore retrieval time, by grouping records into buckets, so each hashed address references an entire bucket of records. Since any access to a CD-ROM always reads in a minimum of a 2-Kbyte sector, the bucket size should be a multiple of 2 Kbytes. Having the bucket be only a part of a sector would be counterproductive. As we described in Chapter 3, transferring anything less than a sector means first moving the data into a system buffer, and from there into the user's data area. With transfers of a complete sector, many operating systems can move the data directly into the user area.

How many sectors should go into a bucket? As with trees, it is a trade-off between seeking and sequential reading. In addition, larger buckets require more searching and comparing to find the record once the bucket is read into RAM. In Chapter 10 we provided tools to allow you to calculate the effect of bucket size on the probability of overflow. For CD-ROM applications, you will want to use these tools to reduce the probability of overflow to almost nothing.

A.6.3 How the Size of CD-ROM Helps

Packing the hashed file loosely is another way to avoid overflow and additional seeking. A good rule of thumb is that, even with only a moderate bucket size, keeping the packing density below 60% will tend to avoid overflow almost all the time. Consulting Tables 10.4 and 10.5 in Chapter 10, we see that for randomly distributed keys, a packing density of 60% and a bucket size of 10 will reduce the percentage of records that overflow to 1.3% and will reduce the average number of seeks required for a successful search to 1.01. When there is unused space available on the disc, there is no disadvantage to expanding the size of the hashed index so overflow is virtually eliminated.
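These figures can be checked with the Poisson model of Chapter 10. The sketch below is an illustration we have added (the function name is our own, not from the text's programs): it estimates the fraction of records that overflow for a bucket size b and packing density d, and for b = 10 and d = 0.60 it produces a value close to the 1.3 percent quoted above.

    /* Expected fraction of overflowing records under the Poisson
     * model: a bucket receives x keys with probability
     * p(x) = (bd)^x e^(-bd) / x!, and each key beyond the b-th
     * overflows.                                                    */
    #include <stdio.h>
    #include <math.h>

    double overflow_fraction(int b, double d)
    {
        double lambda = b * d;       /* expected keys per bucket address */
        double p = exp(-lambda);     /* p(0)                             */
        double overflow = 0.0;
        int x;

        for (x = 1; x <= 50 * b; x++) {
            p *= lambda / x;         /* p(x) from p(x - 1)               */
            if (x > b)
                overflow += (x - b) * p;
        }
        return overflow / lambda;    /* fraction of all records          */
    }

    int main(void)
    {
        printf("b = 10, d = 0.60: %.4f\n", overflow_fraction(10, 0.60));
        return 0;
    }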

A.6.4 Advantages of CD-ROM's Read-Only Status

What if space is at a premium on the CD-ROM disc, and you need to find a way to pack your index so it is more than 90% full? Despite the relatively large capacity of CD-ROM discs, this situation is fairly common. Large text collections often use most of the disc just for text. If the product is storing digitized images along with the text, the available space disappears even more quickly. Applications requiring the use of two discs at once are much harder to sell and deliver than a single disc application; when a disc is already nearly full of data, the index files are always a target for size reduction.

The calculations that we do to estimate the effects of bucket size and packing density assume a random distribution of keys across the address space. If we could find a hash function that would distribute the keys uniformly, rather than randomly, we could achieve 100% packing density and no overflow.

Once again, the fact that CD-ROM is read-only opens up possibilities that would not be available in a dynamic, read-write environment. When we produce a CD-ROM, we have all the keys that are to be hashed at hand. This means that we do not have to choose a hash function and then settle for whatever distribution of keys it produces, hoping for the best, but expecting a distribution that is merely random. Instead, we can select a hash function that provides the performance we need, given the set of keys we have to hash. If our performance and space constraints require it, we can develop a hash function that produces no overflow even at very high packing densities. We identify the selected hash function on the disc, along with the data, so the retrieval software knows how to locate the keys. This relatively expensive and time-consuming function-fitting effort is worthwhile because of the asymmetric nature of writing and reading CD-ROMs; the one-time effort spent in making the disc is paid back many times as the disc is distributed to many users.
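A simple-minded sketch of this function-fitting idea appears below: given the complete set of keys, we try one member of a parameterized family of hash functions after another and keep the first that causes no bucket overflow. The fold-and-shift style function and its seed parameter are stand-ins of our own; at very high packing densities, more systematic perfect-hashing techniques, such as those of Chichelli (1980) or Sager (1985) cited in the bibliography, are generally needed.

    /* Try seeds 1..max_seed until one produces no overflow for the
     * given key set, bucket count, and bucket size.  counts[] is a
     * caller-supplied array of nbuckets integers.                    */
    #include <string.h>

    static long hash(const char *key, long seed, long nbuckets)
    {
        long h = seed;
        while (*key)
            h = (h * 31 + *key++) % nbuckets;
        return h;
    }

    /* Returns a seed causing no overflow, or -1 if none was found. */
    long fit_hash(char *keys[], long nkeys, long nbuckets, int bucket_size,
                  int counts[], long max_seed)
    {
        long seed, i;
        int  ok;

        for (seed = 1; seed <= max_seed; seed++) {
            memset(counts, 0, nbuckets * sizeof(int));
            ok = 1;
            for (i = 0; i < nkeys && ok; i++)
                if (++counts[hash(keys[i], seed, nbuckets)] > bucket_size)
                    ok = 0;            /* a bucket overflowed: reject this seed */
            if (ok)
                return seed;           /* record this seed on the disc          */
        }
        return -1;
    }

The chosen seed (or whatever parameters identify the fitted function) is stored on the disc with the data, so the retrieval software can recompute the same addresses.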

A.7 The CD-ROM File System

A.7.1 The Problem

When the firms involved in developing CD-ROM applications came together to begin work on a common file system in late 1985, they were confronted with an interesting file structures problem. The design goals and constraints included the following:

    Support hierarchical directory structures;
    Find and open any one of thousands of files with only one or two seeks; and
    Support the use of generic file names, as in "file*.c", during directory access.

The usual way to support hierarchical directories is to treat the directories as nothing more than a special kind of file. If, using the full UNIX path notation, you are looking for a file with the path

    /usr/home/mydir/filebook/cdrom/part3.txt

you look in the root directory (/) to find the directory file usr, then you open usr to find the location of the directory file home, you seek to home and open it to find mydir, and so on until you finally open the directory file named cdrom, where you find the location of the target file part3.txt. This is a very simple, flexible system; it is the approach used in MS-DOS, UNIX, VMS, and many other operating systems. The problem, from the standpoint of a CD-ROM developer, is that before we can find the location of part3.txt, we must seek to, open, and use six other files. At a half-second per seek on CD-ROM, such a directory structure results in a very unresponsive file system.
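The cost is easy to see if we write the lookup as a loop over path components, with one directory file read (and therefore at least one seek) per component. The sketch below is our own illustration; find_in_directory() is an assumed routine standing in for whatever directory search the file system performs. For the path above, the loop performs six directory reads — the root plus five subdirectories — before the location of part3.txt itself is known.

    /* Resolve a path by reading one directory file per component.   */
    #include <stdio.h>
    #include <string.h>

    extern long find_in_directory(long dir_addr, const char *name); /* assumed: one seek */

    long resolve(const char *path, long root_addr, int *seeks)
    {
        char buf[256], *component;
        long addr = root_addr;

        *seeks = 0;
        strncpy(buf, path, sizeof(buf) - 1);
        buf[sizeof(buf) - 1] = '\0';
        for (component = strtok(buf, "/"); component != NULL;
             component = strtok(NULL, "/")) {
            addr = find_in_directory(addr, component);  /* seek + read directory */
            (*seeks)++;
        }
        return addr;     /* disc address of the target file's data */
    }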

A.7.2 Design Exercise

At the time of the initial meetings to begin looking at a standard CD-ROM directory structure and file system, a number of vendors were using this treat-directories-as-files approach, literally replicating magnetic disc directory systems on CD-ROM. There were at least two alternative approaches that were commercially available and more specifically tailored to CD-ROM. One placed the entire directory structure in a single file, building a left child, right sibling tree to express the directory structure. Given the directory hierarchy shown in Fig. A.6, this system produced a file containing the tree shown in Fig. A.7. The other system created an index to the file locations by hashing the full path names of each file. The entries in the hash table for the directory structure in Fig. A.6 are shown in Fig. A.8.

Considering what you know about CD-ROM (slow seeking, read-only, and so on), think about these alternative file systems and try to answer the following questions. Keep in mind the design goals and constraints that were facing the committee (hierarchical structure, fast access to thousands of files, use of generic file names).

1. List the advantages and disadvantages of each system.

2. Try to come up with an alternative approach that combines the best features of the other systems while minimizing the disadvantages.


FIGURE A.6 A sample directory hierarchy.

    ROOT
        REPORTS
            SCHOOL:   S1.RPT, S2.RPT
            WORK:     W1.RPT
        LETTERS
            ABC.LTR, XYZ.LTR
            PERSONAL: P1.LTR, P2.LTR
            WORK:     W1.LTR
        CALLS.LOG

FIGURE A.7 Left child, right sibling tree to express directory structure.



    Path Names                    --> Hash Function --> Hash Table of Path Names

    /REPORTS/SCHOOL/S1.RPT
    /REPORTS/SCHOOL/S2.RPT
    /REPORTS/WORK/W1.RPT
    /LETTERS/ABC.LTR
    /LETTERS/XYZ.LTR
    /LETTERS/PERSONAL/P1.LTR
    /LETTERS/PERSONAL/P2.LTR
    /LETTERS/WORK/W1.LTR
    /CALLS.LOG

FIGURE A.8 Hashed index of file pathnames.

A.7.3 A Hybrid Design

Placing the entire directory structure into a single file, with the left-child, right-sibling tree, works well as long as the directory structure is small. If the file containing the tree fits into a few kilobytes, the entire directory structure can be held in RAM and can be accessed without any seeking at all. But if the directory structure is large, containing thousands of files, accessing the various parts of the tree can require multiple seeks, just as it does when each directory is a separate file.

Hashing the path names, on the other hand, provides single-seek access to any file but does a very poor job of supporting generic file and directory names, such as prog*.c, or even a simple command such as ls or dir to list all the files in a given subdirectory. By definition, hashing randomizes the distribution of the keys, scattering them over the directory space. Finding all of the files in a given subdirectory, say the letters subdirectory for the tree shown in Fig. A.6, requires a sequential reading of the entire directory.

What about a hybrid approach, in which we build a conventional directory structure that uses a file for each directory and then supplement this by building a hashed index to all the files in all directories? This approach allows us to get to any subdirectory, and therefore to the information required to open a file, with a single seek. At the same time, it provides us with the ability to work with all the files inside each directory, using generic file names and commands such as ls and dir. In short, we build a conventional directory structure to get the advantages of that approach, and then solve the access problem by building an index for the subdirectories.


FIGURE A.9 Path index table of directories.

    RRN    Directory    Parent
     0     ROOT          -1
     1     REPORTS        0
     2     LETTERS        0
     3     SCHOOL         1
     4     WORK           1
     5     PERSONAL       2
     6     WORK           2

This is very close to the approach that the committee settled on. But they went one step further. Since the directory structure is a highly organized, hierarchical collection of files, the committee decided to use a special index that takes advantage of that hierarchy, rather than simply hashing the path names of the subdirectories. Figure A.9 shows what this index structure looks like when it is applied to the directory structure in Fig. A.6. Only the directories are listed in the index; access to the data files is through the directory files. The directories are ordered in the index so parents always appear before their children. Each child is associated with an integer that is a backward reference to the relative record number (RRN) of the parent. This allows us to distinguish between the WORK directory under REPORTS and the WORK directory under LETTERS. It also allows us to traverse the directory structure, moving both up and down with a command such as the cd command in DOS or UNIX, without having to access the actual directory files on the CD-ROM. It is a good example of a specialized index structure that makes use of the organization inherent in the data to produce a very compact, highly functional access mechanism.
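To make the idea concrete, here is a small sketch we have added (it uses the entries of Fig. A.9 and names of our own choosing) showing how such a table can be held in RAM and used to reconstruct directory paths, or to move about the hierarchy, without touching the directory files on the disc.

    /* Directory index of Fig. A.9: each entry names a directory and
     * carries the RRN of its parent.  Parents precede children, so
     * the whole table can be read into RAM in one pass.              */
    #include <stdio.h>

    typedef struct {
        char name[16];
        int  parent;        /* RRN of parent directory, -1 for the root */
    } DirEntry;

    /* Print the full path of the directory with relative record number rrn. */
    void print_path(DirEntry table[], int rrn)
    {
        if (table[rrn].parent >= 0) {
            print_path(table, table[rrn].parent);
            printf("/%s", table[rrn].name);
        }
    }

    int main(void)
    {
        DirEntry table[] = {
            { "ROOT",    -1 }, { "REPORTS",  0 }, { "LETTERS",  0 },
            { "SCHOOL",   1 }, { "WORK",     1 }, { "PERSONAL", 2 },
            { "WORK",     2 }
        };
        int i;

        for (i = 1; i < 7; i++) {       /* RRN 0 is the root itself */
            print_path(table, i);
            printf("\n");
        }
        return 0;
    }

Because the backward references distinguish the two WORK entries by their parents, the program prints /REPORTS/WORK and /LETTERS/WORK as distinct paths, just as a cd command would need to.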

Summary

CD-ROM is an electronic publishing medium that allows us to replicate and distribute large amounts of information very inexpensively. The primary disadvantage of CD-ROM is that seek performance is relatively slow. This is not a problem that can be solved simply by building better drives; the limits in seek performance grow directly from the fact that CD-ROM is built on top of the CD audio standard. Adherence to this standard, even given its limitations, is the basis for CD-ROM's success as a publishing medium. Consequently, CD-ROM application developers must look to careful file structure design to build fast, responsive retrieval software.


B-tree and B+ tree structures work well on CD-ROM because of their ability to provide access to many keys with just a few seeks. Because the sector size on CD-ROM is 2 Kbytes, the block size used in a tree should be 2 Kbytes or an even multiple of this sector size. Because CD-ROM drives seek so slowly, it is usually advantageous to use larger blocks consisting of 8 Kbytes or more. Since no additions or deletions will be made to a tree once it is on CD-ROM, it is useful to build the trees from the bottom up so the blocks are completely filled. When using trees to create secondary indexes, the read-only nature of CD-ROM makes it possible to bind the indexes tightly to the target data, pinning the index records to reduce seeking and increase performance.

Hashed indexes are often a good choice for CD-ROM because they can provide single-seek access to target information. As with trees, the 2-Kbyte sector size affects the design of the hashed index: The bucket size should be one or more full sectors. Since CD-ROMs are large, there is often enough space on the disc to permit use of packing densities of 60% or less for the hashed index. Use of packing densities of less than 60%, combined with a bucket size of 10 or more records, results in single-seek access for almost all records. But it is not always possible to pack the index this loosely. Higher packing densities can be accommodated without loss of performance if we tailor the hash function to the records in the index, using a function that provides a more nearly uniform distribution. Since we know that there will be no deletions or additions to the index, and since time spent optimizing the index will result in benefits again and again as the discs are used, it is often worthwhile to invest the effort in finding the best of several hash functions. This is especially true when we need to support higher packing densities of 90% or more.

In 1985 companies trying to build the CD-ROM publishing market faced an interesting file structure problem. They realized that they needed a common directory structure and file system for CD-ROM. At the time, there were no directory structure designs in use on CD-ROM that provided nearly optimal performance across a wide variety of applications. Directory structures are usually implemented as a series of files. Moving from a directory to a subdirectory beneath it means seeking to another file. This is not a good design for CD-ROM, since it could result in a wait of several seconds just to locate and open a single file. Simple alternatives, such as putting the entire directory structure in a single file, or hashing the path names of all the files on a disc, have other drawbacks. The committee charged with solving this problem emerged with a design that combined a conventional hierarchical directory of files with an index to the directory. The index makes use of the structure inherent in the directory hierarchy to provide a very compact, yet functional map of the file directory structure. Typical of other CD-ROM indexing problems, this directory index illustrates the importance of building indexes very tightly on CD-ROM, despite the vast, often unused capacity of the CD-ROM disc. Tight, dense indexes work better on CD-ROM because they require fewer seeks to access. Avoiding seeks is the key consideration for all CD-ROM file structure design.

Appendix B
ASCII Table

Dec Oct Hex        Dec Oct Hex       Dec Oct Hex       Dec Oct Hex

  0 000 00 nul      32 040 20 sp      64 100 40 @       96 140 60 `
  1 001 01 soh      33 041 21 !       65 101 41 A       97 141 61 a
  2 002 02 stx      34 042 22 "       66 102 42 B       98 142 62 b
  3 003 03 etx      35 043 23 #       67 103 43 C       99 143 63 c
  4 004 04 eot      36 044 24 $       68 104 44 D      100 144 64 d
  5 005 05 enq      37 045 25 %       69 105 45 E      101 145 65 e
  6 006 06 ack      38 046 26 &       70 106 46 F      102 146 66 f
  7 007 07 bel      39 047 27 '       71 107 47 G      103 147 67 g
  8 010 08 bs       40 050 28 (       72 110 48 H      104 150 68 h
  9 011 09 ht       41 051 29 )       73 111 49 I      105 151 69 i
 10 012 0A nl       42 052 2A *       74 112 4A J      106 152 6A j
 11 013 0B vt       43 053 2B +       75 113 4B K      107 153 6B k
 12 014 0C np       44 054 2C ,       76 114 4C L      108 154 6C l
 13 015 0D cr       45 055 2D -       77 115 4D M      109 155 6D m
 14 016 0E so       46 056 2E .       78 116 4E N      110 156 6E n
 15 017 0F si       47 057 2F /       79 117 4F O      111 157 6F o
 16 020 10 dle      48 060 30 0       80 120 50 P      112 160 70 p
 17 021 11 dc1      49 061 31 1       81 121 51 Q      113 161 71 q
 18 022 12 dc2      50 062 32 2       82 122 52 R      114 162 72 r
 19 023 13 dc3      51 063 33 3       83 123 53 S      115 163 73 s
 20 024 14 dc4      52 064 34 4       84 124 54 T      116 164 74 t
 21 025 15 nak      53 065 35 5       85 125 55 U      117 165 75 u
 22 026 16 syn      54 066 36 6       86 126 56 V      118 166 76 v
 23 027 17 etb      55 067 37 7       87 127 57 W      119 167 77 w
 24 030 18 can      56 070 38 8       88 130 58 X      120 170 78 x
 25 031 19 em       57 071 39 9       89 131 59 Y      121 171 79 y
 26 032 1A sub      58 072 3A :       90 132 5A Z      122 172 7A z
 27 033 1B esc      59 073 3B ;       91 133 5B [      123 173 7B {
 28 034 1C fs       60 074 3C <       92 134 5C \      124 174 7C |
 29 035 1D gs       61 075 3D =       93 135 5D ]      125 175 7D }
 30 036 1E rs       62 076 3E >       94 136 5E ^      126 176 7E ~
 31 037 1F us       63 077 3F ?       95 137 5F _      127 177 7F del

Appendix C
String Functions in Pascal: tools.prc

Functions and Procedures Used to Operate on strng

The following functions and procedures make up the tools for operating on variables that are declared as:

    TYPE
        strng = packed array [0..MAX_REC_LGTH] of char;

The length of the strng is stored in the zeroth byte of the array as a character representative of the length. Note that the Pascal functions CHR() and ORD() are used to convert integers to characters and vice versa. Functions include:

    len_str(str)               Returns the length of str.
    clear_str(str)             Clears str by setting its length to 0.
    copy_str(str1,str2)        Copies contents of str2 to str1.
    cat_str(str1,str2)         Concatenates str2 to the end of str1. Puts result in str1.
    read_str(str)              Reads str as input from the keyboard.
    write_str(str)             Writes contents of str to the screen.
    fread_str(fd,str,lgth)     Reads a str with length lgth from file fd.
    fwrite_str(fd,str)         Writes contents of str to file fd.
    trim_str(str)              Trims trailing blanks from str. Returns length of str.
    ucase(str1,str2)           Converts str1 to uppercase, storing result in str2.
    makekey(last,first,key)    Combines last and first into key in canonical form,
                               storing result in key.
    min(int1,int2)             Returns the minimum of two integers.
    cmp_str(str1,str2)         Compares str1 to str2:
                               If str1 = str2, cmp_str returns 0.
                               If str1 < str2, returns a negative number.
                               If str1 > str2, returns a positive number.

FUNCTION len_str (str: strng): integer;
{ len_str() returns the length of str }
BEGIN
    len_str := ORD(str[0])
END;

PROCEDURE clear_str (VAR str: strng);
{ A procedure that clears str by setting its length to 0 }
BEGIN
    str[0] := CHR(0)
END;

PROCEDURE copy_str (VAR str1: strng; str2: strng);
{ A procedure to copy str2 into str1 }
VAR
    i : integer;
BEGIN
    for i := 1 to len_str(str2) DO
        str1[i] := str2[i];
    str1[0] := str2[0]
END;

PROCEDURE cat_str (VAR str1: strng; str2: strng);
{ cat_str() concatenates str2 to the end of str1 and stores
  the result in str1 }
VAR
    i : integer;
BEGIN
    for i := 1 to len_str(str2) DO
        str1[(len_str(str1)) + i] := str2[i];
    str1[0] := CHR(len_str(str1) + len_str(str2))
END;

PROCEDURE read_str (VAR str: strng);
{ A procedure that reads str as input from the keyboard }
VAR
    lgth : integer;
BEGIN
    lgth := 0;
    while (not EOLN) and (lgth <= MAX_REC_SIZE) DO
    BEGIN
        lgth := lgth + 1;
        read(str[lgth])
    END;
    readln;
    str[0] := CHR(lgth)
END;

PROCEDURE write_str (VAR str: strng);
{ write_str() writes str to the screen }
VAR
    i : integer;
BEGIN
    for i := 1 to len_str(str) DO
        write(str[i]);
    writeln
END;

PROCEDURE fread_str (VAR fd: text; VAR str: strng; lgth: integer);
{ fread_str() reads a str with length lgth from fd }
VAR
    i : integer;
BEGIN
    for i := 1 to lgth DO
        read(fd, str[i]);
    str[0] := CHR(lgth)
END;

PROCEDURE fwrite_str (VAR fd: text; str: strng);
{ fwrite_str() writes str to file fd }
VAR
    i : integer;
BEGIN
    for i := 1 to len_str(str) DO
        write(fd, str[i])
END;

FUNCTION trim_str (VAR str: strng): integer;
{ trim_str() trims the blanks off the end of str and
  returns its new length }
VAR
    lgth : integer;
BEGIN
    lgth := len_str(str);
    while str[lgth] = ' ' DO
        lgth := lgth - 1;
    str[0] := CHR(lgth);
    trim_str := lgth
END;

PROCEDURE ucase (str1: strng; VAR str2: strng);
{ ucase() converts str1 to uppercase letters and stores the
  capitalized string in str2 }
VAR
    i : integer;
BEGIN
    for i := 1 to len_str(str1) DO
    BEGIN
        if (ORD(str1[i]) >= ORD('a')) AND (ORD(str1[i]) <= ORD('z')) then
            str2[i] := CHR(ORD(str1[i]) - 32)
        else
            str2[i] := str1[i]
    END;
    str2[0] := str1[0]
END;

PROCEDURE makekey (last: strng; first: strng; VAR key: strng);
{ makekey() trims the blanks off the ends of the strngs last and
  first, concatenates last and first together with a space
  separating them, and converts the letters to uppercase }
VAR
    lenl      : integer;
    lenf      : integer;
    blank_str : strng;
BEGIN
    lenl := trim_str(last);
    copy_str(key, last);
    blank_str[0] := CHR(1);
    blank_str[1] := ' ';
    cat_str(key, blank_str);
    lenf := trim_str(first);
    cat_str(key, first);
    ucase(key, key)
END;

FUNCTION min (int1, int2: integer): integer;
{ min() returns the minimum of two integers }
BEGIN
    if int1 <= int2 then
        min := int1
    else
        min := int2
END;

FUNCTION cmp_str (str1: strng; str2: strng): integer;
{ A function that compares str1 to str2.  If str1 = str2, then
  cmp_str returns 0.  If str1 < str2, then cmp_str returns a
  negative number.  Or if str1 > str2, then cmp_str returns a
  positive number. }
VAR
    i      : integer;
    length : integer;
BEGIN
    if len_str(str1) = len_str(str2) then
    BEGIN
        i := 1;
        while (str1[i] = str2[i]) and (i < len_str(str1)) DO
            i := i + 1;
        if str1[i] = str2[i] then
            cmp_str := 0
        else
            cmp_str := (ORD(str1[i]) - ORD(str2[i]))
    END
    else BEGIN
        length := min(len_str(str1), len_str(str2));
        i := 1;
        while (str1[i] = str2[i]) and (i <= length) DO
            i := i + 1;
        if i > length then
            cmp_str := len_str(str1) - len_str(str2)
        else
            cmp_str := (ORD(str1[i]) - ORD(str2[i]))
    END
END;

Appendix D
Comparing Disk Drives

There are enormous differences among different types of drives in terms of the amount of data they hold, the time it takes them to access data, overall cost, cost per bit, and intelligence. Furthermore, disk devices and media are evolving so rapidly that the figures on speed, capacity, and intelligence that apply one month may very well be out of date the next month.

Access time, you will recall, is composed of seek time, rotational delay, and transfer time.

Seek times are usually described in two ways: minimum seek time and average seek time. Usually, but not always, minimum seek time includes the time it takes for the head to accelerate from a standstill, move one track, and settle to a stop. Sometimes the track-to-track seek time is given, with a separate figure for head settling time. One has to be careful with figures such as these since their meanings are not always stated clearly. Average seek time is the average time it takes for a seek if the desired sector is as likely to be on any one cylinder as it is on any other. In a completely random accessing environment, it can be shown that the number of cylinders covered in an average seek is approximately one-third of the total number of cylinders (Pechura and Schoeffler, 1983). Estimates of average seek time are commonly based on this result.

Certain disk drives, called fixed head disk drives, require no seek time. Fixed head drives provide one or more read/write heads per track, so there is no need to move the heads from track to track. Fixed head disk drives are very fast, but also considerably more expensive than movable head drives.

There are generally no significant differences in rotational delay among similar drives. Most floppy disk drives rotate between 300 and 600 rpm. Hard disk drives generally rotate at approximately 3600 rpm, though this will increase as disks decrease in physical size. There is at least one drive that rotates at 5400 rpm, and speeds of 7200 rpm are possible. Floppy disks usually do not spin continuously, so intermittent accessing of floppy drives might involve an extra delay due to startup of a second or more. Strategies such as sector interleaving can mitigate the effects of rotational delay in some circumstances.
The volume of data to be transferred from a single drive has increased enormously in recent years, thereby focusing much attention on data transfer rate. Data transfer rate is constrained by rotation speed, recording density on the disk itself, and the speed at which the controller can pass data through to or from RAM. Since rotation speeds vary little, the main differences among drives are due to differences in recording density. In recent years there have been tremendous advances in improving recording densities on disks of all types. Differences in recording densities are usually expressed in terms of the number of tracks per surface, and the number of bytes per track. If data are organized by sector on a disk, and more than one sector is transferred at a time, the effective data transfer rate depends also on the method of sector interleaving used. The effect of interleaving can be substantial, of course, since logically adjacent sectors are often widely separated physically.

A different approach to increasing data transfer rate is to access data from different places simultaneously. A technology called PTD (parallel transfer disk) reads and writes data simultaneously from multiple read/write heads. The Seagate Sable PTD reaches a transfer rate of over 20 Mbytes per second using eight read/write heads.

Another promising technology for achieving high transfer rates is RAID (redundant arrays of inexpensive disks), in which a collection of small inexpensive disks function as one. RAIDs allow the use of several separate I/O controllers operating in parallel. These parallel accesses can be coordinated to satisfy a single logical I/O request, or can service several independent I/O requests simultaneously.

Although it is very possible that most of the figures in Table D.1 will be superseded during the time between the writing and the publication of this text, they should give you a basic idea of the magnitude and range of performance characteristics for disks. The fact that they are changing so rapidly should also serve to emphasize the importance of being aware of disk drive performance characteristics when you are in a position to choose among different drives.

Of course, in addition to the quantitative differences among drives, there are other important differences. The IBM 3380 drive, for example, has many built-in features, including separate actuator arms that allow it to perform two accesses simultaneously. It also has large local buffers and a great deal of local intelligence, enabling it to optimize many operations that, with less sophisticated drives, have to be monitored by the central computer.

TABLE D.1 Comparison of disk drive performance characteristics.

Bibliography

AT&T.

System

V Interface

Definition. Indianapolis, IN:

AT&T,

1986.

Computer Algorithms: Introduction to Design and Analysis. Reading, Mass.:


Addison-Wesley, 1978.
+
Batory, D.S. "B trees and indexed sequential files: A performance comparison."
SIGMOD (1981): 30-39.
Bayer, R., and E. McCreight. "Organization and maintenance of large ordered
Baase,

S.

ACM

indexes." Acta Informatica

1,

no. 3 (1972): 173-189.

Bayer, R., and K. Unterauer. "Prefix B-trees."


Systems
Bentley,

J.

2,

no.

(March

ACM

Transactions on Database

1977): 11-26.

"Programming pearls: A spelling checker." Communications of the


no. 5 (May 1985): 456-462.

ACM 28,
Bohl,

M.

Introduction to

IBM

Direct Access Storage Devices. Chicago: Science Re-

search Associates, Inc., 1981.

Borland. Turbo Toolbox Reference Manual. Scott's Valley, Calif: Borland International, Inc., 1984.

Bourne, S.R. The Unix System. Reading, Mass.: Addison-Wesley. 1984.


Bradley, J. File and Data Base Techniques. New York: Holt, Rinehart, and Winston, 1982.

Chaney, R., and B. Johnson. "Maximizing hard-disk performance." Byte 9, no.


5 (May 1984): 307-334.
Chang, C.C. "The study of an ordered minimal perfect hashing scheme." Communications of the
27, no. 4 (April 1984): 384-387.
Chang, H. "A Study of Dynamic Hashing and Dynamic Hashing with Deferred
Splitting." Unpublished Master's thesis, Oklahoma State University, De-

ACM

cember 1985.
"Minimal

Chichelli, R.J.
the

ACM 23,

no.

perfect hash functions

made simple." Communications

of

(January 1980): 17-19.

Comer, D. "The ubiquitous B-tree."

ACM Computing

Surveys 11, no. 2 (June

1979): 121-137.

Cooper, D. Standard Pascal User Reference Manual.

New

York:

W.W. Norton &

Co., 1983.

575

576

BIBLIOGRAPHY

Crotzer, A.D. "Efficacy of B-trees in an information storage and retrieval envi-

ronment." Unpublished Master's

thesis,

Oklahoma

State University, 1975.

Davis, W.S. "Empirical behavior of B-trees." Unpublished Master's thesis,

Oklahoma

State University, 1974.

H. An Introduction to Operating Systems. Revised 1st Ed. Reading, Mass.:


Addison-Wesley, 1984.
Digital. Introduction to VAX-11 Record Management Services. Order No. AADeitel,

D024A-TE.

Equipment Corporation, 1978.


Equipment Corporation, 1981.
RMS-II User's Guide. Digital Equipment Corporation, 1979.
VAX-U SORT/MERGE User's Guide. Digital Equipment Corporation,
Digital

Digital. Peripherals Handbook. Digital


Digital.

Digital.

1984.

VAX Software Handbook. Digital Equipment Corporation, 1982.


Dodds, DJ. "Pracnique: Reducing dictionary size by using a hashing technique."
Digital.

Communications of the

ACM 25,

Dwyer, B. "One more time


the

ACM 24,

no.

how

no. 6 (June 1982): 368-370.


to update a master file." Communications of

(January 1981): 3-8.

ACM

Computing
and H.C. Du. "Dynamic Hashing Schemes."
Qune 1988): 85-113.
Fagin, R., J. Nievergelt, N. Pippenger, and H.R. Strong. "Extendible hashTransactions on Database
ing
a fast access method for dynamic files."
Systems 4, no. 3 (September 1979): 315-344.
Computing Surveys 17, no. 1
Faloutsos, C. "Access methods for text."
(March 1985): 49-74.
Flajolet, P. "On the Performance Evaluation of Extendible Hashing and Trie
Searching." Acta Informatica 20 (1983): 345-369.

Enbody,

R.J.,

Surveys 20, no. 2

ACM

ACM

Flores,

I.

Peripheral Devices.

Englewood

Cliffs, N.J.: Prentice-Hall, 1973.

Gonnet, G.H. Handbook of Algorithms and Data Structures. Reading, Mass.: Addison-Wesley, 1984.
Hanson, O. Design of Computer Data Files. Rockville, Md.: Computer Science
Press, 1982.

Held, G., and

ACM 21,

M.

Stonebraker. "B-trees reexamined." Communications of the

no. 2 (February 1978): 139-143.

Hoare, C.A.R. "The emperor's old clothes." The C.A.R. Turing


dress. Communications of the

ACM 24,

Award

ad-

no. 2 (February 1981): 75-83.

IBM. DFSORT General Information. IBM Order No. GC33-4033-11.


IBM. OS/VS Virtual Storage Access Method (VSAM) Planning Guide. IBM Order
No. GC26-3799.
Jensen, K., and N. Wirth. Pascal User Manual and Report, 2d Ed. Springer Verlag,
1974.

Keehn, D.G., andJ.O. Lacy.

"VSAM

data set design parameters."

IBM

Systems

fournal 13, no. 3 (1974): 186-212.

Kernighan, B., and R. Pike. The

UNIX Programming

Environment.

Englewood

Cliffs, N.J.: Prentice-Hall, 1984.

Kernighan, B., and D. Ritchie. The


N.J.: Prentice-Hall, 1978.

Programming Language. Englewood

Cliffs,

577

BIBLIOGRAPHY

Kernighan, B., and D. Ritchie. The

wood

Programming Language, 2nd Ed. Engle-

Cliffs, N.J.: Prentice-Hall, 1988.

Knuth, D. The Art of Computer Programming. Vol.


Ed. Reading, Mass.: Addison-Wesley, 1973a.

1,

Fundamental Algorithms. 2d

Knuth, D. The Art of Computer Programming. Vol. 3, Searching and Sorting. Reading, Mass.: Addison-Wesley, 1973b.
Lang, S.D., J.R. Dnscoll, andJ.H. Jou. "Batch insertion for tree structured file

improving differential database representation." C.^-TR-85,


Department of Computer Science, University of Central Florida, Orlando,

organizations

Flor.

Lapin, J.E. Portable

and

UNIX

System Programming. Englewood Cliffs, N.J.:

Prentice-Hall, 1987.

"Dynamic Hashing." BIT

Larson, P.

18 (1978): 184-201.

ACM

Larson, P. "Linear Hashing with Overflow-handling by Linear Probing."


Transactions on Database Systems 10, no.

(March

75-89.

1985):

Larson, P. "Linear Hashing with Partial Expansions." Proceedings of the 6th Conference on Very Large Databases. (Montreal, Canada Oct 1-3, 1980) New

ACM/IEEE: 224-233.

York:

Larson, P. "Performance Analysis of Linear Hashing with Partial Expansions."

ACM
Laub, L.
S.

Transactions on Database Systems 7, no. 4

"What

is

CD-ROM?"

Ropiequet, eds.

(December

1982):

566-587.

CD-ROM: The New Papyrus. S. Lambert


Redmond, WA: Microsoft Press, 1986: 47-71.
In

and

M.K. McKusick, M. Karels, andJ.S. Quarterman. The Design and


4.3BSD UNIX Operating System. Reading, Mass.: Addi-

Leffler, S.,

Implementation of the

son-Wesley, 1989.
Levy, M.R. "Modularity and the sequential

file

update problem." Communications

ACM

25, no. 6 (June 1982): 362-367.


of the
Litwin, W. "Linear Hashing: A New Tool for File and Table Addressing." Proceedings of the 6th Conference on Very Large Databases (Montreal,

1-3, 1980)
Litwin,

W.

New

York:

"Virtual Hashing:

the 4th Conference on

Canada, Oct

ACM/IEEE: 212-223.

Dynamically Changing Hashing.

Very Large Databases (Berlin 1978)

New

"'

Proceedings oj

York:

ACM/

IEEE: 517-523.
Loomis, M. Data Management and

File Processing.

Englewood

Cliffs, N.J.:

Pren-

tice-Hall, 1983.

Lorin, H. Sorting and Sort Systems. Reading, Mass.: Addison-Wesley, 1975.

Lum, V.Y.,

P.S. Yuen, and M. Dodd. "Key-to-Address Transform Techniques,


Fundamental Performance Study on Large Existing Formatted Files."
Communications of the
14, no. 4 (April 1971): 228-39.

ACM

Lynch, T. Data Compression Techniques and Applications.


trand Reinhold

Madnick, S.E., and

Company,
J.J.

Inc.,

Donovan.

New

York: Van Nos-

1985.

Operatifig Systems.

Englewood

Cliffs. N.J.:

Prentice-Hall, 1974.

Maurer, W.D., and T.G. Lewis. "Hash table methods."


7, no. 1 (March 1975): 5-19.

ACM Computing

Surveys

578

BIBLIOGRAPHY

McCrcight, E. "Pagination of
cations

of the

ACM 20,

McKusick, M.K.,

W.M.

ACM

UNIX."

B*

trees

with variable length records." Communi-

no. 9 (September 1977): 670-674.

Joy,

S.J. Leffler,

Transactions on

and R.S. Fabry.

Computer Systems

2,

"A

fast file

system for

no. 3 (August 1984):

181-197.
Mendelson, H. "Analysis of Extendible Hashing." IEEE Transactions on Software
Engineering 8, no. 6 (November 1982): 611-619.
Microsoft, Inc. Disk Operating System. Version 2.00.

Language Series. IBM, 1983.


Morgan, R., and H. McGilton. Introducing

UNIX

IBM

Personal

System V.

New

Computer
York:

Mc-

Graw-Hill, 1987.

Murayama,

K., and S.E. Smith. "Analysis of design alternatives for virtual

memory

indexes." Communications of the

ACM 20,

no. 4 (April 1977):

245-254.
Nievergelt,

J.,

H. Hinterberger, and K. Sevcik. "The grid

metric, multikey
1

file

structure."

ACM

file:

an adaptive sym-

Transactions on Database Systems 9, no.

(March 1984): 38-71.

Ouskel, M., and P. Scheuermann. "Multidimensional B-trees: Analysis of dynamic behavior." BIT 21 (1981):401-418.
Pechura, M.A., and J.D. Schoeffler. "Estimating file access of floppy disks."
Communications of the
26, no. 10 (October 1983): 754-763.

ACM

Peterson, J.L., and A. Silberschatz. Operating System Concepts, 2nd Ed. Reading,

Mass.: Addison- Wesley, 1985.


Peterson,

W.W.

"Addressing for random access storage."

and Development

1,

IBM fournal

of Research

no. 2(1957):130-146.

and T. Sterling. A Guide to Structured Programming and PE/I. 3rd Ed.


York: Holt, Rinehart, and Winston, 1980.
Ritchie, B., and K. Thompson. "The UNIX time-sharing system." Communications of the
17, no. 7 (July 1974): 365-375.
Pollack, S.,

New

ACM

Ritchie,

D. The Unix I/O System. Murray

Hill, N.J.:

AT&T

Bell Laboratories,

1979.

Robinson, J.T. "The K-d B-tree:

dynamic indexes."

search structure for large multidimensional

ACM SIGMOD

1981 International Conference on Manage-

ment of Data. April 29-May 1, 1981.


Rosenberg, A.L., and L. Snyder. "Time and space optimality in B-trees."

ACM

(March 1981): 174-183.


Sager, T.J. "A polynomial time generator for minimal perfect hash functions."
Communications of the
28, no. 5 (May 1985): 523-532.
Salton, G., and M. McGill. Introduction to Modern Information Retrieval. McGrawTransactions on Database Systems 6, no.

ACM

Hill, 1983.

Salzberg, B. File Structures.


Salzberg, B., et

al.

Englewood

"FastSort:

Cliffs, N.J.: Prentice-Hall, 1988.

Distributed Single-Input, Single-Output Sort."

ACM SIGMOD International Conference on Management


SIGMOD RECORD, Vol. 19, Issue 2, (June 1990): 94-101.

Proceedings of the 1990

of Data,
Scholl,

M. "New

tions

file

organizations based on dynamic hashing."

on Database Systems

6,

no.

(March

1981): 194-211.

ACM

Transac-

579

BIBLIOGRAPHY

Severance, D.G. "Identifier search mechanisms:

ACM Computing Surveys 6,

model."

"On

Snyder, L.

survey and generalized

no. 3 (September 1974): 175-194.

B-trees reexamined." Communications of the

ACM 21,

no. 7 (July

1978): 594.
J. P. Tremblay and R.F. Deutscher. "Key-to-Address Transformation Techniques." INFOR (Canada) Vol. 16, no. 1 (1978): 397-409.
Spector, A., and D. Gifford. "Case study: The space shuttle primary computer
system." Communications of the
27, no. 9 (September 1984): 872-900.
Standish, T.A. Data Structure Techniques. Reading, Mass.: Addison-Wcsley, 1980.
Sun Microsystems. Networking on the Sun Workstation. Mountain View, CA: Sun
Microsystems, Inc., 1986.
Sussenguth, E.H. "The use of tree structures for processing files." Communications of the
6, no. 5 (May 1963): 272-279.

Sorenson, P.G.,

ACM

ACM

"Keyfield design." Datamation (October

Sweet,

F.

Teory,

T.J.,

and

1,

1985): 119-120.

Fry. Design of Database Structures.

J. P.

Englewood

Cliffs, N.J.:

Prentice-Hall, 1982.

The Joint ANSI/IEEE

Pascal Standards Committee. "Pascal:

SIGPLAN Notices

Forward

to the can-

28-44.
and P.G. Sorenson. An Introduction to Data Structures with Applications. New York: McGraw-Hill, 1984.
Ullman, J. Principles of Database Systems, 2d Ed. Rockville, Md.: Computer Scididate extension library."

Tremblay,

19, no. 7 (July 1984):

J. P.,

ence Press, 1980.

Ullman, J.D.

Principles of Database Systems,

3d Ed. Rockville, Md.: Computer

Science Press, 1986.

U.C. Berkeley.

UNIX Programmer's

Reference Manual. University

of California

Berkeley, 1986.

VanDoren,
the

"Some

J.

NSF-CBMS

empirical results on generalized

zation and Retrieval. University

VanDoren,

J.,

and

AVL

trees." Proceedings of

Regional Research Conference on Automatic Information Organi-

J.

Gray.

In Information Systems,

"An

of Missouri

at

Columbia

(July 1973):

algorithm for maintaining dynamic

COINS

IV,

New

46-62.

AVL

trees."

York: Plenum Press, 1974:

161-180.
Veklerov, E. "Analysis of Dynamic Hashing with Deferred Splitting."
Transactions on Database Systems 10, no. 1 (March 1985): 90-96.

ACM

Wagner, R.E. "Indexing design considerations." IBM Systems Journal 12, no. 4
(1973): 351-367.
Wang, P. An Introduction to Berkeley Unix. Belmont, CA: Wadsworth Publishing
Co., 1988.

Webster, R.E.
sity,

"B +

trees."

Unpublished Master's

thesis,

Oklahoma

State Univer-

1980.

Welch, T.

"A Technique

for

High Performance Data Compression." IEEE Com-

puter, Vol. 17, no. 6 (June 1984):

Wells, D.C.,

8-19.

E.W. Greisen and R.H. Harten. "FITS:

Flexible

Image Transport

System." Astronomy and Astrophysics Supplement Series, no. 44 (1981):


363-370.
Wiederhold, G. Database Design, 2d Ed. New York: McGraw-Hill, 1983.

580

BIBLIOGRAPHY

Wirth, N.

"An

assessment of the programming language Pascal."

IEEE

Transac-

Sofiware Engineering SE-1, no. 2 (June 1975).

tions on

Yao, A. Chi-Chih.
159-170.
Zocllick, B.

"On random 2-3

"CD-ROM

trees." Acta Informatica 9, no. 2 (1978):

software development." Byte

11, no. 5

(May

1986):

173-188.

System Support for CD-ROM." In


Lambert and S. Ropiequet, eds. Redmond,

Zoellick, B. "File
rus. S.

CD-ROM: The New PapyWA: Microsoft Press,

1986: 103-128.

Zoellick, B. "Selecting an

ume

Approach

2: Optical Publishing. S.

1987: 63-82.

to

Document Retrieval." In CD-ROM, VolRedmond, WA: Microsoft Press,

Ropiequet, ed.

Index

Abstract data models

explanation

of,

FITS image

as

128
Access. See

124-125, 132

example

Random

Record

of,

access;

access

variable order, 422-425, 437

trees

explanation

B*

trees,

B+

trees

447, 493

overflow
sector, 46

LRU replacement, 376


simple prefix, 429-430. See
Simple prefix

use of,

114, 115

352-362

Assign statement, 9
list

explanation of, 193, 217


of fixed-length records, 193
195

of variable-length records,

196-198
Average search length
of,

492

number of collisions

and, 476
progressive overflow and,

469-471
record turnover and, 482

433

trees vs.,

and,
of,

553-555
347

deletion, redistribution,

and

concatenation in, 366-372


depth of, 364-366
explanation

558

explanation

trees

B-trees

construction

versions of, 137

Avail

431-433

insertion,

and hex values, 107-109

Baver, R., 334-335, 337, 347,


348, 363, 371-372, 431
Berkeley UNIX, compression

of,

tor indexes, 234


and information placement,

377-379
invention

of,

334-336

383
of order m, 364, 382
order of, 362-364, 364, 382.
383
page structure used by, 253.
352
splitting and promoting, 347leaf of,

189

fit

of, 217
placement strategies, 202
Better-than-random, 492-493
Binary encoding, 137-138
Binary search
INDEX (pages 581-590)

Principal entries include: Access mode; Addresses; AVL trees; B-trees; B+ trees; Balanced merge; Binary search; Binary search trees; Binding; Block addressing; Blocking factor; Blocks; Buckets; Buffering; C programs; CD-ROM; Collisions; Cosequential processing; Data compression; Direct access; Disk drives; Disks; Extendible hashing; Fields; File access; File structures; Fixed-length records; Fragmentation; Hashing; Heapsort; Indexed sequential access; Indexes; K-way merge; Keys; Keysort; Linear hashing; Magnetic tape; Merge operation; Multistep merge; Packing density; Pascal programs; Pinned records; Poisson distribution; Portability; Progressive overflow; Record deletion; Records; Redistribution; Replacement selection; Secondary indexes; Seeks; Separators; Sequence set; Sequential search; Simple indexes; Simple prefix B+ trees; Sorting; Splitting; Tombstones; Tracks; Tries; UNIX; Variable-length records; Virtual B-trees.

Computer Science/File Structures

File Structures, Second Edition
Michael J. Folk, National Center for Supercomputing Applications
Bill Zoellick, Avalanche Development Company

This second edition of the leading file structures book currently on the market has been thoroughly revised and updated to instruct readers on the design of fast and flexible file structures. The new edition now includes timely coverage of file structures in a UNIX environment in addition to a new and substantial appendix on CD-ROM. Other modern file structures such as extendible hashing methods are also explored.

This book develops a framework for approaching the design of systems to store and retrieve information on magnetic disks and other mass storage devices. It provides a fundamental collection of tools that any user needs in order to design intelligent, cost-effective, and appropriate solutions to file structure problems.

Highlights
- Discusses a "toolkit" of approaches to retrieve file records: simple indexes, paged indexes (e.g., B-trees), variations on paged indexes (e.g., B+ trees, B* trees), and hashing
- Includes a new chapter on extendible hashing
- Uses pseudocode extensively, particularly where the procedures are complex and where it is important to avoid the distractions inherent in actual compilable code
- Emphasizes the building of conceptual tools for the design and retrieval of information from files
- Provides complete examples in both ANSI C and Turbo Pascal 6.0
- Introduces UNIX concepts and utilities that apply directly to file structures and file management

File Structures, Second Edition is an invaluable resource for computer science professionals using file and data structures in a UNIX environment. It will also be of interest to professionals interested in learning about the design of file structures and the retrieval of records. Students majoring in computer science will benefit from this book's emphasis on fundamental concepts and its inclusion of C and UNIX.

About the Authors

Michael J. Folk is currently a Senior Software Engineer at the National Center for Supercomputing Applications at the University of Illinois in Urbana. For the last three years he has been responsible for developing general purpose scientific data file formats. Prior to this, Dr. Folk was a Professor of Computer Science for fifteen years at Oklahoma State and Drake Universities.

Bill Zoellick is Vice President and Chief Scientist at the Avalanche Development Company in Boulder, Colorado, a leading producer of text conversion software. Previously, he was the Director of Technology for the Alexandria Institute, a nonprofit organization working to resolve the problems associated with electronic publishing. He is a frequent lecturer and writer on CD-ROM issues.


Addison-Wesley Publishing Company

ISBN 0-201-55713-4
