Data Structures

Chapter 1

BASIC MECHANISMS OF DYNAMIC MEMORY

1.1 Introduction

Hitherto, collections of data of the same type have been stored in arrays.
For instance, words have been stored as arrays of characters. The difficulty
that this kind of representation introduces is that some upper bound has to
be estimated in order to set the size of the array. In some circumstances, such
upper bounds are easy to estimate, but there are applications where the
collections to store can vary considerably in size from one execution of a program
to another. In these cases, the overestimation of the array sizes can lead to an
unnecessary waste of memory space.
Dynamic memory is a mechanism that allows us to construct data structures
that vary in size during the execution of the program. The essential principle of
dynamic memory is that we can request pieces of memory to allocate variables
at run time, and we get a pointer to the allocated memory space. A pointer is
a piece of information that is equivalent to the address of the reserved block. Using
such an address we can either store values in or retrieve values from that space.
Usually, these dynamically allocated blocks of memory also contain pointers
to other blocks, and the set is then called a linked structure.
Indeed, programming with such pointers is very error prone. Algorithms
acting on pointers are harder to understand and prove correct than algorithms
having no pointers and, consequently, it is very easy to make serious mistakes
when they are used. Therefore, the construction of algorithms that make use of
pointers requires a lot of care.
The main goal in the design of Ada has been to provide the highest possible
reliability to software. In particular, its strong type system is a powerful tool
that allows the compiler to detect a wide range of inconsistencies, especially if
a proper programming style is used to take benefit from that system. Pointer
types are not an exception: even though the information they contain is more
or less equivalent to a memory address, their use is restricted to prevent the
programmer from introducing inconsistencies.
This chapter, and therefore this book, does not introduce all the ways in
which pointers can be used in Ada, but only the features that are required to
develop the data structures addressed in the following chapters. It is likely
that the reader will need further pointer features in the future.
For instance, the direct interaction with most operating systems often requires
providing the address of some variable of the user program. Nevertheless, we
highly recommend the reader to start by mastering the features presented here. If
ever in the future he needs additional features, he can learn them either
from the language reference manual [16], which is usually included in electronic
form with the compiler, or from a more comprehensive description such as [2].

1.2 Access types

In Ada, pointers are called access types because they provide access to other
objects. In this book we will use the two terms interchangeably.
Basic operations on access types

An access type is always bound to some data type. Access types are declared as
follows:

type T is ...;
type p_T is access T;
p: p_T;

These declarations state that p is a variable of type p_T, which means that it is
able to contain the location in memory where some datum of type T is stored.
The allocation of space to store a datum of type T is achieved by the operation

p:= new T;

This operation reserves memory space to allocate a datum of type T and assigns the
address of this place (or some equivalent information) to the variable p. The
expression p.all denotes the block of memory addressed by p and can be
used as any variable of type T. Therefore, after the declarations

type T is ...;
type p_T is access T;
p, q: p_T;
x: T;

the following assignments are allowed:

p:= new T;
p.all:= x;
q:= p;

The second assignment copies the value of x to the block pointed to by p. The
third assignment copies the address contained in p to q. As a result, both p
and q point to the same block of memory.
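Since both pointers now designate the same block, a change made through one
of them is visible through the other. The following small sketch illustrates
this aliasing with an access type to integer, declared here only for the
example:

declare
   type p_integer is access integer;
   p, q: p_integer;
begin
   p:= new integer;
   p.all:= 5;
   q:= p;       -- p and q designate the same block
   q.all:= 7;   -- the update is made through q...
   -- ...but p.all is now 7 as well, because only one block exists
end;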
A variable of an access type can also contain a special value, called null,
meaning that it points nowhere. This special value can be assigned as in

p:= null;


If an access variable p is equal to null, any attempt to refer to p.all is nonsense
and will raise the exception constraint_error, exactly as when we refer to an
array component with an index that is out of bounds.
Besides the test for equality, which also applies to access types, these are the
only permitted operations on pointers. Any attempt to add pointers, or
to add an integer value to an access type, will be considered a type-inconsistent
expression and will be refused by the compiler.
Strict typing of access types

Pointers addressing different types of data are incompatible. For instance, after
the declarations

type T1 is ...;
type T2 is ...;
type p_T1 is access T1;
type p_T2 is access T2;
p1, q1: p_T1;
p2, q2: p_T2;
x1: T1;
x2: T2;

the following assignments are type inconsistent and will be refused at compile time:

p1:= new T2;   -- Wrong!! It causes a compile-time error.
p1.all:= x2;   -- Wrong!! It causes a compile-time error.
x1:= p2.all;   -- Wrong!! It causes a compile-time error.
p1:= p2;       -- Wrong!! It causes a compile-time error.

Instead, the null value can be assigned to any variable of any access type¹, as
in

p1:= null;
p2:= null;

Access types and record fields

In most applications, access types refer to blocks of some record type. For
instance:

type T is
   record
      a: Ta;
      b: Tb;
   end record;
type pT is access T;
p, q: pT;
x: T;

¹ NER: Actually, Ada allows declaring access types for which the null value is forbidden.
However, we will not make use of this feature in the scope of this book.


As the block pointed to by an access variable such as p is denoted by
p.all, its components can be denoted naturally as p.all.a or p.all.b, as in

p:= new T;
q:= new T;
p.all.a:= x.a;
q.all.a:= p.all.a;
p.all:= x;

However, this notation is somewhat cumbersome unless the algorithms are very
short, and it can be abbreviated by omitting the qualifier all. Consequently, the
following assignments have the same meaning as those above:

p:= new T;
q:= new T;
p.a:= x.a;
q.a:= p.a;
p.all:= x;

In the last assignment, the qualifier all cannot be omitted because p and x are
of different types and x cannot be assigned to p. It is required to specify that
the destination is not p but the object addressed by p. When a record field
appears, there is no possible confusion.
Forward declarations

As we shall see soon, we shall quite often require that some of the fields of the
records allocated in dynamic memory are also access types, pointing to other
blocks of the same type. If so, a circular dependency appears in the declarations,
which must be solved by a forward declaration, as in

type cell;
type pcell is access cell;
type cell is
   record
      a: integer;
      s: pcell;
   end record;

The first declaration tells the compiler that cell is the name of a type. This
information is sufficient to check that the second declaration is consistent and
to process it. In particular, the size required for variables of type pcell is known.
Finally, the third declaration provides the full details of type cell and can be
properly processed because all the details for pcell are known.
The reader should note the strong similarity of such forward declarations
with the forward declarations required in the case of mutually recursive subprograms, which appeared in section 3.6.5 of vol. III.
Linked structures

The fact that blocks in dynamic memory can reference other blocks allows us
to build structures such as the very simple one represented by the following figure:

[figure: p points to a cell holding 5, whose link points to a cell holding 7, whose link is null]

This structure can be obtained by means of the following assignments, provided that p
and q are variables of type pcell:

q:= new cell; q.a:= 7; q.s:= null;
p:= new cell; p.a:= 5; p.s:= q;

Such structures are called linked, and we shall make systematic
use of them throughout the remaining chapters of this volume. The reader is
expected to learn there both their usefulness and the ability to build algorithms
that manage them.
Garbage collection

Blocks that have been allocated in dynamic memory can become inaccessible
if no access variables point to them any longer. For instance, the blocks
in the linked structure above become inaccessible if null is assigned both to p
and q. We shall assume that the run-time environment in charge of dynamic
memory management is able to detect when a block is no longer accessible and
to bring it back to the pool of available memory. The algorithms to detect
inaccessible blocks and to keep the space of available memory as compact as
possible (i.e. not fragmented) are complex and are studied in chapter 8.
Value assignment on allocation

The allocation operator new can include the value to be assigned to the newly
created block. This value must be provided as a qualified expression:

p:= new cell'(7, null);
p:= new cell'(5, p);

which creates the same linked structure as above.
Indeed, record values can be built using the named forms too, as in

p:= new cell'(a => 7, s => null);
p:= new cell'(a => 5, s => p);

which is less error prone in the case that there is more than one field of the
same type.
Access types and unconstrained types

Access types may also point to cells of an unconstrained type. If so, the constraint must be set on allocation, exactly in the same way as it must be set to
declare a variable (see sections 3.2, 3.6 and 5.5 of vol. II). For instance, after
the following declarations²

type p_str is access string;
p, q: p_str;

² The type string is declared in the package standard as an unconstrained array of characters
(see subsection 3.2.2 of vol. II).


the allocation of a block can be done in either of these two forms:

p:= new string(1..10);      -- No value is assigned to the allocated block.
q:= new string'("London");  -- The value "London" is assigned to the
                            -- allocated block.

Indeed, the pointers can be assigned to each other, but the attempt to assign
the pointed blocks to each other will raise a constraint_error exception unless
the lengths of the two strings are exactly the same:

p.all:= q.all;  -- Will raise constraint_error because "London" is not
                -- of length 10.
p:= q;          -- Causes no problem.
Access types pointing to discriminated records work likewise. Let us consider
the following declarations:

type cell;
type pcell is access cell;
type cell_type is (atom, unary, binary);
type unary_op is (minus, sin, cos, log, exp);
type binary_op is (add, sub, prod, quot);
type cell(t: cell_type) is
   record
      case t is
         when atom =>
            x: integer;
         when unary =>
            u_op: unary_op;
            opnd: pcell;
         when binary =>
            b_op: binary_op;
            left_opnd:  pcell;
            right_opnd: pcell;
      end case;
   end record;

p, q: pcell;
The allocation of blocks to keep values of type cell must necessarily provide the discriminant, as in

p:= new cell(unary);

However, any value of type pcell, no matter what the variant of the pointed
object is, can now be assigned to p.opnd, as in

q:= new cell( ... );
p.u_op:= minus; p.opnd:= q;
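For instance, a complete pair of cells representing the unary expression minus
applied to the atom 3 could be built as follows (a small sketch; the concrete
values are illustrative only):

q:= new cell'(t => atom, x => 3);  -- qualified expression: the discriminant
                                   -- is part of the provided value
p:= new cell(unary);               -- only the discriminant is fixed here
p.u_op:= minus;
p.opnd:= q;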


Default values

In Ada, any variable of an access type has the initial value null. Nevertheless,
the algorithms presented in this volume do not take advantage of this feature,
because this text is intended to serve as a handbook of algorithms that can
either be used directly in Ada or translated to other programming languages that
do not have this feature. Therefore, whenever it is required that an access
variable has the value null, it will always be explicitly assigned.
Running out of memory space

If the run-time system that manages the dynamic memory is unable to provide
the memory space requested by the allocator new, because no more space is
available, then the exception storage_error is raised.
Pointers to subprograms

For the sake of completeness, the reader should recall that access types may
also point to subprograms, as described in section 7.5 of vol. II and used in
sections 7.6 to 7.9 of the same volume.
Safety of the access type rules in Ada

As shown, Ada imposes restrictive rules upon the use of access types. Although
the underlying information they contain is essentially addresses in memory, the
only way these values can be obtained is by means of the allocation operator
new. These values can later be copied, but never modified. Moreover, a pointer
to blocks of one type can never be assigned an address corresponding to a block
of a different type.
These restrictive rules guarantee that bits in memory representing one type can never be misinterpreted as if they represented a different
type.
Any inconsistency is detected, most of them at compile time, which makes it
easy to fix the bug. A few kinds of inconsistencies, however, cannot be detected
until run time, in the form of the raising of the exception constraint_error. This
will happen after the attempt to get data referenced by a pointer having the
value null.
The other case is the attempt to assign blocks corresponding to different
sizes or discriminant values of unconstrained types, or to reference a field of a
record which is inconsistent with the value of the discriminant. However, these
kinds of inconsistencies are not specific to access types and they may happen with
conventional variables as well. For example, the following procedure, which uses
no access types, may raise the same exception in exactly the same circumstances:

procedure copy(s1: in string; s2: out string) is
begin
   s2:= s1;  -- May raise constraint_error if the sizes
             -- of the actual parameters are different.
end copy;


1.3 Cursors

Linked structures can also be built using indexes to arrays instead of access
types. For instance, after the following declarations

type index_cell is new integer range 0..100;
type cell is
   record
      a: integer;
      s: index_cell;
   end record;
type cell_pool is array(index_cell range 1..index_cell'last) of cell;
cp: cell_pool;
p: index_cell;
if p and cp have the following values:

p = 30;
cp(30) = (5, 20);
cp(20) = (7, 0);

we have a linked structure equivalent to the one that served as an example in the
description of access types, where the value 0 is used to represent the null
mark, as shown in the figure

[figure: p holds 30; cp(30) holds 5 and links to position 20; cp(20) holds 7 and links to 0, the null mark]

In this kind of implementation of linked structures, the links are called cursors instead of pointers. The reader should recall that a very simple use of a
linked structure using cursors was already introduced in the description of bin
sorting (see section 5.7 of vol. III).
When cursors are used, the memory space is an array, and the
algorithms have to keep track of the components that are in use and those
which are available for further use. Usually, this is achieved by linking the free
components together and indicating the beginning of that linked structure by an additional
index variable. Some care must be taken to remove a component from the pool
of free components each time a new one is required, as well as to reintroduce it into that
pool when the block becomes no longer accessible, because
with cursors these operations must be explicit.
Naturally, for each kind of block we require, a different array of memory
space shall be declared. A careful declaration of incompatible types allows the
compiler to detect most inconsistencies at compile time, making the development
faster.
As we shall see in chapter 2, the algorithms using access types and the algorithms using cursors are very much the same. The advantages and drawbacks
of each approach will be discussed there too.

Chapter 2

LINEAR STRUCTURES

2.1 Introduction

At this stage, the reader is already familiar with the use of stacks, which were often required to convert recursive algorithms into iterative ones (see sections 3.5
to 3.7 of vol. III), although stacks can be applied to other kinds of problems
as well. In section 6.2.5 of vol. III, an algebraic specification for stacks is provided. An algebraic specification for queues was also presented in section 6.2.6
of vol. III, although no algorithm from this book has made use of them hitherto.
Both stacks and queues are pools of data elements that have a common characteristic, namely that the order in which elements are retrieved and removed
depends exclusively on the order in which they are inserted. In the case of
stacks the criterion is last in, first out (usually indicated as LIFO), and in the
case of queues it is first in, first out (usually indicated as FIFO). Therefore, in both
cases the elements in the pool can be imagined as forming a linear chain where
the elements are inserted and retrieved at the extremes of the chain. In the
case of stacks, the extreme for retrieval is the same as the extreme for insertion,
whereas in the case of queues the extreme for retrieval is the opposite of the
extreme for insertion. Consequently, stacks and queues have been classified as
linear data structures.
In chapter 3 of vol. III, stacks were implemented by means of an array plus an
integer, which is the most straightforward way. In this chapter we will also consider
the possibility of implementing stacks by means of access types and cursors,
and we will discuss the advantages and drawbacks of the different choices. Three
ways of implementing queues will be presented and analyzed too.
In computer science, the term list usually designates a linearly linked structure (i.e. a structure where each component has at most one successor) in
which insertions and retrievals can be done at arbitrary positions of the chain.
Lists are therefore also linear structures, and the appropriate place to study them
is beside stacks and queues.
However, in contrast with stacks and queues, lists are not an abstract data
type but a technique that is extremely useful in the implementation of other
types. Any attempt to characterize the behavior of lists ignoring the implementation details (i.e. the links) actually describes the type multiset (i.e. a set where
more than one copy of the same element can be stored and retrieved independently).


But to implement other types, multisets do not have the advantages
derived from the flexibility that lists have. To sum up, queues and stacks will be
treated as abstract data types, and different implementations will be presented
and compared. Instead, lists will be treated as a technique that can be used in
many circumstances, as we shall see in later chapters of this volume.

2.2 Stacks

2.2.1 Specification

A stack is a data pool where the elements are retrieved in the inverse order
to that in which they have been stored. The algebraic specification of the data type stack has
been provided in section 6.2.5 of vol. III. From an algebraic point of view, any
operation is a function, which makes both the algebraic specification of the type
and the formal proof of the algorithms that use it reasonably simple. However,
as stated in section 6.2.1 of vol. III, in imperative languages such as Ada, obvious
efficiency considerations force us to express most of the operations as procedures
instead of functions. Henceforth, instead of invoking a function push as

s:= push(s, x);

we shall achieve the same effect by programming a procedure push with an in out
parameter of type stack:

push(s, x);

As stated in the algebraic specification, some operations may lead to an
error, such as attempting to get the top of an empty stack. Therefore, an
exception bad_use must be included in the specification and raised if an operation
is invoked with inappropriate operand values. Another exception, space_overflow,
shall be raised if no storage space is available to execute an operation.
Consequently, the specification part of the package defining the abstract data
type stack becomes:
generic
   type item is private;
package dstack is
   type stack is limited private;
   bad_use:        exception;
   space_overflow: exception;
   procedure empty   (s: out stack);
   procedure push    (s: in out stack; x: in item);
   procedure pop     (s: in out stack);
   function top      (s: in stack) return item;
   function is_empty (s: in stack) return boolean;
private
   -- Implementation dependent
end dstack;

2.2.2 Compact implementation

The most straightforward implementation of a stack is an array and an integer.
As the size of the array has to be established, it is most appropriate to provide it
as an additional formal argument of the generic declaration. Providing a default
value makes this argument optional, making the instantiation more flexible and
compatible with the declaration of section 2.2.1.
Consequently, the representation of the type is described by the following
complement of the package dstack:
generic
   type item is private;
   max: natural:= 100;
package dstack is
   -- Exactly as in section 2.2.1
private
   type index is new integer range 0..max;
   type mem_space is array(index range 1..index'last) of item;
   type stack is
      record
         a: mem_space;
         n: index;
      end record;
end dstack;
The operations for this implementation are rather obvious. Attempts
to access the array out of its range will raise the exception constraint_error.
According to the circumstances, such an exception has to be converted either
to bad_use or to space_overflow.
The resulting package body becomes:

package body dstack is

   procedure empty(s: out stack) is
      n: index renames s.n;
   begin
      n:= 0;
   end empty;

   procedure push(s: in out stack; x: in item) is
      a: mem_space renames s.a;
      n: index     renames s.n;
   begin
      n:= n+1; a(n):= x;
   exception
      when constraint_error => raise space_overflow;
   end push;


   procedure pop(s: in out stack) is
      n: index renames s.n;
   begin
      n:= n-1;
   exception
      when constraint_error => raise bad_use;
   end pop;

   function top(s: in stack) return item is
      a: mem_space renames s.a;
      n: index     renames s.n;
   begin
      return a(n);
   exception
      when constraint_error => raise bad_use;
   end top;

   function is_empty(s: in stack) return boolean is
      n: index renames s.n;
   begin
      return n=0;
   end is_empty;

end dstack;
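The following sketch shows how this package might be instantiated and used;
the procedure and instance names are hypothetical, introduced only for
illustration:

with dstack;
procedure stack_demo is
   package int_stack is new dstack(item => integer, max => 10);
   use int_stack;
   s: stack;
begin
   empty(s);
   push(s, 1);
   push(s, 2);
   -- top(s) = 2 here; after the pop, top(s) = 1 again
   pop(s);
exception
   when bad_use        => null; -- an operation was applied to an illegal value
   when space_overflow => null; -- the array is full
end stack_demo;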

2.2.3 Linked implementation: pointers

In many circumstances it is not easy to foresee a safe upper bound for the size of
a stack. Consequently, we might be interested in providing an implementation
that is only bounded by the memory space effectively available to the program.
This can be achieved by a linked structure such as:

[figure: top points to the first cell of a chain; each cell links to the next and the last one holds null]

where the element on top of the stack is the one pointed to by top. According to
this implementation, the type stack is represented just by this pointer and the
resulting private part becomes:
generic
   type item is private;
package dstack is
   -- Exactly as in section 2.2.1
private
   type cell;
   type pcell is access cell;
   type cell is
      record
         x:    item;
         next: pcell;
      end record;
   type stack is
      record
         top: pcell;
      end record;
end dstack;
The implementation of the operations is rather simple and an excellent introduction to the use of access types. The operations are presented here in increasing
order of difficulty.
An empty stack is represented by a chain containing no element, that is,
by the top pointer having the value null:

procedure empty(s: out stack) is
   top: pcell renames s.top;
begin
   top:= null;
end empty;

Checking whether the stack is empty or not is equivalent to checking whether
the pointer to the top of the stack is null. Otherwise, there is at least one
element in the stack:

function is_empty(s: in stack) return boolean is
   top: pcell renames s.top;
begin
   return top=null;
end is_empty;
Getting the item on top of the stack is as simple as getting the value stored
in the cell on top of the stack. If the value of the pointer to the top of the stack
is null, the attempt to dereference it will raise the exception constraint_error,
which has to be converted into the more explicit bad_use because it results
from an illegal operation:

function top(s: in stack) return item is
   top: pcell renames s.top;
begin
   return top.x;
exception
   when constraint_error => raise bad_use;
end top;
Let us now pay some attention to the operation push. It is achieved in
three steps. First, a new cell must be allocated. Second, the pushed item must
be copied into that cell. Third, the cell must be linked at the beginning of the
structure. If the run-time system does not find space to allocate the new cell,
a storage_error will be raised. If so, this exception must be converted into the
more explicit space_overflow¹:

procedure push(s: in out stack; x: in item) is
   top: pcell renames s.top;
   r:   pcell;
begin
   r:= new cell;
   r.all:= (x, top);
   top:= r;
exception
   when storage_error => raise space_overflow;
end push;

The following figures illustrate the three steps. Let us assume that the initial
value of the stack is:

[figure: top points to a chain of cells]

The situation after r:= new cell; is:

[figure: r points to a new, unlinked cell]

The situation after r.all:= (x, top); is:

[figure: the new cell contains x and links to the former top cell]

Finally, the situation after top:= r; (regardless of r, which is a local variable)
is:

[figure: top points to the new cell, which is followed by the former chain]

The reader should also be aware that if initially top=null, then the same
operations lead, as expected, to

[figure: top points to a single cell containing x and a null link]
The elimination of the top element of the stack is rather straightforward:

¹ Each instance of dstack will define its own instance of the exception. So, if dstack_T is
an instance of dstack for items of type T, a specific exception becomes declared and its name
is dstack_T.space_overflow.


procedure pop(s: in out stack) is
   top: pcell renames s.top;
begin
   top:= top.next;
exception
   when constraint_error => raise bad_use;
end pop;

The effect of this assignment is illustrated by the following figures. If the initial
situation is

[figure: top points to the cell containing x, which links to the rest of the chain]

the situation after top:= top.next; is

[figure: top points directly to the rest of the chain; the cell containing x is no longer referenced]

The cell containing x being now inaccessible, it becomes garbage and the
space it occupies can be reused. If the stack initially contained a single item,
then the same operation leads to top=null.
Indeed, if the stack is initially empty, the attempt to consult top.next will
raise the exception constraint_error. This exception must be converted into
the more precise bad_use.

2.2.4 Linked implementation: cursors

Overview
The implementation provided in section 2.2.3 has the advantage that the size
of the stacks is only bounded by the memory that is available in the computer
for the running program. However, it has the hidden drawback of the work to
be done by the run-time system to detect inaccessible blocks and reorganize
the memory to reuse the released blocks². This drawback is less important for
structures that only grow, or shrink occasionally, but it may become serious for
structures such as stacks, that grow and shrink continuously.
In some algorithms, a bounded number of items move from one stack to
another. In such cases, it would be inappropriate to implement each stack
by an array and an index, because when one stack is full we know that the
others are empty, and therefore such an implementation would be wasteful of
memory space. Instead, we may implement all the stacks as linked structures
with cursors on a common array. Each component of the array has a field for
the item and a field to point to the next component of the chain, if any. Each
stack is implemented by an index pointing to the component of the array where
the top of that stack is stored.

² The algorithms for managing the dynamic memory are studied in chapter 8 and, as the
reader shall see, are far from simple and occasionally they may become time consuming.


The array used as support must be hidden in the package body. The available
components must be linked before any operation on stacks is executed and an
index variable named free shall point to the beginning of the linked structure
of available components.
Data representation

According to these considerations, the specification part of the package for this
implementation is:

generic
   type item is private;
   max: integer:= 100;
package dstack is
   -- Exactly as in section 2.2.1 but with pragma pure removed.
private
   type index is new integer range 0..max;
   type stack is
      record
         top: index;
      end record;
end dstack;
and the structure of the body is:

package body dstack is
   type block is
      record
         x:    item;
         next: index;
      end record;
   type mem_space is array(index range 1..index'last) of block;
   ms:   mem_space;
   free: index;

   ... -- Bodies of the subprograms for stack operations and other auxiliary
       -- subprograms, to be detailed below.

begin
   prep_mem_space;
end dstack;

Memory space management

The procedure prep_mem_space is in charge of linking all the blocks. As it is
invoked from the statements part of the package itself, it will be executed before
any other operation of the package. Its body is


procedure prep_mem_space is
begin
   for i in index range 1..index'last-1 loop ms(i).next:= i+1; end loop;
   ms(index'last).next:= 0;
   free:= 1;
end prep_mem_space;

and its effect is to leave all the components of ms linked, starting at the position
indicated by free:

[figure: free holds 1; every component of ms links to the following one, and the last one holds the null mark 0]

We shall require two auxiliary operations: one to get a block from the list of
free blocks (i.e. the operation equivalent to the allocator new for access types),
and one to bring a block back to the pool of free blocks when it becomes
no longer accessible (an operation that is implicit in the case of access types,
but must be explicit in the case of cursors).
The first of these auxiliary subprograms is

function get_block return index is
   r: index;
begin
   if free=0 then raise space_overflow; end if;
   r:= free; free:= ms(free).next; ms(r).next:= 0;
   return r;
end get_block;
The algorithm is quite straightforward: it selects the block pointed to by free
and removes it from the linked chain of free blocks, in a similar way to the pop
operation when the stack was implemented by access types. If the pool of free
blocks is empty, it raises the exception space_overflow. The following figures
illustrate the operation. Initially we have

[figure: free points to the first block of the free chain]

After r:= free; we get

[figure: r and free both point to the first block of the free chain]

After free:= ms(free).next; the result is:

[figure: free points to the second block of the chain; r still points to the first]

After ms(r).next:= 0; the final result is:

[figure: the block pointed to by r is detached from the free chain]


The operation to bring a block back to the pool of available blocks is straightforward too, and the figures below describe the details of the operation.

procedure release_block(r: in out index) is
begin
   ms(r).next:= free;
   free:= r; r:= 0;
end release_block;

Initially we have

[figure: free points to the free chain; r points to the block to be released]

After ms(r).next:= free; the state is

[figure: the released block links to the first block of the free chain]

After free:= r; the result is

[figure: free points to the released block, which now heads the free chain]

And after r:= 0; we finally obtain

[figure: r holds the null mark 0; the free chain starts at the released block]

Implementation of the stack operations

Now we can address the implementation of the operations of stack, which will
again be presented in increasing order of difficulty. The subprograms empty and
is_empty are so straightforward³ that they are self-explanatory:

procedure empty(s: out stack) is
   p: index renames s.top;
begin
   p:= 0;
end empty;

³ At least in this first version. The procedure empty will be revisited below.


function is_empty(s: in stack) return boolean is
   p: index renames s.top;
begin
   return p=0;
end is_empty;

The operation top is straightforward as well. The only significant consideration is that in this implementation the attempt to get the top of an empty
stack has the effect of consulting the ms array at position 0, which is out of its
allowed range. Consequently, a constraint_error will be raised, which must be
converted into a more explicit exception.

function top(s: in stack) return item is
   p: index renames s.top;
begin
   return ms(p).x;
exception
   when constraint_error => raise bad_use;
end top;

The operation push is

procedure push(s: in out stack; x: in item) is
   p: index renames s.top;
   r: index;
begin
   r:= get_block;
   ms(r):= (x, p);
   p:= r;
end push;

As in the implementation with access types, this operation is achieved in
three steps. First, a block of available memory space must be obtained. Second,
the pushed item must be stored in the obtained block. And third, the new block
must be linked in front of the stack. The following figures illustrate these steps.
After r:= get_block; we have

[figure: p points to the current chain of the stack; r points to a fresh block]

After ms(r):= (x, p); the result is

[figure: the fresh block contains x and links to the current top of the stack]

And finally, after p:= r; the result is (regardless of r, which is a local variable):

[figure: p points to the new block, which is followed by the former chain]



The operation pop is

procedure pop(s: in out stack) is


p: index rena.Illes s.top;
r: index;
begin
r:= Pi

p:= ms(p ).next;


release_block(r);
exception
when constraint_error
end pop;

=>

raise bad_use;

This operation works in three steps. First, an auxiliary index keeps the
position of the element to be deleted, because later we shall bring it back to
the pool of free blocks. Then this element is actually removed from the stack,
by making p point to the next element of the stack. Finally, the released block
may actually be brought back to the pool of free blocks because it is no longer
accessible.
The figures below illustrate the three steps. The situation after r:= p; is

[figure: r and p both point to the first block of the stack]

The situation after p:= ms(p).next; is

[figure: p points to the second block of the stack; r still points to the first]

Then the block pointed to by r is inserted back into the pool of free blocks,
and r vanishes because it is a local variable.
If the stack is initially empty, the exception constraint_error will be raised
because then p = 0, which is out of the range of ms.
The operation empty revisited

The implementation of the procedure empty presented above is fine if it is applied for the first time to the actual argument. However, if the actual parameter
is a variable of type stack that has been used previously, and its value is not
the empty stack, the current version of the procedure empty sets the top index
to 0, which leaves the blocks currently in the stack inaccessible, and they cannot
be used any longer.
Indeed, it would be more appropriate to bring the blocks currently in the
stack back to the pool of available blocks. However, this cannot be done if the
actual argument is a variable that is used for the first time, because in this case
its content is undefined.
The way to solve this problem is to force all the variables of type stack to
have an initial value equivalent to an empty stack. This is achieved by declaring
the type stack as:


type stack is
   record
      top: index:= 0;
   end record;
and then, being sure that any unused variable of the type stack has its top field
set to 0, empty can be safely implemented as
procedure empty(s: out stack) is
   p: index renames s.top;
begin
   release_stack(p);
end empty;
where release_stack is a memory management operation that releases all the
blocks currently in the stack by linking the last element of the stack to the first
one of the pool of free blocks:

procedure release_stack(p: in out index) is
   r: index;
begin
   if p/=0 then
      r:= p;
      while ms(r).next /= 0 loop r:= ms(r).next; end loop;
      ms(r).next:= free; free:= p; p:= 0;
   end if;
end release_stack;

2.3 Some general purpose technical remarks

The study of the implementations of the type stack is a useful occasion to make a
few observations that are of general interest in the implementation of any data
type.

2.3.1 Implementation independence

The reader should be aware that for all the implementations of the data
type stack studied in section 2.2, the visible part of the specification of
the package dstack has remained unchanged. Therefore, any software using
the type stack is not affected at all by the replacement of one implementation
of the type by another. Structuring the software in modules that can change
their implementation while keeping their external behavior unchanged is essential
in making software maintenance feasible.

2.3.2 The pragma inline

The need to hide implementation details to achieve implementation independence may force us to declare subprograms that do very little work. For instance, for stacks implemented by an array and an index, the procedure empty
just sets the index to zero. In these cases, the computing time of calling the
subprogram, passing the arguments and returning from it may become many
times greater than that of the involved operation itself. In these cases, the programmer
should use the pragma inline, as explained in section 7.8 of vol. I, to force the
compiler to implement the call by expanding the code inline instead of making
an actual call, so that the resulting code is as efficient as if the implementation
was directly visible to the calling program. In the case of stacks implemented
by arrays, it would be reasonable to add, in the visible part of the specification:

pragma inline(empty, is_empty);
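A sketch of where the pragma would be placed in the compact implementation
(the surrounding declarations are abridged):

generic
   type item is private;
   max: natural:= 100;
package dstack is
   type stack is limited private;
   procedure empty   (s: out stack);
   function is_empty (s: in stack) return boolean;
   pragma inline(empty, is_empty); -- calls are expanded at the call site
   -- ... remaining declarations as in section 2.2.1
private
   -- ... as in section 2.2.2
end dstack;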

2.3.3 The pragma pure

If pragma pure appears in the specification part of a package, the user of that
package may be sure that the package has no internal state. As a consequence,
that package has several nice properties. On one hand, the results of the operations depend only on their arguments, and therefore their behavior can be
algebraically specified according to the techniques presented in section 6.2 of
vol. III, and serve as a basis for formal reasoning on algorithms that use the
type. On the other hand, they can be safely used by concurrent tasks, as will
be discussed in more detail in vol. V. Also, testing becomes simplified.
To guarantee this property, in the presence of pragma pure the compiler will
refuse any declaration of variables both in the specification part of the package
and in its body. Moreover, only packages declared at library level can be pure,
and they shall depend only on other units that are also declared as pure. The
use of access types is rejected too, as the blocks created in dynamic memory act
as global variables.
In the case of our stack, the implementation based on an array plus an
index would be accepted by the compiler in a pure package. Instead, the
linked implementations would not, no matter whether the approach is access-type
based or cursor based⁴. Certainly, in a sequential environment, the results of
the operations depend only on their arguments in all three cases, because the
internal state of the linked implementations is used just for memory management
purposes, but the compiler has no means to guarantee it.
As much as possible, software should be built on the basis of pure units,
because they are usable in any context and it is easier to reason about them.
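A sketch of how the compact implementation of section 2.2.2 could be marked
(assuming, as the text states, that only the array-based representation
qualifies):

generic
   type item is private;
   max: natural:= 100;
package dstack is
   pragma pure; -- promises that the package has no internal state
   type stack is limited private;
   -- ... operations as in section 2.2.1
private
   -- ... array plus index representation, as in section 2.2.2
end dstack;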

2.3.4 On raising exceptions

Exceptions can be raised either by explicit conditionals or by propagation of
other exceptions. According to the first approach we should write

procedure pop(s: in out stack) is
   n: index renames s.n;
begin
   if n=0 then raise bad_use; end if;
   n:= n-1;
end pop;
⁴ NER: If the cursor-based approach is used in a concurrent system, the memory space
might end up in an inconsistent state, because there is no control on the atomicity of the
memory management operations get_block and release_block. Both the access-type based and
the cursor-based approaches might work differently depending on whether the supporting
environment is distributed or not (see vol. V for a detailed rationale).


According to the second one we should write

procedure pop(s: in out stack) is
   n: index renames s.n;
begin
   n:= n-1;
exception
   when constraint_error => raise bad_use;
end pop;
From the point of view of readability, the second approach is preferable
because the effective code, usually more complex than in this small example,
is not obscured by conditionals that only check that the arguments and the
environment conditions satisfy the properties they are expected to.
From the point of view of efficiency, the second approach is also preferable
because, in normal compilation conditions (i.e. setting compilation switches to
generate code for checking ranges, overflows, etc.), the second approach makes
a single check whereas the first one makes two redundant ones. Moreover, it is
quite reasonable that after the testing phase, when the software is expected to
be bug free, it is compiled with switches that suppress all checks. If so, the first
approach will still make an unnecessary check.
However, in some circumstances we might be interested in keeping the raising
of exceptions in tested programs. For example, we may have bug-free software
but not the absolute guarantee that any data set will fit the storage sizing.
If so, the exceptions related to bad use can be raised according to the second
approach, whereas the exceptions related to storage capacity should be raised
by explicit conditionals. In this way, the latter will survive a compilation with
suppressed checks, whereas the former will be automatically removed.
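For the compact implementation of section 2.2.2, the push resulting from this
policy would look as follows (a sketch illustrating the combination just
described):

procedure push(s: in out stack; x: in item) is
   a: mem_space renames s.a;
   n: index     renames s.n;
begin
   -- The capacity test is explicit, so it survives a compilation with
   -- suppressed checks.
   if n = index'last then raise space_overflow; end if;
   n:= n+1; a(n):= x;
end push;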

2.4 Queues

2.4.1 Specification

A queue is a data pool where the elements are retrieved in the same order as they
were inserted. The abstract data type queue has been algebraically specified in
section 6.2.6 of vol. III.
As for stacks, efficiency considerations force us to implement some of the
operations as procedures with parameters of mode in out instead of as functions,
and some exceptions must also be declared so that they can be raised either
in the case of attempts to execute operations upon illegal values of the type,
or in the case of running out of memory space. Consequently, the package
specification should be as follows, where rem_first is an abbreviation for
remove first.
generic
   type item is private;
package dqueue is
   type queue is limited private;
   bad_use:        exception;
   space_overflow: exception;
   procedure empty    (qu: out queue);
   procedure put      (qu: in out queue; x: in item);
   procedure rem_first(qu: in out queue);
   function get_first (qu: in queue) return item;
   function is_empty  (qu: in queue) return boolean;
private
   -- Implementation dependent
end dqueue;

2.4.2 Compact implementation

The most straightforward implementation of a queue is an array and two integers, the first one pointing to the place where the next incoming item must be stored
and the second one pointing to the place of the next item that must be retrieved.
As items are deleted, the components of the array become released
and can be used again. Therefore, the pointers must traverse the array in a
circular way, as shown in the figure, where the shading indicates the unused
components of the array. The easiest way to detect whether the queue is empty or
full is to add a counter that keeps the number of items stored in the queue.

[figure: an array of max components traversed circularly; the n components currently in the queue lie between the two indexes, the rest are unused]

As the size of the array has to be established, it is most appropriate to provide
it as an additional formal argument of the generic declaration. Providing a
default value makes this argument optional, so that any instance of the package
declared in section 2.4.1 also works for this new version.
Consequently, we get:
Consequently, we get:
generic
   type item is private;
   max: natural:= 100;
package dqueue is
   -- Exactly as in section 2.4.1
private
   type index is new integer range 1..max;
   type mem_space is array(index) of item;
   type queue is
      record
         a:    mem_space;
         p, q: index;
         n:    natural;
      end record;
end dqueue;


The reader should note that, although the counter will take the same values
as the indexes (plus 0), the counter is declared of an incompatible type to
prevent it from being used as an index by mistake.
The operations for this implementation are rather obvious and the resulting
package body becomes:
package body dqueue is

   procedure empty(qu: out queue) is
      p: index   renames qu.p;
      q: index   renames qu.q;
      n: natural renames qu.n;
   begin
      p:= 1; q:= 1; n:= 0;
   end empty;

   procedure put(qu: in out queue; x: in item) is
      a: mem_space renames qu.a;
      p: index     renames qu.p;
      n: natural   renames qu.n;
   begin
      if n=max then raise space_overflow; end if;
      a(p):= x; p:= p mod max + 1; n:= n+1;
   end put;

   procedure rem_first(qu: in out queue) is
      q: index   renames qu.q;
      n: natural renames qu.n;
   begin
      if n=0 then raise bad_use; end if;
      q:= q mod max + 1; n:= n-1;
   end rem_first;

   function get_first(qu: in queue) return item is
      a: mem_space renames qu.a;
      q: index     renames qu.q;
      n: natural   renames qu.n;
   begin
      if n=0 then raise bad_use; end if;
      return a(q);
   end get_first;

   function is_empty(qu: in queue) return boolean is
      n: natural renames qu.n;
   begin
      return n=0;
   end is_empty;

end dqueue;
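As with stacks, a small hypothetical instantiation (the names are illustrative
only) shows the first in, first out behavior:

with dqueue;
procedure queue_demo is
   package int_queue is new dqueue(item => integer, max => 10);
   use int_queue;
   qu: queue;
begin
   empty(qu);
   put(qu, 1);
   put(qu, 2);
   -- get_first(qu) = 1: the oldest item is retrieved first
   rem_first(qu);
   -- get_first(qu) = 2 now
end queue_demo;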


2.4.3 Linked implementation: pointers

In many circumstances, it is not easy to foresee a safe upper bound for the size
of the queues. Consequently, we might be interested in providing an implementation that is only bounded by the memory space effectively available to the
program. This can be achieved by a linked structure such as:

[figure: q points to the first cell of a chain and p to the last one, whose link is null]

where p points to the item most recently stored in the queue and q points to
the oldest item in the queue. According to this implementation, the type queue
is represented by these two pointers and the resulting private part becomes:
generic
   type item is private;
package dqueue is
   -- Exactly as in section 2.4.1
private
   type cell;
   type pcell is access cell;
   type cell is
      record
         x:    item;
         next: pcell;
      end record;
   type queue is
      record
         p, q: pcell;
      end record;
end dqueue;

The implementation of the operations is rather simple. An empty queue
is represented by a chain containing no element. Consequently, both pointers
must have the value null:

procedure empty(qu: out queue) is
   p: pcell renames qu.p;
   q: pcell renames qu.q;
begin
   p:= null; q:= null;
end empty;

Checking whether the queue is empty or not is equivalent to checking whether
either of the two pointers is null:

function is_empty(qu: in queue) return boolean is
   q: pcell renames qu.q;
begin
   return q=null;
end is_empty;


Getting the first item of the queue, i.e. the oldest one, is as simple as getting the item pointed to by q. However, if q=null, the attempt to dereference it
will raise the exception constraint_error, which has to be converted into the
more explicit bad_use because it results from applying an operation to an illegal
argument:

function get_first(qu: in queue) return item is
   q: pcell renames qu.q;
begin
   return q.x;
exception
   when constraint_error => raise bad_use;
end get_first;
The operation put is achieved in three steps. First, a new cell must be allocated. Second, the item must be copied into that cell. Third, the cell must be
linked to the current structure. If the queue was empty, the new cell becomes
the only element of the new queue and is both the first and the last element.
Otherwise, the new cell must be linked after the one that is currently the last.
If the run-time system is unable to find space to allocate the new cell, a storage_error exception shall be raised. If so, this exception must be converted into
the more explicit space_overflow:

procedure put(qu: in out queue; x: in item) is
   p: pcell renames qu.p;
   q: pcell renames qu.q;
   r: pcell;
begin
   r:= new cell;
   r.all:= (x, null);
   if p=null then -- qu is empty
      p:= r; q:= r;
   else
      p.next:= r; p:= r;
   end if;
exception
   when storage_error => raise space_overflow;
end put;
The first case is illustrated by the following figure:

[figure: initial state, an empty queue with p = q = null; final state, p and q both point to the single new cell]

For the second case, after r:= new cell; r.all:= (x, null); we have the
following situation:

[figure: q points to the first cell of the chain and p to the last one; r points to a new, unlinked cell containing x]

After p.next:= r; we get

[figure: the former last cell links to the new cell]

And after p:= r; we finally get (ignoring r, which is local):

[figure: p points to the new cell, now the last of the chain]

The elimination of the first element of the queue is rather straightforward:

procedure rem_first(qu: in out queue) is
   p: pcell renames qu.p;
   q: pcell renames qu.q;
begin
   q:= q.next;
   if q=null then p:= null; end if;
exception
   when constraint_error => raise bad_use;
end rem_first;
The following figures illustrate the general situation. Initially we have:

[figure: q points to the first cell, containing a, which links to the rest of the chain; p points to the last cell]

After q:= q.next; we have

[figure: q points to the second cell; the cell containing a is no longer referenced]

The cell containing a, being now inaccessible, becomes garbage and the space
of memory it occupies can be reused. The exception constraint_error will be
raised if the queue is empty, as a result of attempting to dereference a null
pointer.
If the queue contains only one element, then p must be updated as well. The
following figure illustrates the situation:

[figure: initial state, p and q point to the single cell; final state, both are null]


2.4.4 Linked implementation: cursors

Data representation

Queues, like stacks, can also be implemented by cursors, with identical advantages and drawbacks. The principles of memory management are almost the
same as those studied for stacks in section 2.2.4, and the principles of chaining are the same as those studied for queues implemented by pointers in section 2.4.3. Therefore the declarations for the implementation of the type queue
are:

generic
   type item is private;
   max: natural:= 100;
package dqueue is
   -- Exactly as in section 2.4.1 but with pragma pure removed.
private
   type index is new integer range 0..max;
   type queue is
      record
         p, q: index:= 0;
      end record;
end dqueue;
Memory space management

The management of the memory space for queues is the same as for stacks.
Therefore, the declarations of the types block and mem_space, as well as the
variables ms and free, are exactly the same in the package body dqueue as in the
package body dstack of section 2.2.4. The procedures prep_mem_space, get_block
and release_block are also the same, as is the invocation of prep_mem_space
in the initialization part of the package.

However, release_queue can be more efficient than release_stack because the
fact of having pointers to both the beginning and the end of the queue allows the
insertion of the released blocks to the stack of free ones to be done in constant
time:
procedure release_queue(p, q: in out index) is

-Precondition: p/=0 and q/=0


begin
ms(p) .next:= free;
free:= q;
p:= 0; q:= 0;
end release_queue;
The following figures illustrate the operation. Initially we have:

[figure: q indexes the first block of the queue chain and p the last one; free indexes the free chain]


The state after ms(p).next:= free; is

And after free:= q; p:= 0; q:= 0; is

Implementation of the queue operations

The subprograms empty and is_empty are so simple that they are self-explanatory:

procedure empty(qu: out queue) is
   p: index renames qu.p;
   q: index renames qu.q;
begin
   if q/=0 then release_queue(p, q); end if;
end empty;

function is_empty(qu: in queue) return boolean is
   q: index renames qu.q;
begin
   return q=0;
end is_empty;

The operation get_first is straightforward as well. The only significant consideration is that in this implementation the attempt to get the first element of
an empty queue has the effect of consulting the ms array at position 0, which is
out of its allowed range. Consequently, a constraint_error will be raised, which
must be converted into a more explicit exception.

function get_first(qu: in queue) return item is
   q: index renames qu.q;
begin
   return ms(q).x;
exception
   when constraint_error => raise bad_use;
end get_first;
The operation put is achieved in three steps, as in the case of the implementation using access types. First, a block of available memory space must be
obtained. Second, the new item must be stored in the obtained block. And
third, the new block must be linked at the end of the queue. As in the case of
the implementation using access types too, the link operation has two different
cases, depending on whether the queue is initially empty or not.


procedure put(qu: in out queue; x: in item) is
   p: index renames qu.p;
   q: index renames qu.q;
   r: index;
begin
   r:= get_block;
   ms(r):= (x, 0);
   if q=0 then -- qu is empty
      p:= r; q:= r;
   else
      ms(p).next:= r; p:= r;
   end if;
end put;

For the case that the queue is initially empty, it is clear that the previous
algorithm leads to

[figure: p and q both hold the index of the single block, whose link is 0]

For the case of a non-empty queue, after r:= get_block; ms(r):= (x, 0); the
situation is:

[figure: q indexes the first block of the chain and p the last one; r indexes a fresh block containing x]

After ms(p).next:= r; the situation is

[figure: the former last block links to the fresh block]

After p:= r; the situation is, regardless of r, which is a local variable:

[figure: p indexes the fresh block, now the last of the chain]

The operation rem_first also works in three steps. First, an auxiliary index
keeps the position of the element to be deleted, because later we shall bring it
back to the pool of free blocks. Then this element is actually removed from
the queue, by making q point to the next element of the queue. As in the
implementation using access types, if as a result q becomes null, then p must
become null as well.
Finally, the released block may actually be brought back to the pool of free
blocks, because it is no longer accessible. The complete algorithm is:


procedure rem_first(qu: in out queue) is
   p: index renames qu.p;
   q: index renames qu.q;
   r: index;
begin
   r:= q;
   q:= ms(q).next;
   if q=0 then p:= 0; end if;
   release_block(r);
exception
   when constraint_error => raise bad_use;
end rem_first;

The following figures describe the case of a queue that initially contains
more than one component. Initially the situation is:

[figure: q indexes the first block of the chain, p the last one]

After r:= q; we get

[figure: r and q both index the first block]

After q:= ms(q).next; the result is

[figure: q indexes the second block; r still indexes the first]

Once the block pointed to by r has been brought back to the pool of free blocks,
the final state is:

[figure: the queue starts at the former second block; the released block heads the free chain]

If the queue initially contains only one element, the initial situation is:

[figure: p and q index the single block of the queue]

After r:= q; q:= ms(q).next; if q=0 then p:= 0; end if; the situation is

[figure: p and q hold 0; r indexes the released block]

Then the block pointed to by r goes back to the pool of free blocks and r
vanishes because it is a local variable.
If the queue is initially empty, q = 0, and a constraint_error will be raised
because this value is out of the range of ms.

2.5 Lists

2.5.1 Principles

A linked list is a linked structure where each component has at most one successor. Linked lists are a powerful mechanism for the implementation of many
different abstract data types. They are so frequently used that, for simplicity,
they are often just called lists. The linked implementations of stacks and
queues are particular applications of lists. In chapters 4, 5 and 6, lists will be
applied to the implementation of further data types.
However, a list is not, properly speaking, an abstract data type, because
the essential aspect of lists is precisely the links, which is, intrinsically, an
implementation issue. Any attempt to characterize the behavior of a list regardless of the links leads to the specification of either multisets, i.e. sets that
can keep multiple instances of the same element, or (unbounded) strings of elements, i.e. multisets that keep the order of insertion and can be concatenated.
However, multisets can be implemented either as linked lists or by other means,
and it is rather unclear whether the other implementations of multisets have the
same advantages from the point of view of memory management, flexibility and
efficiency.

The application of lists to the implementation of stacks and queues has shown
their advantages from the point of view of memory management. However,
stacks and queues require that insertions and deletions are done only at the
ends of the linked structure and, from the point of view of the operations, they
are insufficiently illustrative of the advantages of linked lists. Therefore, it
is convenient to devote some attention to other frequent operations on lists.

2.5.2 Usual operations

Insertion

Lists are very flexible for making insertions at arbitrary positions. Usually, this
operation arises from the requirement to keep a set of elements in some order.
Let us assume that the list is provided, as usual, by a pointer to the first element
of the list as in the next figure, where the list may be empty, in which case first
will be equal to null. Let us assume too that the elements in the list are sorted
in some ascending order.
The insertion of a new element must proceed in three steps. First, the right
position for the insertion is found. Second, a new cell is created to store the
new element. And third, the created cell is linked into the list.
The first step is a search for the first element in the list that is greater than
the element to be inserted. As the insertion has to be done before that position,
say p, it is required to keep a pointer to the cell that is in the position previous
to the current one, say pp (for previous to p).
The linkage step has to consider the possibility that the new element has
to be inserted at the beginning of the list, either because the list was initially
empty or because the new element is lower than any other one currently in the
list.
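The algorithm below relies on the usual declarations for cells and pointers to
cells, which are assumed to be in scope. A minimal sketch of them is:

type cell;
type pcell is access cell;
type cell is
   record
      x:    item;
      next: pcell;
   end record;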


The resulting algorithm is:

procedure insert(first: in out pcell; x: in item) is
   p, pp, r: pcell;
begin
   pp:= null; p:= first;
   while p/=null and then p.x<=x loop
      pp:= p; p:= p.next;
   end loop;
   r:= new cell; r.all:= (x, null);
   r.next:= p;
   if pp=null then first:= r; else pp.next:= r; end if;
end insert;

The following figures illustrate the linkage in the case that pp/=null, assuming
that the new cell must be inserted immediately before the cell containing c.
Initially we have

[figure: pp points to the cell before c, p points to the cell containing c]

After r.next:= p; pp.next:= r; the result is

[figure: the new cell is linked between pp and p]

Then p, pp and r vanish because they are local variables. The reader should
check by himself that the algorithm also works when the insertion has to be
done either at the beginning or at the end of the list. In the first case pp=null
at the end of the search step. In the second case p=null at the end of the search
step.
Deletion
Lists are also well suited to deleting components from arbitrary positions. The
usual requirement is to delete an element that satisfies some property PR. The
removal must proceed in two steps. First, the cell containing the element to
be deleted is found. Second, the links are rearranged so that that cell becomes
excluded from the list.
The first step is a search for the first element in the list that satisfies PR. As
for the insertion, it is required to keep a pointer to the cell that is in the position


previous to the current one. In principle, we should consider the possibility that
the search is unsuccessful, in which case the loop will end with p=null.
The linkage step has to consider the possibility that the element to be deleted
is the first one in the list.
The resulting algorithm is:

procedure remove(first: in out pcell) is
   p, pp: pcell;
begin
   pp:= null; p:= first;
   while p/=null and then not PR(p.x) loop
      pp:= p; p:= p.next;
   end loop;
   if p/=null then
      if pp=null then first:= p.next;
      else pp.next:= p.next;
      end if;
   end if;
end remove;
The following figures illustrate the linkage in the case that the cell to be
removed is the one containing c. Initially we have

[figure: pp points to the predecessor of the cell containing c, p points to the cell containing c]

After pp.next:= p.next; the result is

[figure: the cell containing c is bypassed by the links]

Then p and pp vanish because they are local variables, and the cell pointed
to by p becomes garbage to be reused because it is no longer accessible.

Inversion
The problem of reversing the order of the elements stored in a list is illustrative
of the flexibility of this data structure. The algorithm is as simple as:

procedure invert(first: in out pcell) is
   p, pp, np: pcell;
begin
   pp:= null; p:= first;
   while p/=null loop
      np:= p.next; p.next:= pp;
      pp:= p; p:= np;
   end loop;
   first:= pp;
end invert;


The following figures illustrate a step of the loop. At the beginning of the
loop's body the state is:

[figure: pp points to the already inverted part, p to the rest of the list]

After np:= p.next; p.next:= pp; the state is

[figure: p now points back to pp, np points to the rest of the list]

And after pp:= p; p:= np; the state is

[figure: the inverted part has grown by one cell]

The reader should be aware that these steps also work at the beginning of
the list (in which case pp=null) and at the end (in which case p.next=null).
The principle of this algorithm is also used by the algorithms for garbage
collection, which have to traverse linked structures recursively precisely when
there is no memory space to allocate any stack, as described in section 8.3.3.
Concatenation

The concatenation of lists is very simple if the lists are implemented by pointers
to both ends of the list, as in

type list is
   record
      first, last: pcell;
   end record;
Then the algorithm for concatenation (conceptually equivalent to a:= a & b;
b:= empty) is

procedure concatenate(a, b: in out list) is
   c: list;
begin
   if a.first=null then
      c:= b;
   elsif b.first=null then
      c:= a;
   else
      a.last.next:= b.first;
      c.first:= a.first;
      c.last:= b.last;
   end if;
   a:= c; b:= (null, null);
end concatenate;


Traversal and search

Indeed, the elements of a list can also be traversed and searched for, and the
algorithms are so obvious that at this stage the reader should be able to write
them down by himself.
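As a reference, a minimal sketch of the search, returning a pointer to the first
cell that contains a given value (or null if the value is not in the list), could be:

function search(first: in pcell; x: in item) return pcell is
   p: pcell;
begin
   p:= first;
   while p/=null and then p.x/=x loop
      p:= p.next;
   end loop;
   return p;  -- null if x has not been found
end search;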
Merge
Sorted lists can also be merged using algorithms very similar to those studied
for sequential files in section 8.5 of vol. II. However, the lists have the advantage
that the main data do not have to be copied. Only the pointers need to be
updated.
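As an illustration, a minimal sketch of such a merge, assuming that the two
input lists are sorted in ascending order (with an ordering available on items,
as in the insertion algorithm above) and that they are consumed to produce
the result, could be:

procedure merge(a, b: in out pcell; c: out pcell) is
   last: pcell;
begin
   if a=null then c:= b;
   elsif b=null then c:= a;
   else
      -- Choose the first cell of the result.
      if a.x<=b.x then c:= a; a:= a.next;
      else c:= b; b:= b.next;
      end if;
      last:= c;
      -- Repeatedly relink the lowest remaining cell; no item is copied.
      while a/=null and b/=null loop
         if a.x<=b.x then last.next:= a; a:= a.next;
         else last.next:= b; b:= b.next;
         end if;
         last:= last.next;
      end loop;
      if a/=null then last.next:= a; else last.next:= b; end if;
   end if;
   a:= null; b:= null;
end merge;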

2.5.3 Variants

The principles of linear linking can be applied in a very flexible way. This section
describes the most important patterns.
Multilists
Sometimes it is convenient that a cell containing some data belongs to more
than one list. For instance, a cell containing data about a book can belong to
the list of books of the same author, to the list of books of the same title and
to the list of books in ascending order of ISBN. In these cases, each cell has as
many pointers as lists it may belong to:

[figure: a cell containing the data of a book, with incoming and outgoing links
for three lists: books by the same author, books with the same title, and books
in ascending order of ISBN]

The algorithms for constructing such lists are as simple as for conventional
lists: the cell is allocated once and then it is linked as many times as the number
of lists it belongs to.
The algorithms for deleting a cell from a multilist require the traversal of
every list that the cell belongs to, because of the need to update the pointer of
the predecessor cell in every one of the lists. This drawback can be solved by the
use of doubly linked lists, which are explained below.
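A sketch of the cell type for the book example, with one link per list the cell
may belong to (the names are illustrative and the type book is assumed to hold
the actual data), could be:

type book_cell;
type pbook is access book_cell;
type book_cell is
   record
      b:           book;   -- the data about the book itself
      next_author: pbook;  -- next book by the same author
      next_title:  pbook;  -- next book with the same title
      next_isbn:   pbook;  -- next book in ascending order of ISBN
   end record;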

Doubly linked lists

A list can be organized so that each cell contains not only a pointer to the next
cell in the list, but also one to the previous cell:

[figure: a doubly linked list]


Doubly linked lists have the advantage that the algorithms for insertion and
deletion do not require saving a pointer to the previously visited cell. This is
particularly important when they are used in multilist structures,
because once a cell has been found it can be deleted from any list it belongs to
with no traversal of those lists.
Doubly linked lists have some drawbacks too. They occupy more space, and
the operations for insertion and deletion take more time because more pointers
have to be updated. However, we shall see in chapter 4 some applications in
which the advantages outweigh the drawbacks.
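As an illustration, a sketch of the deletion of the cell pointed to by p from a
non-circular doubly linked list (assuming the cell record has fields prev and
next) shows that no pointer to the previously visited cell has to be kept:

procedure remove(first: in out pcell; p: in pcell) is
begin
   if p.prev=null then first:= p.next;       -- p was the first cell
   else p.prev.next:= p.next;
   end if;
   if p.next/=null then p.next.prev:= p.prev; end if;
end remove;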

Circular lists

Sometimes it is convenient that the last element in a list points back to the first
one, as the figure below describes. In this case, the list is called circular.

[figure: a circular list]
Circular lists can be used in combination with multilists, so that when a cell
is reached through one of the lists according to some criterion (for instance, when
the cell that contains the data of a given book has been found by following the
list in ascending order of ISBN), we can then obtain other cells related to
the one found (for instance, books by the same author). Indeed, the book found
through the ISBN list is not necessarily the first one in the list of books by the
same author and, consequently, it is not possible to retrieve all of them unless
the list of books by the same author is circular.
Sentinels

Lists can be extended with dummy cells at either end that make some algorithms
simpler. These dummy cells are called sentinels.
For instance, if a list which is sorted in ascending order is extended with
a dummy cell at the end that contains a value higher than any of the
possible values of the significant items, then the search and merge algorithms
become as much simplified as those corresponding to the sequential files described
in chapter 8 of vol. II. The following figure illustrates the structure, where the
sentinel appears shaded:

[figure: a sorted list ending in a shaded sentinel cell]

A dummy cell at the beginning of a list makes the insertion and deletion
algorithms simpler, because the first (significant) cell is no longer a special case:
it also has a predecessor cell (the sentinel). The figure below illustrates
the structure, where the sentinel appears shaded:

[figure: a list headed by a shaded sentinel cell, pointed to by first]
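For instance, with a sentinel at the beginning, the sorted insertion shown
earlier loses its special case, because pp can never be null. A sketch under this
assumption (first points to the sentinel and never changes) could be:

procedure insert(first: in pcell; x: in item) is
   p, pp, r: pcell;
begin
   pp:= first; p:= first.next;
   while p/=null and then p.x<=x loop
      pp:= p; p:= p.next;
   end loop;
   r:= new cell'(x, p);
   pp.next:= r;   -- pp is never null: no special case
end insert;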


Indeed, sentinels consume memory space, and their use is more appropriate
for lists that are expected to be so long that this amount is small in relation
to the total amount consumed by the list.

Variants combined

All the previous variants can be combined in any way: lists can have sentinels
either at one end or at both ends, can be circular and also doubly linked, can
be circular and include a sentinel at the beginning too, etc. In all these cases
the algorithms that manage them are quite obvious, and the reader can develop
them by himself if he has a good understanding of the algorithms on linked
structures described in the previous sections of this chapter.

2.5.4 Final remarks

The organization of data in linked lists is very useful for implementing very
different kinds of types. In this chapter we have applied them to implement
stacks and queues. In the next chapters they are applied again to implement
sets, cartesian products and graphs.
From the point of view of reorganizing data, the use of links has the advantage
that the actual data, which is usually large, does not have to be copied back
and forth once it has been stored in a cell. Instead, any reorganization requires
updating only the pointers, which is much more cost effective.
Lists can be used in very flexible manners and with plenty of variants. For
this reason it is easier for a programmer to adapt the technique to each special
circumstance than to use general purpose libraries, which will seldom be completely
adapted to each particular case, just to save the rather simple operations on links.
This volume addresses, with a few exceptions, pointers in primary storage.
Nevertheless, the concept of pointer can also be extended to secondary storage.
In this case, a pointer can indicate either a record in a direct file, a physical
page of a disk, or a record that is stored in some position on a physical page of a
disk. Most algorithms studied in this book that manage linked data in primary
storage can be easily extended to data in secondary storage, and the reader is
expected to be able to make that adaptation by himself if he needs to.
Multilists, circular lists and doubly linked lists have been used extensively in
the implementation of database systems based on either the hierarchical model
or the network model. These kinds of lists are less extensively used nowadays
as a consequence of the substitution of the hierarchical and network database
models by new systems based on the relational model.


Chapter 3

TREES

3.1 Introduction

The concept of tree was introduced in section 3.7 of vol. III. The reader should
find there the general terminology, the definitions of ordinary and binary trees,
the way ordinary trees can be represented by binary trees, the recursive
algorithms for the different kinds of traversals, and the derivation of the equivalent
iterative versions.
However, in vol. III trees are considered as a conceptual tool to help the
understanding and the analysis of algorithms, as well as to help the transformation
of recursive algorithms into iterative ones in the case of compound recursion.
No materialization of trees as a data structure was considered there.

In this volume trees are considered as true data structures that represent
hierarchical data such as algebraic expressions, the result of the syntax analysis
of programming languages, directories of file systems, or menus having submenus
in graphical user interfaces. Indeed, the implementation of trees is addressed too.

3.2 Binary trees

3.3 Specification

Section 6.2.7 of vol. III provides the algebraic specification of binary trees, which
in turn can be used to represent ordinary trees.
As for stacks and queues, efficiency considerations force us to implement
most of the operations of that specification as procedures with parameters of
mode in out instead of actual functions. Also, some exceptions must be declared,
to be raised either in the case of attempts to execute operations upon illegal
values of the type or in the case of running out of memory space.
After these considerations, the package specification part should be as follows:


generic
   type item is private;
package dbinarytree is
   type tree is limited private;
   bad_use:        exception;
   space_overflow: exception;

   procedure empty(t: out tree);
   function  is_empty(t: in tree) return boolean;
   procedure graft(t: out tree; lt, rt: in tree; x: in item);
   procedure root (t: in tree; x: out item);
   procedure left (t: in tree; lt: out tree);
   procedure right(t: in tree; rt: out tree);
private
   -- Implementation dependent
end dbinarytree;
The management of binary trees would be easier if this package were extended
with the additional operations atom, e_left and e_right, having the following
equivalences:

   atom(x)    = graft(empty, x, empty);   -- builds a tree of a single node
   e_left(t)  = not is_empty(left(t));    -- there exists a non-empty left child
   e_right(t) = not is_empty(right(t));   -- there exists a non-empty right child

We leave this extension as an exercise to the reader.
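One possible shape of this extension, written only in terms of the operations
already specified (atom as a procedure, for the same efficiency reasons as graft),
is:

procedure atom(t: out tree; x: in item) is
   e: tree;
begin
   empty(e);
   graft(t, e, e, x);
end atom;

function e_left(t: in tree) return boolean is
   lt: tree;
begin
   left(t, lt);
   return not is_empty(lt);
end e_left;

function e_right(t: in tree) return boolean is
   rt: tree;
begin
   right(t, rt);
   return not is_empty(rt);
end e_right;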

3.4 Implementation of binary trees

The natural way of implementing binary trees is by means of pointers towards
the left and right children of each node, as illustrated by the figure below:

[figure: each node holds an item and pointers to its left and right children]

Indeed, the pointers can be either access types or cursors. The implementation
by access types is described here, whereas its conversion to cursors is left
as an exercise to the reader. The declaration of the type is:
generic
   type item is private;
package dbinarytree is
   -- Exactly as in section 3.3
private
   type node;
   type pnode is access node;

   type node is
      record
         x:    item;
         l, r: pnode;
      end record;

   type tree is
      record
         root: pnode;
      end record;
end dbinarytree;

The operations empty and is_empty are so simple that they are self-explanatory:

procedure empty(t: out tree) is
   p: pnode renames t.root;
begin
   p:= null;
end empty;

function is_empty(t: in tree) return boolean is
   p: pnode renames t.root;
begin
   return p=null;
end is_empty;
The operation for building a new tree, having the provided item at the root
and the provided trees as left and right children, is simple too:

procedure graft(t: out tree; lt, rt: in tree; x: in item) is
   p:  pnode renames t.root;
   pl: pnode renames lt.root;
   pr: pnode renames rt.root;
begin
   p:= new node;
   p.all:= (x, pl, pr);
exception
   when storage_error => raise space_overflow;
end graft;

The following figure illustrates the effect of the operation:

[figure: the item x and the trees lt and rt joined under a new root node]
The operation to get the data stored at the root node and the operations
to get the children are straightforward as well. In all these operations, the
exception constraint_error will be raised if p=null, ie. if the actual parameter
is an empty tree, and it must be converted into a more explicit one:

procedure root(t: in tree; x: out item) is
   p: pnode renames t.root;
begin
   x:= p.x;
exception
   when constraint_error => raise bad_use;
end root;

procedure left(t: in tree; lt: out tree) is
   p:  pnode renames t.root;
   pl: pnode renames lt.root;
begin
   pl:= p.l;
exception
   when constraint_error => raise bad_use;
end left;
procedure right(t: in tree; rt: out tree) is
   p:  pnode renames t.root;
   pr: pnode renames rt.root;
begin
   pr:= p.r;
exception
   when constraint_error => raise bad_use;
end right;

3.5 Use of trees: a case study

3.5.1 General considerations

As described in section 3.7 of vol. III, binary trees can be used to represent
ordinary trees and, consequently, they are a universal tool for the implementation
of hierarchical structures. Nevertheless, their implementation is so simple that
the use of a general purpose package does not save much effort. Instead,
in many cases it is more practical to adapt the concept of tree and its basic
linking mechanisms to each particular circumstance. This section is devoted to
developing a case study that illustrates the approach.

3.5.2 Statement of the problem

Exercise 1 Given an arithmetic expression and a variable name, we want to
construct an algorithm that writes the derivative of the expression with respect
to that variable. The expression may contain variables, constants, operators and
function calls. It may also contain parentheses up to any depth and it is expected
to end with a semicolon.
The operators allowed are addition, subtraction, product and division, with
their usual representation, plus the power operator, which is indicated by the
symbol '^'. The functions allowed are exp, ln, sin and cos.
In order to make the algorithms simpler, we may assume that the variables
all have single-letter names. Moreover, the input expression contains


neither spaces nor any other separating character. Besides these constraints, the
input and the output must conform to the usual syntax of Ada expressions.

3.5.3 Expressions: specification

The management of expressions requires the means to represent them
in a suitable manner. As they have a hierarchical structure, their representation
should be hierarchical as well. For instance, the expression x - y + z - 2*cos(x)
can be represented by a hierarchy such as

[figure: the expression as a tree, with operators at the internal nodes and
variables and constants at the leaves]

Certainly, this hierarchy may be represented by a binary tree, where the
data contained by each node can be either a constant, a variable or an operator,
as the functions considered can be dealt with as unary operators.
Nevertheless, the fact that a node has zero, one or two children is not
independent of the kind of information that the node contains. Moreover, for
nodes having a single child there is no need to distinguish whether it is left or
right. If we use a general purpose binary tree, the compiler can not prevent us
from building inconsistent data by mistake. Instead, we can apply the basic
principles of binary trees to provide a specific type to represent expressions, so
that no inconsistent data can ever be built.
Several benefits result from the use of such a tailored type. First, the attempt
to build inconsistent data can be detected at compile time, and consequently
it becomes easy to identify and remove, reducing the development time. And
second, the resulting code is more readable, and consequently less error prone
and easier to maintain, as we shall see.
The package that declares the type expression should also declare the auxiliary
types that characterize the different kinds of expressions and operators.
Indeed, an expression can be either a variable, a constant, a compound expression
rooted by a unary operator or a compound expression rooted by a binary
operator. From this point of view, the unary minus and the binary minus have
to be considered as different operators.
A special kind of expression, the null expression, is convenient to characterize
the result of reading an expression when errors are found in the data supplied
to the program.
Regarding the operations, the package must provide those required to build
the different kinds of expressions and to get the components of a given expression.
Indeed, the basic principles of the operations for building expressions are the
same as those of atom and graft for binary trees.
According to these considerations, we obtain the following package specification:


package d_expr is
   type expression_type is (e_null, e_const, e_var, e_un, e_bin);
   type un_op  is (neg, sin, cos, exp, ln);
   type bin_op is (add, sub, prod, quot, power);
   type expression is private;

   -- Operations to build expressions:
   function b_null                    return expression;
   function b_constant(n: in integer) return expression;
   function b_var(x: in character)    return expression;
   function b_un_op (op: in un_op;  esb: in expression)    return expression;
   function b_bin_op(op: in bin_op; e1, e2: in expression) return expression;

   -- Operations to get components from expressions:
   function  e_type (e: in expression) return expression_type;
   function  g_const(e: in expression) return integer;
   function  g_var  (e: in expression) return character;
   procedure g_un (e: in expression; op: out un_op;  esb: out expression);
   procedure g_bin(e: in expression; op: out bin_op; esb1, esb2: out expression);
private
   -- Implementation dependent
end d_expr;

3.5.4 Expressions: implementation (access types)

The implementation of the type expression is similar to the implementation of
binary trees, with the difference that the contents of a cell is a discriminated
record and that the number of pointers depends on the value of the discriminant.
The subtype pnode just creates a synonym of expression in the internal
scope of d_expr, for the sake of improving the readability of the programs
that use the package: externally we want the name of the type to be expression,
regardless of whether the implementation is by access types or not; internally,
we want to use the name pnode to make it clear that the variable is a
pointer and that it refers to a particular node in the structure.
The discriminant of the type node does not have a default value because none
of the operations to implement requires changing the value of that discriminant.
As a consequence, each time space is allocated for a new node, its size will
never be greater than strictly required. Consequently, we get:
package d_expr is
   -- Exactly as in section 3.5.3
private
   type node;
   type expression is access node;
   subtype pnode is expression;


   type node(et: expression_type) is
      record
         case et is
            when e_null =>
               null;
            when e_const =>
               val: integer;
            when e_var =>
               var: character;
            when e_un =>
               opun: un_op;
               esb:  pnode;
            when e_bin =>
               opbin: bin_op;
               esb1, esb2: pnode;
         end case;
      end record;
end d_expr;
The implementation of the operations is straightforward. In order to focus
on the essential aspects of the problem, the details of raising exceptions in the
case of lack of memory space or bad use have been omitted. It is clear that an
inconsistency between the type of an expression and the operands we expect to
get from it is a case of bad use, which in this implementation will have as a
consequence the raising of a constraint_error exception. The reader is referred
to chapter 2 for a detailed discussion of these aspects.
package body d_expr is

   function b_null return expression is
      p: pnode;
   begin
      p:= new node'(et => e_null);
      return p;
   end b_null;

   function b_constant(n: in integer) return expression is
      p: pnode;
   begin
      p:= new node'(e_const, n);
      return p;
   end b_constant;

   function b_var(x: in character) return expression is
      p: pnode;
   begin
      p:= new node'(e_var, x);
      return p;
   end b_var;


   function b_un_op(op: in un_op; esb: in expression) return expression is
      p: pnode;
   begin
      p:= new node'(e_un, op, esb);
      return p;
   end b_un_op;

   function b_bin_op
      (op: in bin_op; e1, e2: in expression) return expression is
      p: pnode;
   begin
      p:= new node'(e_bin, op, e1, e2);
      return p;
   end b_bin_op;

   function e_type(e: in expression) return expression_type is
   begin
      return e.et;
   end e_type;

   function g_const(e: in expression) return integer is
   begin
      return e.val;
   end g_const;

   function g_var(e: in expression) return character is
   begin
      return e.var;
   end g_var;

   procedure g_un
      (e: in expression; op: out un_op; esb: out expression) is
   begin
      op:=  e.opun;
      esb:= e.esb;
   end g_un;

   procedure g_bin
      (e: in expression; op: out bin_op; esb1, esb2: out expression) is
   begin
      op:=   e.opbin;
      esb1:= e.esb1;
      esb2:= e.esb2;
   end g_bin;

end d_expr;

3.5.5 The overall solution

Once we have discussed the type that is essential to solve the problem and its
relation with trees, we can address the overall structure of the solution of the
proposed exercise, to show the usefulness of the proposed type. Indeed, the
main program must read the input expression from a file and build the
appropriate internal representation. Then it has to obtain the derivative,
simplify it and write the result down. Consequently, we need three variables of
the type expression: one for the source, one for the derivative and one for the
simplified result. The main program becomes:
with Ada.text_io, Ada.command_line, algebraic, d_expr;
use  Ada.text_io, Ada.command_line, algebraic, d_expr;
procedure manip_algeb is
   f: file_type;
   e, de, sde: expression;
   x: character;  -- the variable to derive with respect to
begin
   open(f, in_file, argument(1)&".txt");
   read(f, e);
   close(f);
   x:= argument(2)(1);
   de:= derive(e, x);
   sde:= simplify(de);
   create(f, out_file, "d_"&argument(1)&".txt");
   write(f, sde);
   close(f);
end manip_algeb;
The package algebraic captures the essential operations:

with Ada.text_io; use Ada.text_io;
with d_expr;      use d_expr;
package algebraic is
   syntax_error: exception;
   procedure read (f: in file_type; e: out expression);
   procedure write(f: in file_type; e: in expression);
   function derive  (e: in expression; x: in character) return expression;
   function simplify(e: in expression) return expression;
end algebraic;

3.5.6 Building expressions

The procedure read builds the expression from the text file following an algorithm
very similar to the one studied in section 3.6.5 of vol. III to evaluate
arithmetic expressions. That algorithm was made up of three mutually recursive
procedures that dealt with expressions, terms and factors recursively. Each one
provided as a result an output parameter that was the value of that expression,
term or factor respectively.
term or factor respectively.
As the source data for the derivation problem is very much the same, the
structure of the algorithm that reads the expression from the source file must be
similar too. In this case, four mutually recursive procedures are required, because
the addition of the power operator involves one more level in the hierarchy. The
result that each one provides is an output parameter of type expression that
corresponds to the part of the expression read so far.
Only the relevant procedures are reproduced below, because the auxiliary
operations add nothing to the understanding of trees we are addressing in this
chapter.
procedure read(f: in file_type; e: out expression) is
   c: character;
   procedure read_expr   (e: out expression);
   procedure read_term   (e: out expression);
   procedure read_factor (e: out expression);
   procedure read_factor1(e: out expression);
   function scan_id  return un_op;
   function scan_num return integer;

   procedure read_expr(e: out expression) is
      neg_term: boolean;
      op: character;
      e1: expression;
   begin
      neg_term:= c='-'; if neg_term then get(f, c); end if;
      read_term(e);
      if neg_term then e:= b_un_op(neg, e); end if;
      while c='+' or c='-' loop
         op:= c; get(f, c); read_term(e1);
         if op='+' then e:= b_bin_op(add, e, e1);
         else e:= b_bin_op(sub, e, e1);
         end if;
      end loop;
   end read_expr;

   procedure read_term(e: out expression) is
      op: character;
      e1: expression;
   begin
      read_factor(e);
      while c='*' or c='/' loop
         op:= c; get(f, c); read_factor(e1);
         if op='*' then e:= b_bin_op(prod, e, e1);
         else e:= b_bin_op(quot, e, e1);
         end if;
      end loop;
   end read_term;


   procedure read_factor(e: out expression) is
      e1: expression;
   begin
      read_factor1(e);
      if c='^' then
         get(f, c); read_factor1(e1);
         e:= b_bin_op(power, e, e1);
      end if;
   end read_factor;

   procedure read_factor1(e: out expression) is
      t: un_op;
      x: character;
      n: integer;
   begin
      if c='(' then
         get(f, c); read_expr(e);
         if c/=')' then raise syntax_error; end if;
         get(f, c);
      elsif 'a'<=c and c<='z' then
         x:= c;
         t:= scan_id;  -- identify the function name
         if t=neg then -- by convention, the identifier is a variable name
            e:= b_var(x);
         else
            if c/='(' then raise syntax_error; end if;
            get(f, c);
            read_expr(e);
            if c/=')' then raise syntax_error; end if;
            get(f, c);
            e:= b_un_op(t, e);
         end if;
      elsif '0'<=c and c<='9' then
         n:= scan_num;
         e:= b_constant(n);
      else
         raise syntax_error;
      end if;
   end read_factor1;

   function scan_id  return un_op   is ...
   function scan_num return integer is ...

begin
   get(f, c); read_expr(e);
   if c/=';' then raise syntax_error; end if;
exception
   when syntax_error => e:= b_null; raise;
end read;


For all the procedures that read an entity (ie. an expression, a term or a
factor), the precondition is that c is the first character of that entity, and the
postcondition is that c is the first character that does not belong to the
recognized entity; moreover, the output parameter is the internal representation
of the recognized entity.

3.5.7 Deriving expressions

The rules for deriving the expressions proposed in the statement of the problem
are:

   Dx c             = 0                                   (if c is a constant)
   Dx y             = 0                                   (if y /= x)
   Dx x             = 1
   Dx (-f(x))       = -Dx f(x)
   Dx sin(f(x))     = cos(f(x)) * Dx f(x)
   Dx cos(f(x))     = -sin(f(x)) * Dx f(x)
   Dx e^f(x)        = e^f(x) * Dx f(x)
   Dx ln(f(x))      = (1/f(x)) * Dx f(x)
   Dx (f(x) + g(x)) = Dx f(x) + Dx g(x)
   Dx (f(x) - g(x)) = Dx f(x) - Dx g(x)
   Dx (f(x) * g(x)) = (Dx f(x)) g(x) + (Dx g(x)) f(x)
   Dx (f(x) / g(x)) = ((Dx f(x)) g(x) - (Dx g(x)) f(x)) / (g(x))^2
   Dx (f(x))^c      = c (f(x))^(c-1) Dx f(x)              (if c is a constant)
   Dx (f(x))^g(x)   = Dx (e^(g(x) ln(f(x))))

The features of the internal representation of expressions allow these rules
to be directly applied and make the algorithm for derivation straightforward.
Indeed, the rules to apply depend on the kind of expression (ie. constant,
variable, rooted by a unary operator or rooted by a binary operator):
function derive(e: in expression; x: in character) return expression is
   de: expression;
begin
   case e_type(e) is
      when e_null  => de:= b_null;
      when e_const => de:= b_constant(0);
      when e_var   => de:= derive_var(e, x);
      when e_un    => de:= derive_un (e, x);
      when e_bin   => de:= derive_bin(e, x);
   end case;
   return de;
end derive;
where


function derive_var(e: in expression; x: in character) return expression is
   de: expression;
begin
   if g_var(e)=x then de:= b_constant(1);
   else de:= b_constant(0);
   end if;
   return de;
end derive_var;

function derive_un(e: in expression; x: in character) return expression is
   uop: un_op;
   esb, de, desb: expression;
begin
   g_un(e, uop, esb);        -- Get components
   desb:= derive(esb, x);    -- Derive subordinate expression
   case uop is
      when neg =>
         de:= b_un_op(neg, desb);
      when sin =>
         de:= b_un_op(cos, esb);
         de:= b_bin_op(prod, de, desb);
      when cos =>
         de:= b_un_op(sin, esb);
         de:= b_un_op(neg, de);
         de:= b_bin_op(prod, de, desb);
      when exp =>
         de:= b_bin_op(prod, e, desb);
      when ln =>
         de:= b_constant(1);
         de:= b_bin_op(quot, de, esb);
         de:= b_bin_op(prod, de, desb);
   end case;
   return de;
end derive_un;
function derive_bin(e: in expression; x: in character) return expression is
   bop: bin_op;
   esb1, esb2: expression;
   de, desb1, desb2: expression;
   a, b, c, d: expression;
begin
   g_bin(e, bop, esb1, esb2);   -- Get components
   desb1:= derive(esb1, x);     -- Derive subordinate expressions
   desb2:= derive(esb2, x);
   case bop is
      when add =>
         de:= b_bin_op(add, desb1, desb2);
      when sub =>
         de:= b_bin_op(sub, desb1, desb2);

      when prod =>
         a:= b_bin_op(prod, desb1, esb2);
         b:= b_bin_op(prod, desb2, esb1);
         de:= b_bin_op(add, a, b);
      when quot =>
         a:= b_bin_op(prod, desb1, esb2);
         b:= b_bin_op(prod, desb2, esb1);
         c:= b_bin_op(sub, a, b);
         d:= b_constant(2);
         d:= b_bin_op(power, esb2, d);
         de:= b_bin_op(quot, c, d);
      when power =>
         if not contains(esb2, x) then
            a:= b_constant(1);
            a:= b_bin_op(sub, esb2, a);
            a:= b_bin_op(power, esb1, a);
            a:= b_bin_op(prod, esb2, a);
            de:= b_bin_op(prod, a, desb1);
         else  -- f(x)^g(x) = exp(ln(f(x)^g(x))) = exp(g(x)*ln(f(x)))
            a:= b_un_op (ln, esb1);
            a:= b_bin_op(prod, esb2, a);
            a:= b_un_op (exp, a);
            de:= derive(a, x);
         end if;
   end case;
   return de;
end derive_bin;

The auxiliary function contains checks whether the provided variable appears in
an expression, which is important for the choice between the two last formulae:
function contains(e: in expression; x: in character) return boolean is
   cont: boolean;
   uop: un_op;
   bop: bin_op;
   esb, esb1, esb2: expression;
begin
   case e_type(e) is
      when e_null | e_const =>
         cont:= false;
      when e_var =>
         cont:= (g_var(e) = x);
      when e_un =>
         g_un(e, uop, esb);
         cont:= contains(esb, x);
      when e_bin =>
         g_bin(e, bop, esb1, esb2);
         cont:= contains(esb1, x) or else contains(esb2, x);
   end case;
   return cont;
end contains;

3.5.8 Simplifying expressions

The expression resulting from the derivation process should be simplified, but
simplifying an expression is, in principle, a complex matter, because there is no
absolute concept of a normal form, so the equivalent non-reducible form is not
unique. Moreover, some kinds of reduction imply the application of distributive
and commutative properties, which may lead to non-terminating algorithms.
As this exercise only intends to illustrate the usefulness of trees and how
easy it is to use them, only the following elementary rules are applied:

   -(-f(x))    ->  f(x)
   -0          ->  0
   sin(0)      ->  0
   cos(0)      ->  1
   e^0         ->  1
   ln(1)       ->  0
   0 + f(x)    ->  f(x)
   f(x) + 0    ->  f(x)
   c1 + c2     ->  c3        (the computed result)
   0 - f(x)    ->  -f(x)
   f(x) - 0    ->  f(x)
   c1 - c2     ->  c3        (if c1 >= c2)
   c1 - c2     ->  -c3       (if c1 < c2)
   0 * f(x)    ->  0
   f(x) * 0    ->  0
   1 * f(x)    ->  f(x)
   f(x) * 1    ->  f(x)
   c1 * c2     ->  c3        (the computed result)
   0 / f(x)    ->  0
   f(x) / 1    ->  f(x)
   0^f(x)      ->  0
   (f(x))^0    ->  1
   1^f(x)      ->  1
   (f(x))^1    ->  f(x)

The features of the internal representation of expressions allow the application
of these rules to be as direct as the rules for derivation, and the following
algorithm results, which recursively simplifies the subordinate expressions first
and then applies the rules above to the result:
function simplify(e: in expression) return expression is
   s: expression;
begin
   case e_type(e) is
      when e_null | e_const | e_var => s:= e;
      when e_un  => s:= simplify_un (e);
      when e_bin => s:= simplify_bin(e);
   end case;
   return s;
end simplify;


function simplify_un(e: in expression) return expression is
   uop, uops: un_op;
   s, esb, esb2: expression;
begin
   g_un(e, uop, esb);
   esb:= simplify(esb);
   case uop is
      when neg =>
         if e_type(esb) = e_un then
            g_un(esb, uops, esb2);
            if uops=neg then
               s:= esb2;                 -- -(-E) -> E
            else
               s:= e;
            end if;
         elsif e_type(esb)=e_const and then g_const(esb)=0 then
            s:= esb;                     -- -0 -> 0
         else
            s:= b_un_op(uop, esb);
         end if;
      when sin =>
         if e_type(esb)=e_const and then g_const(esb)=0 then
            s:= esb;                     -- sin(0) -> 0
         else
            s:= b_un_op(uop, esb);
         end if;
      when cos =>
         if e_type(esb)=e_const and then g_const(esb)=0 then
            s:= b_constant(1);           -- cos(0) -> 1
         else
            s:= b_un_op(uop, esb);
         end if;
      when exp =>
         if e_type(esb)=e_const and then g_const(esb)=0 then
            s:= b_constant(1);           -- exp(0) -> 1
         else
            s:= b_un_op(uop, esb);
         end if;
      when ln =>
         if e_type(esb)=e_const and then g_const(esb)=1 then
            s:= b_constant(0);           -- ln(1) -> 0
         else
            s:= b_un_op(uop, esb);
         end if;
   end case;
   return s;
end simplify_un;



function simplify_bin(e: in expression) return expression is
   bop: bin_op;
   s, esb1, esb2: expression;
begin
   g_bin(e, bop, esb1, esb2);
   esb1:= simplify(esb1);
   esb2:= simplify(esb2);
   case bop is
      when add   => s:= simplify_add  (esb1, esb2);
      when sub   => s:= simplify_sub  (esb1, esb2);
      when prod  => s:= simplify_prod (esb1, esb2);
      when quot  => s:= simplify_quot (esb1, esb2);
      when power => s:= simplify_power(esb1, esb2);
   end case;
   return s;
end simplify_bin;

function simplify_add(esb1, esb2: in expression) return expression is
   s: expression; n1, n2: integer;
begin
   if e_type(esb1)=e_const and then g_const(esb1)=0 then
      s:= esb2;                                -- 0+E -> E
   elsif e_type(esb2)=e_const and then g_const(esb2)=0 then
      s:= esb1;                                -- E+0 -> E
   elsif e_type(esb1)=e_const and e_type(esb2)=e_const then
      n1:= g_const(esb1); n2:= g_const(esb2);  -- compute addition
      s:= b_constant(n1+n2);
   else
      s:= b_bin_op(add, esb1, esb2);
   end if;
   return s;
end simplify_add;

function simplify_sub(esb1, esb2: in expression) return expression is
   s: expression; n1, n2: integer;
begin
   if e_type(esb2)=e_const and then g_const(esb2)=0 then
      s:= esb1;                                -- E-0 -> E
   elsif e_type(esb1)=e_const and then g_const(esb1)=0 then
      s:= b_un_op(neg, esb2);                  -- 0-E -> -E
   elsif e_type(esb1)=e_const and e_type(esb2)=e_const then
      n1:= g_const(esb1); n2:= g_const(esb2);  -- compute subtraction
      n1:= n1-n2;
      if n1>=0 then s:= b_constant(n1);
      else s:= b_constant(-n1); s:= b_un_op(neg, s);
      end if;
   else
      s:= b_bin_op(sub, esb1, esb2);
   end if;
   return s;
end simplify_sub;


function simplify_prod(esb1, esb2: in expression) return expression is
   s: expression; n1, n2: integer;
begin
   if e_type(esb1)=e_const and then g_const(esb1)=0 then
      s:= esb1;                                -- 0*E -> 0
   elsif e_type(esb2)=e_const and then g_const(esb2)=0 then
      s:= esb2;                                -- E*0 -> 0
   elsif e_type(esb1)=e_const and then g_const(esb1)=1 then
      s:= esb2;                                -- 1*E -> E
   elsif e_type(esb2)=e_const and then g_const(esb2)=1 then
      s:= esb1;                                -- E*1 -> E
   elsif e_type(esb1)=e_const and e_type(esb2)=e_const then
      n1:= g_const(esb1); n2:= g_const(esb2);  -- compute product
      s:= b_constant(n1*n2);
   else
      s:= b_bin_op(prod, esb1, esb2);
   end if;
   return s;
end simplify_prod;

function simplify_quot(esb1, esb2: in expression) return expression is
   s: expression;
begin
   if e_type(esb1)=e_const and then g_const(esb1)=0 then
      s:= esb1;                                -- 0/E -> 0
   elsif e_type(esb2)=e_const and then g_const(esb2)=1 then
      s:= esb1;                                -- E/1 -> E
   elsif esb1=esb2 then
      s:= b_constant(1);                       -- E/E -> 1
   else
      s:= b_bin_op(quot, esb1, esb2);
   end if;
   return s;
end simplify_quot;

function simplify_power(esb1, esb2: in expression) return expression is
   s: expression;
begin
   if e_type(esb1)=e_const and then g_const(esb1)=0 then
      s:= esb1;                                -- 0^E -> 0
   elsif e_type(esb2)=e_const and then g_const(esb2)=0 then
      s:= b_constant(1);                       -- E^0 -> 1
   elsif e_type(esb1)=e_const and then g_const(esb1)=1 then
      s:= esb1;                                -- 1^E -> 1
   elsif e_type(esb2)=e_const and then g_const(esb2)=1 then
      s:= esb1;                                -- E^1 -> E
   else
      s:= b_bin_op(power, esb1, esb2);
   end if;
   return s;
end simplify_power;


Indeed, the simplification algorithm can be more sophisticated than this one.
For instance, this algorithm is unable to convert 7*(-3) into -21, nor to reduce
according to e^(ln x) = x or ln(e^x) = x, but the reader should be able to extend
the algorithm to address such reductions by himself.
In contrast, there is a relevant matter regarding the simplification rules based
on the equality of expressions, such as

   f(x) - f(x)                ->  0
   f(x) / f(x)                ->  1
   sin^2(f(x)) + cos^2(f(x))  ->  1

It is clear that the equality test for the type expression provided by the
compiler will compare the pointers, which is not the appropriate comparison,
because the result will be true only if the two pointers refer to the same tree.
Instead, what is required for simplification purposes is whether two different
trees are algebraically equivalent or not.
There are two ways to prevent a user of the package d_expr from a wrong
use of the equality test provided by the compiler. One is to declare the type
expression as limited private. However, this choice implies that the assignment
is forbidden too, something that would make the derivation and simplification
algorithms quite uncomfortable to read. The second way is to keep the type
expression as (non-limited) private, and override the equality test by adding

   function "="(e1, e2: in expression) return boolean;

in the visible part of d_expr.


However, providing an algorithm to implement this function is not simple.
Algebraic equivalence is an undecidable¹ problem. A more realistic approach
is to restrict the concept of equality to literal equality, ie. two atoms are
equal if their type (constant, variable) is the same and the values they contain
(the constant, the variable) are the same too, and two non-atoms are equal if
the operator is the same and the corresponding subordinate trees are literally
equivalent too. Indeed, such a test has a worst case execution time proportional
to the number of nodes of the expression.
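Under this restricted definition, a sketch of the overriding equality, written
only in terms of the visible operations of d_expr (so it is valid for any
implementation of the type), could be:

function "="(e1, e2: in expression) return boolean is
   uop1, uop2: un_op;
   bop1, bop2: bin_op;
   a1, b1, a2, b2: expression;
begin
   if e_type(e1)/=e_type(e2) then return false; end if;
   case e_type(e1) is
      when e_null  => return true;
      when e_const => return g_const(e1)=g_const(e2);
      when e_var   => return g_var(e1)=g_var(e2);
      when e_un =>
         g_un(e1, uop1, a1); g_un(e2, uop2, a2);
         return uop1=uop2 and then a1=a2;                 -- recursive
      when e_bin =>
         g_bin(e1, bop1, a1, b1); g_bin(e2, bop2, a2, b2);
         return bop1=bop2 and then a1=a2 and then b1=b2;  -- recursive
   end case;
end "=";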
In section 3.5.10 a different implementation of the type expression is provided
that, among other advantages, makes the literal equivalence test O(1).

3.5.9 Writing expressions

The output of an expression is essentially an inorder traversal of the tree. For
binary operators, the left subexpression must be written first, then the operator
and finally the right subexpression. For unary operators, the operator must be
written first and then the subordinate expression.
So the algorithm is quite simple, and the only difficulty is to avoid unnecessary
parentheses. A look at the figure below shows that this decision requires a
detailed case analysis: it has to do with the precedence of operators, the rules
for the left operand are different from the rules for the right operand, and some
unary operators always require parentheses, whereas others do not.
1The concept of undecidability is explained in chapter 11 of vol. II.


[figure: example trees and the corresponding textual forms, showing where
parentheses are needed: a+b*c, a*(b+c), a+b-c, a-(b+c), a+b+c, -a, sin(a)]

The case analysis is deferred to the functions left_parenth_req, right_parenth_req
and parenth_req_un, with the respective meanings that either the left, the right
or the unique operand has to be enclosed in parentheses, and we leave them as
an exercise to the reader (a possible sketch is given after the algorithm). The
resulting algorithm for writing expressions becomes:
procedure writeO(f: in file_type; e: in expression) is
   pr: boolean;
   uop: un_op;
   bop: bin_op;
   esb, esb1, esb2: expression;
begin
   case e_type(e) is
      when e_null =>
         null;
      when e_const =>
         put_const(f, integer'image(g_const(e)));
      when e_var =>
         put(f, g_var(e));
      when e_un =>
         g_un(e, uop, esb);
         put_uop(f, uop);
         pr:= parenth_req_un(e);
         if pr then put(f, '('); end if;
         writeO(f, esb);
         if pr then put(f, ')'); end if;
      when e_bin =>
         g_bin(e, bop, esb1, esb2);
         pr:= left_parenth_req(e);
         if pr then put(f, '('); end if;
         writeO(f, esb1);
         if pr then put(f, ')'); end if;
         put_bop(f, bop);
         pr:= right_parenth_req(e);
         if pr then put(f, '('); end if;
         writeO(f, esb2);
         if pr then put(f, ')'); end if;
   end case;
end writeO;
The procedure writeO, which is recursive, is invoked by write, which just
adds a semicolon at the end.
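Although the case analysis is left as an exercise, a possible sketch for the left
operand is the following, assuming a hypothetical helper priority that ranks the
binary operators; a complete version must also consider unary operands and the
right operand (where equal priorities already require parentheses for sub, quot
and power):

function priority(op: in bin_op) return natural is
begin
   case op is
      when add | sub   => return 1;
      when prod | quot => return 2;
      when power       => return 3;
   end case;
end priority;

function left_parenth_req(e: in expression) return boolean is
   bop, bop1: bin_op;
   esb1, esb2, a, b: expression;
begin
   g_bin(e, bop, esb1, esb2);
   if e_type(esb1)/=e_bin then return false; end if;  -- simplification
   g_bin(esb1, bop1, a, b);
   return priority(bop1) < priority(bop);
end left_parenth_req;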

3.5.10 Expressions: implementation (cursors)

An important amount of memory could be saved if the data structure used to
represent expressions saved common subexpressions only once. For instance, the
expression (a+b)*(a+b) can be stored by a structure in which the root node
has its left and right pointers referring to the same node that represents the
common subexpression, ie.:
[figure: a tree for (a+b)*(a+b) whose root points twice to a single shared
subtree for a+b, instead of a tree with two identical subtrees]
Such an implementation would save 3 of the 7 nodes required by the
tree. Moreover, the computation of derivatives generates a huge number of
common subexpressions, which in turn could be saved only once. For instance,
the derivative of (x+1)^2 is 2(x+1). So, the common subexpression of both
expressions could be stored once too, as shown by the figure below, saving 4 of
10 nodes:

[figure: the trees for (x+1)^2 and its derivative 2(x+1), sharing the subtree
for x+1]

In order to avoid the allocation of space for nodes that already exist, it is
required that the allocation algorithm has easy access to all the currently created
nodes. This can be achieved more easily if the memory space is organized by
means of cursors instead of access types. If so, the request for the allocation
of a new node is preceded by a search for a preexisting node with the same
contents. If found, the pointer to the preexisting node is used instead of actually
allocating a new one.
According to this principle, the new version of the specification part of the
package d_expr is

package d_expr is
   -- Exactly as in section 3.5.4
private
   max: constant integer:= 200;
   type expression is new integer range 1..max;
   subtype index is expression;

   type block(et: expression_type:= e_null) is
      record
         case et is
            when e_null =>
               null;
            when e_const =>
               val: integer;
            when e_var =>
               var: character;
            when e_un =>
               opun: un_op;
               esb:  index;
            when e_bin =>
               opbin: bin_op;
               esb1, esb2: index;
         end case;
      end record;
end d_expr;

The management of the memory space is simple because the blocks are never
released, so the structure only grows. Consequently, a counter with the number
of used blocks is sufficient, and the memory management part of the body of
d_expr looks like:

package body d_expr is

   type expression_space is array(index) of block;
   s:  expression_space;
   ne: index;

   procedure allocate(b: in block; e: out index) is
   begin
      s(ne):= b; e:= 1;
      while b/=s(e) loop e:= e+1; end loop;
      if e=ne then ne:= ne+1; end if;
   end allocate;

   -- Operations that manage expressions.

begin
   ne:= 1;
end d_expr;
The allocation algorithm is a simple linear search that uses the sentinel
technique. The algorithms for building expressions are straightforward too:
function b_null return expression is
   b: block; e: index;
begin
   b:= block'(et => e_null);
   allocate(b, e);
   return e;
end b_null;

function b_constant(n: in integer) return expression is
   b: block; e: index;
begin
   b:= (e_const, n);
   allocate(b, e);
   return e;
end b_constant;


function b_var(x: in character) return expression is
   b: block; e: index;
begin
   b:= (e_var, x);
   allocate(b, e);
   return e;
end b_var;

function b_un_op(op: in un_op; esb: in expression) return expression is
   b: block; e: index;
begin
   b:= (e_un, op, esb);
   allocate(b, e);
   return e;
end b_un_op;

function b_bin_op(op: in bin_op; e1, e2: in expression) return expression is
   b: block; e: index;
begin
   b:= (e_bin, op, e1, e2);
   allocate(b, e);
   return e;
end b_bin_op;
And the operations for getting the components of an expression are
straightforward as well:

function e_type(e: in expression) return expression_type is
begin
   return s(e).et;
end e_type;

function g_const(e: in expression) return integer is
begin
   return s(e).val;
end g_const;

function g_var(e: in expression) return character is
begin
   return s(e).var;
end g_var;

procedure g_un(e: in expression; op: out un_op; esb: out expression) is
begin
   op:=  s(e).opun;
   esb:= s(e).esb;
end g_un;


procedure g_bin
   (e: in expression; op: out bin_op; esb1, esb2: out expression) is
begin
   op:=   s(e).opbin;
   esb1:= s(e).esb1;
   esb2:= s(e).esb2;
end g_bin;

In addition to memory savings, this implementation has the advantage that
two expressions are literally equivalent if and only if the pointers to their roots
are equal. So, the literal equivalence test, which in the previous implementation
required O(n) time (n being the number of nodes of the trees to be compared),
is now O(1). Certainly, there is a price for this improvement: the allocation of
a new node, which in the previous implementation was O(1), is now O(n), n
being the total number of existing nodes among all the created trees. However,
this poor performance can be improved to O(1) using the techniques presented
in section 4.11, as shown in section 7.2, where the approach that has been
presented here is revisited.
Indeed, the package would be more flexible if the constant that determines
the size of the memory space were a generic parameter. We leave this conversion
as an exercise to the reader.
A solution that avoids the allocation of repeated nodes and allows the
equivalence test to be O(1) can also be achieved using access types, provided
that all the allocated nodes are linked as a list. But that approach has two
drawbacks. First, it requires an additional pointer per node. Second, the
execution time of the search is necessarily linear, because the techniques
described in section 4.11 can not be applied.

3.5.11 Final remarks

Several lessons can be drawn from the development of this long exercise. First,
we have seen the usefulness of trees to represent and manage hierarchical data
such as arithmetic expressions. Trees can be used to manage any other kind of
hierarchical data in a similar way, such as the syntax trees of programming
languages, which are widely used by compilers, decision trees used in artificial
intelligence, file directories used by operating systems, and component hierarchies
managed by graphical user interfaces.
Second, we have seen that any hierarchical structure can be represented
either by a binary tree or by a specifically tailored tree, the latter choice having
several advantages from the point of view of readability and strong type checking
by the compiler.
Third, neither the use of a hierarchy nor the implementation of a tree-structured
data type has special difficulties.
And fourth, a clear separation between specification and implementation
allows us to safely improve the implementation of a type. For instance,
changing the implementation of the type expression from that provided in
section 3.5.4 to that provided in section 3.5.10 does not require the update of


package algebraic by even a single comma. The same will happen if the
implementation is further improved by using the techniques described in
section 4.11 to make the building operations O(1). This principle is essential
to make high quality software that is easy to maintain, adapt and improve.
The fact that some of the subtrees are shared makes the implementation of
expressions look like directed acyclic graphs, a structure that will be developed
in chapter 6. In practice, however, they are actually trees, to the extent that an
abstract data type is characterized by its external operations. In the case of
expressions, the traversals for deriving or for writing imply a complete traversal
of all the branches, as for a tree, even if some of them have already been
traversed as a consequence of being shared. Instead, the algorithms for the
traversal of a graph do not visit any node twice.

3.6 Other tree-like structures

There are other ways of structuring data that share some aspects with trees.
For instance, a binary tree can be stored in an array on a positional basis: the
root is at position 1, and the left and right children of the node at position i
are placed at positions 2i and 2i+1 respectively:

[figure: a binary tree stored positionally in an array]
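The index arithmetic is the whole linking mechanism; a minimal sketch of the
position functions:

function left(i: in positive) return positive is
begin
   return 2*i;
end left;

function right(i: in positive) return positive is
begin
   return 2*i + 1;
end right;

function parent(i: in positive) return positive is
begin
   return i/2;   -- integer division; the root (i=1) has no parent
end parent;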

In this case, given a node it is easy and efficient to retrieve both its children
and its parent. Instead, the approach is ill suited to building trees according to
the operations specified in section 6.2.7 of vol. III (see also section 3.3 of this
vol.). Moreover, it leads to an unaffordable waste of space unless the tree is
completely balanced. However, such arrays are best suited to implementing
priority queues, as will be shown in section 5.2.
Another possibility, which can be used for the recording of ordinary trees, is
to keep the nodes in an array, with a pointer (ie. an index) to the parent:

[figure: an ordinary tree stored in an array, each node holding the index of its parent]
In this case, it is easy and efficient to follow the paths from one node towards
the root, but the structure is ill suited to retrieving the children of a node.
In either case, strictly speaking, these structures can not be considered
implementations of trees, to the extent that the operations they provide are
different from those that specify the type tree (binary or ordinary). Instead,
these structures appear as solutions for the implementation of data types, such
as priority queues or equivalence relations, that in principle have nothing to
do with a hierarchy of dependencies. They will be studied with respect to the
problems they can be applied to in chapter 5. Indeed, some general properties
of trees, in particular the relationship between the height and the total number
of nodes, will be useful to discuss the cost of the involved operations.


Chapter 4

SETS AND MAPPINGS

4.1 Introduction: sets, dictionaries and mappings

The set is a fundamental concept in mathematics. Also in computer science,
the concept of set captures the essential aspects of data management. As an
abstract data type, the set is characterized by two constructor operations, one
to build the empty set and another to add an element to a set, plus a consult
operation to check whether an element is in the set. Additionally, we may
require operations to remove an element from a set and to perform the usual
operations of union, intersection and difference. An algebraic specification of
the type set is provided in section 6.2.8 of vol. III.
A mapping is a correspondence between two sets, in which each element of
the first set has a unique associated value of the second set. As an abstract
data type, a mapping can be considered as a set whose elements are pairs of
values. Henceforth, the type of the first component will be called key and the
type of the second component will be called item. The operations for building
a mapping are empty, which builds a mapping having no correspondence at all,
and put, which adds a correspondence (ie. an association between a key and an
item) to a given mapping. Naturally, mappings have an additional operation to
get the item associated to a given key. Additionally, they may have an operation
to remove an association from a mapping and another operation to update the
item associated to a key.
A dictionary is a type that has all the operations of a set except union,
intersection and difference.
Sets, mappings and dictionaries are consequently very similar concepts, and
the techniques for implementing them have only slight differences. Regarding
the operation to add an element, the specification for sets usually requires that
if x ∈ A then adding x again to A leaves A unchanged, as in this case
A ∪ {x} = A. Likewise, the attempt to remove an element that is not contained
in a set leaves the set unchanged too, as if x ∉ A then A - {x} = A. Instead,
the attempt to add a second association to a key in a mapping must be dealt
with as an error, because only one association is allowed for each key value.
Regarding the consult operation, for sets it returns a boolean indicating whether
the element is in the set or not, whereas for a mapping it returns the item
associated or


a warning if no item is currently associated to that key value in the mapping.
Likewise, the attempt to remove the data associated to a non-recorded key from
a mapping should issue a warning, because it is usually the result of wrong
input data.
Many algorithms in mathematics and compiling have to deal with sets. The
algorithms involved in the tools for automatic generation of lexical and syntactic
analyzers for compilers make extensive use of sets and mappings. Name tables
and symbol tables in compilers are mappings. The databases of words that word
processors use for spell checking are dictionaries. Tables, relations and indexes
in database systems are mappings.
There are many techniques to implement sets and mappings, and a professional
programmer must be familiar with all of them in order to choose the most
appropriate for each particular set of requirements, which essentially depends
on the types of the keys, the relative frequencies of the operations, and whether
the implementation is intended for primary or secondary storage.
This chapter describes the techniques to implement sets and mappings in
a unified way. First, the techniques are presented regardless of the operations
union, intersection and difference. The way to extend the presented techniques
to implement these latter operations is discussed in a unified way at the end of
the chapter.
To avoid boring redundancies, only the technique that is presented first is
developed both for sets and mappings. The other techniques are developed only
for mappings. The changes required to adapt them to implement sets consist
of storing only the keys, and they are left as an exercise to the reader.

4.2 Specification

The specification part of a package that implements the type set (actually a
dictionary, as we do not address union, intersection and difference for the
present) is:

generic
   type elem is private;
package d_set is
   type set is limited private;
   space_overflow: exception;
   procedure empty   (s: out set);
   procedure put     (s: in out set; x: in elem);
   function  is_in   (s: in set; x: in elem) return boolean;
   procedure remove  (s: in out set; x: in elem);
   function  is_empty(s: in set) return boolean;
private
   -- Implementation dependent.
end d_set;
Some applications require an additional operation to select an arbitrary element
from a non-empty set. This kind of operation is discussed in section 4.12.
For a mapping, the specification part of the package is:


generic
   type key is private;
   type item is private;
package d_set is
   type set is limited private;
   already_exists: exception;
   does_not_exist: exception;
   space_overflow: exception;
   procedure empty (s: out set);
   procedure put   (s: in out set; k: in key; x: in item);
   procedure get   (s: in set; k: in key; x: out item);
   procedure remove(s: in out set; k: in key);
   procedure update(s: in out set; k: in key; x: in item);
private
   -- Implementation dependent.
end d_set;

4.3 Arrays indexed by keys

4.3.1 Sets

When the elements of a set belong to a discrete type, and the sets are expected
to contain almost as many elements as there are values in its range, the most
practical representation of a set is an array of boolean components in which
a(x) = true means that x ∈ A.
These conditions are usual for algorithms that manage sets of states from
finite automata, such as those for the automatic generation of lexical analyzers, and
also for algorithms that manage sets of (numbered) variables in the code optimization part of a compiler.
Unless otherwise stated, a compiler will use more than one bit to store
a boolean value, usually a byte or two, depending on the addressing modes
of the underlying computer. However, if memory saving is more important
than execution time, we can force the compiler to store the array of boolean
components on the basis of a single bit per component using the pragma pack.
This pragma forces the compiler to use the minimum possible memory space
to allocate a type. Indeed, if this pragma is used, the code to access a single
component of the array will be slower, but still O(1). According to this principle,
the specification part of the package becomes:
generic
   type elem is (<>); -- Restricts elem to be a discrete type.
package d_set is
   -- Exactly as in section 4.2 (sets).
private
   type set is array(elem) of boolean;
   pragma pack(set);
end d_set;
The implementation of the operations is simple too:


procedure empty(s: out set) is
begin
   for x in elem loop s(x):= false; end loop;
end empty;

procedure put(s: in out set; x: in elem) is
begin
   s(x):= true;
end put;

procedure remove(s: in out set; x: in elem) is
begin
   s(x):= false;
end remove;

function is_in(s: in set; x: in elem) return boolean is
begin
   return s(x);
end is_in;

function is_empty(s: in set) return boolean is
   x: elem;
begin
   x:= elem'first;
   while not s(x) and x<elem'last loop x:= elem'succ(x); end loop;
   return not s(x);
end is_empty;
The operations empty and is_empty are both O(n), where n is the number of
possible values of the type elem. The other operations are all O(1). It would be
quite reasonable to add a pragma inline for all the operations that are O(1)
to make them several times faster.
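As an illustration, a hypothetical instantiation of this package for a small
discrete type might look as follows (the type day and the names below are
ours, not from the text):

   declare
      type day is (mon, tue, wed, thu, fri, sat, sun);
      package day_sets is new d_set(elem => day);
      s: day_sets.set;
   begin
      day_sets.empty(s);
      day_sets.put(s, wed);
      day_sets.put(s, fri);
      if day_sets.is_in(s, wed) then
         day_sets.remove(s, wed); -- s now contains only fri
      end if;
   end;

With pragma pack, s needs only seven bits here, so it is likely to occupy a
single byte.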

4.3.2 Mappings

If additional data has to be associated to the elements stored in the set, the
representation of section 4.3.1 has to be extended with an array of items:

generic
   type key is (<>); -- Restricts key to be a discrete type.
   type item is private;
package d_set is
   -- Exactly as in section 4.2 (mappings).
private
   type existence is array(key) of boolean;
   type contents  is array(key) of item;
   type set is
      record
         e: existence;
         c: contents;
      end record;
end d_set;


The implementation of the operations is trivial too:


procedure empty(s: out set) is
   e: existence renames s.e;
begin
   for k in key loop e(k):= false; end loop;
end empty;

procedure put(s: in out set; k: in key; x: in item) is
   e: existence renames s.e;
   c: contents renames s.c;
begin
   if e(k) then raise already_exists; end if;
   e(k):= true; c(k):= x;
end put;

procedure update(s: in out set; k: in key; x: in item) is
   e: existence renames s.e;
   c: contents renames s.c;
begin
   if not e(k) then raise does_not_exist; end if;
   c(k):= x;
end update;

procedure get(s: in set; k: in key; x: out item) is
   e: existence renames s.e;
   c: contents renames s.c;
begin
   if not e(k) then raise does_not_exist; end if;
   x:= c(k);
end get;

procedure remove(s: in out set; k: in key) is
   e: existence renames s.e;
begin
   if not e(k) then raise does_not_exist; end if;
   e(k):= false;
end remove;
All the operations are O(1) except empty, which is O(n), where n is the number
of possible values of the type key. The application of pragma inline to all the
operations that are O(1) is also appropriate in this case.

4.4 Arrays of elements

4.4.1 Overview

It may happen that the range of the keys is much larger than the actual number
of elements to be stored in the sets; for instance, when the keys are passport
numbers and the sets have to store basketball teams or to keep data about their
members. In those cases, the implementation described in section 4.3 leads to
an unaffordable waste of memory space.
Then, it may be preferable to keep the data in an array whose size is
bounded by the maximum number of elements to be stored, not by the range
of key values, which is expectedly much larger, as described by the figure:

(Figure: an array of max components, of which only the first n are occupied.)

where n is the number of elements currently in the set and max is the maximum
number of elements that a set may contain.
It may happen that the type of the keys has an ordering relationship.
If so, the elements can be stored in ascending order, providing faster means for
retrieval.

4.4.2 Unsorted arrays

The implementation of the operations for unsorted arrays is so simple that it
is left as an exercise to the reader (a sketch follows below). It is clear that
empty is O(1), put is O(n) because it has to be checked that a key is not
inserted twice, get and update are O(n) because only linear search is possible,
and remove is O(n) too, although it is more time consuming because first the
location of the element to be removed has to be found by linear search; then,
the element at position n may be placed at the released location.
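As a hint of what the exercise might look like, here is a minimal sketch; the
representation (a record with an array a of key/item pairs and a counter n,
bounded by a generic constant max) is our assumption, not fixed by the text:

   type pair is record k: key; x: item; end record;
   type table is array(positive range 1..max) of pair;
   type set is record a: table; n: natural:= 0; end record;

   procedure put(s: in out set; k: in key; x: in item) is
   begin
      for i in 1..s.n loop                  -- linear scan to reject duplicates
         if s.a(i).k=k then raise already_exists; end if;
      end loop;
      if s.n=max then raise space_overflow; end if;
      s.n:= s.n+1; s.a(s.n):= (k, x);       -- append at the end
   end put;

   procedure remove(s: in out set; k: in key) is
   begin
      for i in 1..s.n loop
         if s.a(i).k=k then
            s.a(i):= s.a(s.n); s.n:= s.n-1; -- the last element fills the hole
            return;
         end if;
      end loop;
      raise does_not_exist;
   end remove;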

4.4.3 Sorted arrays

The bodies of the operations for sorted arrays are also simple, and they are left
as an exercise to the reader too (a sketch of get follows below). It is clear that
empty is O(1). put is O(n) and more time consuming than in the unsorted
version: after checking that there is no element in the set with the same key
and locating the position for the insertion, which takes O(log n) time if binary
search is used (see section 5.3.2 of vol. II), all the elements having a key greater
than that of the inserted element have to be shifted right. The operations get
and update are O(log n), because the position of the element can be found by
binary search, and remove is O(n) too, because the position of the element to
be removed can be located by binary search, but then the elements with keys
greater than that of the removed element have to be shifted left.
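A possible get based on binary search, under the same assumed representation
as in the previous sketch, plus a generic "<" on keys (both are our assumptions):

   procedure get(s: in set; k: in key; x: out item) is
      lo:  natural:= 1;
      hi:  natural:= s.n;
      mid: natural;
   begin
      while lo<=hi loop
         mid:= (lo+hi)/2;
         if    s.a(mid).k<k then lo:= mid+1; -- discard the left half
         elsif k<s.a(mid).k then hi:= mid-1; -- discard the right half
         else  x:= s.a(mid).x; return;       -- found
         end if;
      end loop;
      raise does_not_exist;
   end get;

Each iteration halves the candidate range, hence the O(log n) bound quoted
above.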

4.5 Linked lists

4.5.1 Overview

The implementation of mappings by means of arrays has two drawbacks. First,
a maximum size has to be foreseen, which may be either too constraining or
too wasteful. Second, when the size of the data associated to a key is large,
the operations that require shifting elements right or left are unacceptably time
consuming. An alternative approach is linked lists, as shown by the figure:


(Figure: a linked list of cells reached from the pointer first.)

This kind of implementation does not require foreseeing any upper bound for
the size of a set, and each set occupies no more memory space than is required.
As in the case of arrays, it may happen that the type of the keys has an
ordering relationship. If so, the elements can be stored in sorted order too
(usually ascending) but, as binary search cannot be applied, no matter whether
the list is sorted or not, empty is O(1) and the other operations are O(n). Nevertheless,
a sorted implementation may be convenient to make the operations of
union, intersection and difference efficient (see section 4.13), or simply
to retrieve an ordered listing of the elements (see section 4.12).
The strong similarity between sorted and unsorted lists makes it pointless to
describe both in detail. So, for unsorted lists only the basic hints are provided.

4.5.2 Unsorted lists

The insertion of a new element into an unsorted list is most easily done at
the beginning of the list, like pushing an element on top of a stack (see
section 2.2.3). Naturally, the list must be scanned first to avoid duplicated
keys, as in the sketch below. A linear scan is also required for the operations
get, remove and update.
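A minimal sketch of such a put, assuming the same cell/pcell declarations as
in section 4.5.3 below (the use of a qualified allocator is our choice):

   procedure put(s: in out set; k: in key; x: in item) is
      first: pcell renames s.first;
      p: pcell;
   begin
      p:= first;
      while p/=null loop                 -- scan to reject duplicated keys
         if p.k=k then raise already_exists; end if;
         p:= p.next;
      end loop;
      first:= new cell'(k, x, first);    -- link the new cell at the front
   exception
      when storage_error => raise space_overflow;
   end put;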

4.5.3 Sorted lists

The type declarations to implement mappings by means of sorted lists are:

generic
   type key is private;
   type item is private;
   with function "<" (k1, k2: in key) return boolean;
package d_set is
   -- Exactly as in section 4.2 (mappings).
private
   type cell;
   type pcell is access cell;
   type cell is
      record
         k:    key;
         x:    item;
         next: pcell;
      end record;
   type set is
      record
         first: pcell;
      end record;
end d_set;
The operation to make an empty set is trivial:


procedure empty(s: out set) is
   first: pcell renames s.first;
begin
   first:= null;
end empty;
The operation to add an element to the set works in three phases. First, it
checks that no other element with the same key is in the set and locates the
place for the insertion. To do so, it scans the list until either a key that is
greater than or equal to k is found or the end of the list is reached. In this scan,
the cell previous to the one being visited is recorded in pp (the pointer previous
to p). Second, a new cell is created and the provided key and item are stored in it.
Finally, the created cell is linked at its appropriate place, which is after the cell
pointed to by pp and before the cell pointed to by p. Indeed, if pp=null the
insertion has to be done at the beginning of the list:
procedure put(s: in out set; k: in key; x: in item) is
   first: pcell renames s.first;
   pp, p, q: pcell;
begin
   pp:= null; p:= first;
   while p/=null and then p.k<k loop pp:= p; p:= p.next; end loop;
   if p/=null and then p.k=k then raise already_exists; end if;
   q:= new cell; q.all:= (k, x, p);
   if pp=null then first:= q; else pp.next:= q; end if;
exception
   when storage_error => raise space_overflow;
end put;

After the first phase, the state is:

(Figure: pp points to the last cell with key k' and p to the first cell with key k'', with k' < k < k''.)

After the allocation and linking phases, the state becomes:

(Figure: the new cell q, containing k, is linked between the cells pointed to by pp and p.)

The variables p, pp and q, being local, vanish on completion of the procedure.
The algorithms to consult and update items are straightforward and self-explanatory:


procedure get(s: in set; k: in key; x: out item) is
   first: pcell renames s.first;
   p:     pcell;
begin
   p:= first;
   while p/=null and then p.k<k loop p:= p.next; end loop;
   if p=null or else p.k/=k then raise does_not_exist; end if;
   x:= p.x;
end get;

procedure update(s: in out set; k: in key; x: in item) is
   first: pcell renames s.first;
   p:     pcell;
begin
   p:= first;
   while p/=null and then p.k<k loop p:= p.next; end loop;
   if p=null or else p.k/=k then raise does_not_exist; end if;
   p.x:= x;
end update;
The algorithm to delete an element from the set has two phases: one to
locate the element to be removed and one to change the links. As in the
algorithm for insertion, pp points to the cell previous to the one pointed to by p:
procedure remove(s: in out set; k: in key) is
   first: pcell renames s.first;
   pp, p: pcell;
begin
   pp:= null; p:= first;
   while p/=null and then p.k<k loop pp:= p; p:= p.next; end loop;
   if p=null or else p.k/=k then raise does_not_exist; end if;
   if pp=null then first:= p.next; else pp.next:= p.next; end if;
end remove;
After the first phase, the state is:

(Figure: pp points to the cell previous to the one containing k, which is pointed to by p.)

and after pp.next:= p.next; the state becomes:

(Figure: the cell containing k is no longer linked from the list.)

On return from the procedure, the local variables p and pp vanish. Consequently,
the cell containing key k is no longer accessible and becomes garbage,
ie. the space it occupies may be reused.
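In a run-time system that does not collect garbage, the removed cell could be
reclaimed explicitly with Ada's standard generic Ada.Unchecked_Deallocation;
a sketch of the idea (our addition, not part of the text's model):

   with Ada.Unchecked_Deallocation;
   -- inside the body of d_set:
   procedure free is new Ada.Unchecked_Deallocation(cell, pcell);

   -- the unlinking step of remove would then become:
   --    q:= p;  -- keep a handle on the cell being removed
   --    if pp=null then first:= p.next; else pp.next:= p.next; end if;
   --    free(q); -- explicitly release its space

Care must be taken that no other pointer still designates the freed cell.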


4.6 Binary search trees

Overview

Linked lists have the advantage over arrays that they do not require foreseeing a
size that may become either too constraining or too wasteful. Moreover, sorted
linked lists have the advantage over sorted arrays that the stored elements do
not have to be shifted right or left on insertions and removals, something that can
be very time consuming when the size of the type item is large. Nevertheless,
sorted linked lists have the drawback, in contrast with sorted arrays, that only
linear search is possible and, in any case (linked lists or arrays, sorted or not),
the operations of insertion and removal take O(n) time, which is unacceptable
for large sets.
Binary search trees are an organization technique that can be applied to
keys that have a defined order relation. They provide the same flexibility as
linked lists from the point of view of sizing constraints, but insertion and
removal can be much more efficient, and searching to retrieve or to update the
stored data can be as efficient as for sorted arrays.
The principle is simple: the set is represented by a binary tree in which
each node contains one stored element (in the case of mappings, a key plus
its associated item). Any node p of the tree satisfies the property that all the
elements stored in its left child have a key lower than p.k and all the elements
in its right child have a key greater than p.k.
If the data are organized this way, the search can be very efficient because
either the key we are looking for is equal to that of the root or we can discard
one of the two children, and so on recursively.
The type declarations

The generic and private parts of package d_set to implement mappings by means
of binary search trees are:

generic
   type key is private;
   type item is private;
   with function "<" (k1, k2: in key) return boolean;
   with function ">" (k1, k2: in key) return boolean;
package d_set is
   -- Exactly as in section 4.2 (mappings).
private
   type node;
   type pnode is access node;


type node is
   record
      k:      key;
      x:      item;
      lc, rc: pnode;
   end record;
type set is
   record
      root: pnode;
   end record;
end d_set;

In principle, only the function "<" should be required because, as the order
relationship is assumed to be total, ">" is equivalent to not "less
than" and not equal. However, if the function ">" is used too, the inherent
symmetry of the algorithms becomes textually reflected, which improves their
readability.

Empty

The empty set is trivially represented by an empty tree:

procedure empty(s: out set) is
   root: pnode renames s.root;
begin
   root:= null;
end empty;

Put

The addition of a new element to the set has two cases. The first case is that the
tree (or subtree) is empty, in which case a single-node tree (subtree) is allocated.
The second case is that the tree (or subtree) is non-empty. Then three subcases
appear depending on whether the key to be inserted is equal to, greater than
or lower than the key at the root of the tree (or subtree). If equal, the attempt
at insertion creates a conflict. If lower, the new element clearly has to be inserted
in the left subtree, and if greater, in the right subtree.
Indeed, inside the package body of d_set the algorithms must see the pointers
explicitly; for this reason, the procedure put that is externally visible and receives
a set invokes a local procedure that receives a pointer to do the job. One way
to solve the potential inefficiency that may derive from this duality is to apply
the pragma inline to the externally visible procedure (after the renaming of
the internal one). A better approach is mentioned below.
procedure put(s: in out set; k: in key; x: in item) is
   root: pnode renames s.root;
begin
   put(root, k, x);
end put;

where


procedure put(p: in out pnode; k: in key; x: in item) is
begin
   if p=null then
      p:= new node; p.all:= (k, x, null, null);
   else
      if    k<p.k then put(p.lc, k, x);
      elsif k>p.k then put(p.rc, k, x);
      else  raise already_exists;
      end if;
   end if;
exception
   when storage_error => raise space_overflow;
end put;
A note on recursion elimination

This algorithm is simply recursive and, consequently, it can be easily converted
into an iterative one according to the techniques described in section 3.5 of
vol. III. In addition to improving its efficiency, such a transformation makes
the duality of the two put operations no longer necessary (see the sketch below).
In this case, as in any of the remaining recursive algorithms that may appear
in this volume, we assume that at this stage the reader is skilled enough to do
these transformations by himself. Consequently, whenever the recursive version
of an algorithm is easier to understand and discuss, only that version will be
developed and analyzed.
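As an illustration, one possible iterative transformation of put (our sketch,
one of several reasonable ways to do it):

   procedure put(s: in out set; k: in key; x: in item) is
      p: pnode;
   begin
      if s.root=null then
         s.root:= new node'(k, x, null, null);
      else
         p:= s.root;
         loop
            if k<p.k then
               if p.lc=null then p.lc:= new node'(k, x, null, null); exit; end if;
               p:= p.lc;
            elsif k>p.k then
               if p.rc=null then p.rc:= new node'(k, x, null, null); exit; end if;
               p:= p.rc;
            else
               raise already_exists;
            end if;
         end loop;
      end if;
   exception
      when storage_error => raise space_overflow;
   end put;

The descent stops at the empty link where the new node must be attached,
so no auxiliary recursion (and no internal duplicate of put) is needed.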
Get and update
The procedure get is straightforward:

procedure get(s: in set; k: in key; x: out item) is
   root: pnode renames s.root;
begin
   get(root, k, x);
end get;
where
procedure get(s: in pnode; k: in key; x: out item) is
begin
   if s=null then
      raise does_not_exist;
   else
      if    k<s.k then get(s.lc, k, x);
      elsif k>s.k then get(s.rc, k, x);
      else  x:= s.x;
      end if;
   end if;
end get;
The procedure update is just the same, but replacing x:= s.x; by s.x:= x;
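For completeness, the body that results (a direct transcription of the rule just
stated):

   procedure update(s: in pnode; k: in key; x: in item) is
   begin
      if s=null then
         raise does_not_exist;
      else
         if    k<s.k then update(s.lc, k, x);
         elsif k>s.k then update(s.rc, k, x);
         else  s.x:= x;
         end if;
      end if;
   end update;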


Remove

The elimination of a node from a binary tree requires a more careful analysis.
First, the node containing the key to be eliminated has to be found, which is
as easy as for the operations get and update. Once the node has been found, if
it is a leaf the elimination is quite simple: it is sufficient to replace the pointer
currently addressing that node by null.

If the node to be deleted (ie. having key k) has only one child, the deletion
is simple too. If it is a left child of a parent with key kp, then any node of
k's unique child (either left or right) has a key lower than kp. Therefore, this
only child can become the new left child of the node having key kp. If the node
to be deleted is a right child, the discussion is symmetrical. In either case, the
pointer currently addressing the node having key k has to be updated to address
the only child of the node to be deleted.
If the node to be deleted has two children, the situation is more complicated
because some node must still be in that position to keep the tree's structure.
Therefore, it has to be replaced by some other node. There are two candidates:
the leftmost node of its right child and the rightmost node of its left child.
These candidates have two properties. First, they have at most one child, and
so they can be easily dropped off. And second, they are greater than any other
element in the left subtree of the node to be removed and smaller than any other
element in the right subtree of the node to be removed. Being symmetric, there
is no preference for either of the two choices and we shall choose, arbitrarily, the
leftmost element of the right subtree.
The deletion in this case has two phases. First, the leftmost element of the
right subtree is dropped off and reserved. Second, this node replaces the node
to be deleted. This can be done either by copying its contents and keeping the
links unchanged or by changing only the links. In principle, the second option
is preferable because the size of a pointer is usually much smaller than the size
of the stored data.


The algorithm that results from these considerations is as follows: first,
the element to be deleted has to be searched for, in the same way as for the
operations get and update. Once it has been found, it has to be actually removed:
procedure remove(p: in out pnode; k: in key) is
begin
   if p=null then raise does_not_exist; end if;
   if    k<p.k then remove(p.lc, k);
   elsif k>p.k then remove(p.rc, k);
   else  actual_remove(p);
   end if;
end remove;

The actual remove considers the three cases described above: the node is a leaf,
it has only one child (which can be either left or right), or it has two children. In
the latter case, the leftmost node of the right child is removed by rem_lowest.
On return from this call, the variable plowest points to the node just removed
from the right child. Then the pointers are adjusted so that this node replaces
the node to be deleted in the tree structure.
procedure actual_remove(p: in out pnode) is
   -- Prec.: k=p.k
   plowest: pnode;
begin
   if    p.lc =null and p.rc =null then
      p:= null;
   elsif p.lc =null and p.rc/=null then
      p:= p.rc;
   elsif p.lc/=null and p.rc =null then
      p:= p.lc;
   else -- p.lc/=null and p.rc/=null
      rem_lowest(p.rc, plowest);
      plowest.lc:= p.lc; plowest.rc:= p.rc; p:= plowest;
   end if;
end actual_remove;
Indeed, if p.lc=null then p:= p.rc; works whether p.rc is null or not.
So, the previous algorithm can be slightly optimized by evaluating two boolean
expressions instead of three. We have maintained the four cases explicitly to
keep an exact correspondence between the case analysis and the algorithm, in
the hope of making it easier for the reader to understand. Once the ideas are
clear, the reader can make this kind of improvement by himself.
The removal of the leftmost node of a tree is as simple as:
procedure rem_lowest(p: in out pnode; plowest: out pnode) is
   -- Prec.: p/=null
begin
   if p.lc/=null then rem_lowest(p.lc, plowest);
   else plowest:= p; p:= p.rc;
   end if;
end rem_lowest;
The cost of the operations

If the elements are inserted in the set at random, we may guess that the tree
will grow roughly balanced. If so, as studied in section 5.4.4 of vol. III, a
complete binary tree of height h has n = 2^(h+1) − 1 nodes and, consequently,
h = log2(n + 1) − 1. With the exception of empty, which is clearly O(1), the
other operations follow a path starting at the root and ending, in the worst
case, on a leaf. Therefore, the total number of steps of each of these operations
is bounded by the height of the tree and their execution time is O(log n), where n
is the total number of elements in the set.
Unfortunately, there are some insertion orderings that make the binary tree
grow extremely unbalanced. The figure below shows some orderings of
insertion and the trees that result in each case:

(Figure: the degenerate trees produced by the insertion orders 1,2,3,4,5; 5,4,3,2,1; 1,5,2,4,3 and 5,1,2,4,3 — each of them a chain of height 4.)

For these degraded cases, the binary search trees work as badly as unsorted
linked lists: all the operations except empty are O(n).
It is easy to prove that the number of input orderings leading to these degraded
cases is 2^(n−1), whereas the total number of possible orderings is n!. As
lim(n→∞) 2^(n−1)/n! = 0, we might expect the problem not to be serious for large
values of n. Nevertheless, the fact that some input orderings lead to a degraded
tree is particularly worrying because, in principle, we may not make any assumption
about them. The next two sections are devoted to improvements of
binary search trees that avoid such degraded cases.

4.7 Balanced trees I: AVL

Overview

A binary search tree is said to be completely balanced if, for each node, the
numbers of nodes of its left and right subtrees differ at most by 1. Unfortunately,
there are no practical algorithms to build completely balanced trees. However,
we can achieve the same efficiency, in asymptotic terms, by weakening the balancing
condition a little bit: a binary search tree is said to be AVL-balanced¹ if for
every node the heights of its left and right subtrees differ at most by 1.

The type declaration

The algorithms that make a tree grow and shrink keeping the AVL-balancing
condition require the nodes to have an additional field, named bl (for balancing
factor). This field keeps the difference between the heights of the right and left
subtrees of that node. Indeed, the value of bl is always in the range [−1,+1]. If
positive, it means that the right subtree is higher than its left sibling.
Accordingly, the specification part of the package d_set for a balanced tree
implementation is:
generic
   -- Exactly as in section 4.6.
package d_set is
   -- Exactly as in section 4.2 (mappings).
private
   type node;
   type pnode is access node;
   type balance_factor is new integer range -1..1;
   type node is
      record
         k:      key;
         x:      item;
         bl:     balance_factor;
         lc, rc: pnode;
      end record;
   type set is
      record
         root: pnode;
      end record;
end d_set;

Empty, Get and Update


These operations are exactly the same as for simple binary search trees.

Put

As for simple binary search trees, the operation put is implemented by means of
an internal one. However, in this case, the internal put operation has an additional
out boolean parameter h, which indicates that, as a result of the insertion, the
height of the tree has increased (by one, indeed).
¹The prefix AVL comes from the initials of the inventors of this technique: Adelson-Velskii and Landis.


procedure put(s: in out set; k: in key; x: in item) is
   root: pnode renames s.root;
   h:    boolean;
begin
   put(root, k, x, h);
end put;

The actual insertion algorithm is essentially the same as for binary search
trees. The differences are that h is set to true after converting an empty tree
into a single-node tree (because the previous height was 0 and the new one is 1),
and that, on return from the insertion on a subtree, if the subtree's height has
increased, balancing must be considered.
procedure put(p: in out pnode; k: in key; x: in item; h: out boolean) is
begin
   if p=null then
      p:= new node; p.all:= (k, x, 0, null, null);
      h:= true;
   else
      if k<p.k then
         put(p.lc, k, x, h);
         if h then balance_left (p, h, insert_mode); end if;
      elsif k>p.k then
         put(p.rc, k, x, h);
         if h then balance_right(p, h, insert_mode); end if;
      else -- k=p.k
         raise already_exists;
      end if;
   end if;
exception
   when storage_error => raise space_overflow;
end put;

As we shall see below, rebalancing operations after insertions and after deletions
have very few differences, so only a single balancing procedure is provided,
which has a parameter that tells whether it works after an insertion or after a
deletion. The type for this parameter, which is declared in the package's body,
is

type mode is (insert_mode, remove_mode);

We suggest that the reader attempt to understand the balancing algorithm
focusing only on the insertion mode for the present. Later, after the
discussion about removal, he should read this algorithm again, focusing then
only on the removal mode.
In some circumstances, the increase in height of a subtree does not require
the tree to be reorganized; it may have become even more balanced than before.
Only when the subtree that has increased its height was already higher than
its sibling before the insertion does the tree need some reorganization to become
balanced again:


procedure balance_left(p: in out pnode; h: in out boolean; m: in mode) is
   -- Either p.lc has increased in height one level (because of an insertion)
   -- or p.rc has decreased in height one level (because of a deletion)
begin
   if p.bl=1 then
      p.bl:= 0;
      if m=insert_mode then h:= false; end if; -- else h keeps true
   elsif p.bl=0 then
      p.bl:= -1;
      if m=remove_mode then h:= false; end if; -- else h keeps true
   else -- p.bl=-1
      rebalance_left(p, h, m);
   end if;
end balance_left;
The figure below illustrates the three cases; the shadowed area indicates the difference in height before and after the recent insertion:

(Figure: the three cases of balance_left after the left child has grown.)

In the first case, the increase in height of the left subtree has left the subtree
rooted at p even more balanced. Moreover, the new height of the subtree rooted
at p is the same as before the insertion. In the second case, even though the left
subtree has increased in height, the subtree rooted at p is still AVL-balanced,
because the new difference in height of its two children is not greater than one.
However, the new height of the subtree rooted at p is greater than before the
insertion. So, although no reorganization is required at this level, h must keep
being true and we may not discard that some reorganization is required at higher
levels, on return from this recursive call to put. In the third case, the growth
of the left subtree has left the subtree rooted at p definitively unbalanced, as
the difference in height between its two subtrees is now 2. Consequently,
some reorganization is mandatory at this level. The procedure rebalance_left is
in charge of doing it.
The reorganization of nodes to leave a subtree balanced is one of the rare
and unfortunate circumstances in which it is required to consider more than two
levels of a tree, which makes the algorithms cumbersome. Two cases appear,
depending on whether the balancing factor of the subtree that has increased in
height is greater than 0 or not. The following figure illustrates the first case
before and after the rebalancing.
(Figure: the single-rotation case, before and after; node b, with key k1, is promoted above node a, with key k2, and the subtrees A, B and C are reattached in order.)


In this figure, the dashed lines mean that the subtree may or may not have that
additional level of height. Being a binary search tree, it is clear that initially
k1 < k2, all the elements in subtree A have keys lower than k1, all the elements
in subtree B have keys greater than k1 and lower than k2, and all the elements
in subtree C have keys greater than k2. Therefore, the tree that results from
the rearrangement of nodes is a binary search tree too. Moreover, the resulting
tree is also balanced, as the difference in height of the subtrees of the nodes having
keys k1 and k2 is, in both cases, at most 1. Depending on the height of B,
the resulting tree may or may not be higher than before the insertion. If it is still
higher, the parameter h must keep being true, and we cannot discard that
further rearrangements are required at higher levels of the tree on return from the
recursive call. The names a, b and b2 correspond to the meaning of the local
variables used in the procedure that makes the transformation.
Indeed, if B is initially higher than A, the tree that results from this transformation
would be unbalanced. Therefore, a second case has to be considered,
which is illustrated by the figure below:

(Figure: the double-rotation case, with keys k1 < k2 < k3 and subtrees A, B1, B2 and C; node c, with key k2, is promoted to the top.)

Being a binary search tree, it is clear that initially k1 < k2 < k3, all the
elements in subtree A have keys lower than k1, all the elements in subtree B1
have keys greater than k1 and lower than k2, all the elements in subtree B2 have
keys greater than k2 and lower than k3, and all the elements in subtree C have
keys greater than k3. Therefore, the tree that results from the rearrangement of
nodes is a binary search tree too. Moreover, the resulting tree is also balanced,
as the difference in height of the subtrees of the nodes having keys k1, k2 and k3
is, in all cases, at most 1. In this case, no matter what the heights of B1
and B2 are, the height of the subtree rooted at p after the insertion is the same
as before, so the parameter h must be set to false. The names a, b, c, c1 and
c2 correspond to the meaning of the local variables used in the procedure that
makes the transformation.
In either case, the algorithm works in three steps. First, pointers are set to
address the relevant nodes. Second, the links are updated. Third, the
balancing factors and the parameter h are updated, if required. The resulting
algorithm is:
procedure rebalance_left(p: in out pnode; h: out boolean; m: in mode) is
   -- Either p.lc has increased in height one level (because of an insertion)
   -- or p.rc has decreased in height one level (because of a deletion)
   a:      pnode; -- the node initially at the root
   b:      pnode; -- left child of a
   c, b2:  pnode; -- right child of b
   c1, c2: pnode; -- left and right children of c, respectively
begin
   a:= p; b:= a.lc;
   if b.bl<=0 then -- promote b
      b2:= b.rc;                                        -- assign names
      a.lc:= b2; b.rc:= a; p:= b;                       -- restructure
      if b.bl=0 then                                    -- update bl and h
         a.bl:= -1; b.bl:= 1;
         if m=remove_mode then h:= false; end if;       -- else h keeps true
      else -- b.bl=-1
         a.bl:= 0; b.bl:= 0;
         if m=insert_mode then h:= false; end if;       -- else h keeps true
      end if;
   else -- promote c
      c:= b.rc; c1:= c.lc; c2:= c.rc;                   -- assign names
      b.rc:= c1; a.lc:= c2; c.lc:= b; c.rc:= a; p:= c;  -- restructure
      if c.bl<=0 then b.bl:= 0; else b.bl:= -1; end if; -- update bl and h
      if c.bl>=0 then a.bl:= 0; else a.bl:= 1; end if;
      c.bl:= 0;
      if m=insert_mode then h:= false; end if;          -- else h keeps true
   end if;
end rebalance_left;

The procedures balance_right and rebalance_right are symmetrical and are
left as an exercise for the reader; a possible sketch of the former follows.
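A mirror-image sketch of balance_right, obtained by exchanging the roles of
the two subtrees and the signs of the balancing factors (our transcription of
the symmetry, not the book's own code):

   procedure balance_right(p: in out pnode; h: in out boolean; m: in mode) is
      -- Either p.rc has increased in height one level (because of an insertion)
      -- or p.lc has decreased in height one level (because of a deletion)
   begin
      if p.bl=-1 then
         p.bl:= 0;
         if m=insert_mode then h:= false; end if; -- else h keeps true
      elsif p.bl=0 then
         p.bl:= 1;
         if m=remove_mode then h:= false; end if; -- else h keeps true
      else -- p.bl=1
         rebalance_right(p, h, m); -- the mirror image of rebalance_left
      end if;
   end balance_right;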
Remove

As it happens for insertion, the algorithm to remove an element from a
balanced tree is essentially the same as for simple binary search trees. The only
difference is that it has an additional output parameter that tells whether the
removal has had the effect of reducing the tree's height (by one level). Then, after
a normal removal, in the case that the height of the subtree that contained the
deleted element has decreased, balancing must be considered.
Consequently, the procedure that is externally viewed calls the internal procedure
that has the additional parameter:
procedure remove(s: in out set; k: in key) is
   root: pnode renames s.root;
   h:    boolean;
begin
   remove(root, k, h);
end remove;

Then, the element having a key equal to that provided has to be looked for.
If found, the element has to be actually removed:


procedure remove(p: in out pnode; k: in key; h: out boolean) is
begin
   if p=null then raise does_not_exist; end if;
   if k<p.k then
      remove(p.lc, k, h);
      if h then balance_right(p, h, remove_mode); end if;
   elsif k>p.k then
      remove(p.rc, k, h);
      if h then balance_left (p, h, remove_mode); end if;
   else -- k=p.k
      actual_remove(p, h);
   end if;
end remove;

The reader should be aware that the operations for balancing the tree are
the same as for insertion, but with inverted roles. Therefore, after an insertion on
a left child, the procedure balance_left is invoked, whereas after removing an
element from a left child, the procedure invoked is balance_right.
The procedure for actually removing an element has exactly the same four
cases as for simple binary search trees. Indeed, the removal of a node with less
than two children makes that subtree decrease one level in height. The removal
of the leftmost element of the right child may or may not have the effect of
decreasing the right child's height, which is indicated by the output parameter h:
procedure actual_remove(p: in out pnode; h: out boolean) is
   -- Prec.: k=p.k
   plowest: pnode;
begin
   if    p.lc =null and p.rc =null then
      p:= null; h:= true;
   elsif p.lc =null and p.rc/=null then
      p:= p.rc; h:= true;
   elsif p.lc/=null and p.rc =null then
      p:= p.lc; h:= true;
   else -- p.lc/=null and p.rc/=null
      rem_lowest(p.rc, plowest, h);
      plowest.lc:= p.lc; plowest.rc:= p.rc; p:= plowest;
   end if;
end actual_remove;
where
procedure rem_lowest(p: in out pnode; plowest: out pnode; h: out boolean) is
   -- Prec.: p/=null
begin
   if p.lc/=null then
      rem_lowest(p.lc, plowest, h);
      if h then balance_right(p, h, remove_mode); end if;
   else
      plowest:= p; p:= p.rc; h:= true;
   end if;
end rem_lowest;


The algorithm to check whether actual rebalancing is required is the same as the
one for insertion, although the reasoning is different. Now, the three cases are
described by the figure below, where the shadowed areas describe the part of
the tree's height that has vanished as a result of the deletion:

(Figure: the three cases of balance_left in remove mode.)

According to this new figure, the reader should check that the algorithm
balance_left described above is also correct when the mode is for removing.
The procedure rebalance_left described above, which reorganizes the nodes
to keep the tree balanced after an insertion, can also be applied to reorganize the
nodes after a deletion. Only the update of the parameter h must be reconsidered,
because its meaning is different in the two cases: in insert mode it meant
that the tree's height had increased, whereas in remove mode it means that the
tree's height has decreased. The reader should review that algorithm from the
point of view of the adjustment of the value of h.

4.8 Balanced trees II: red-black

Overview

A binary search tree is said to be red-black balanced (in short, RB-balanced) if
the longest path from the root to a leaf is at most twice as long as the shortest
one. Indeed, this property implies that any path from the root to a leaf has
length Θ(log n).² An RB-balanced tree is also called a red-black tree or an RB-tree.

To achieve the RB-balancing property, each node of an RB-tree has a single-bit
mark, which is commonly called the color of the node. By convention, the
two possible values of this mark are called red and black.
To guarantee that the tree grows and shrinks in a balanced way, each operation
on the tree must leave it satisfying two fundamental properties:
P1. The parent of a red node is always black.

P2. All paths from the root to the leaves have the same number of black
nodes. This number is called the black-height of the tree.

As a result of these two properties, the longest possible path from the root
to a leaf has alternately colored nodes, whereas the shortest possible one has
black nodes only. Therefore, P1 and P2 imply that the longest path is at most
twice as long as the shortest one, as desired.
To make both the algorithms and the discussion easier, two complementary
properties are added to the fundamental ones:

P3. The root is always black.

P4. Any empty tree (in particular the children of a leaf node) is considered
to be black, by convention.

²A more thorough analysis (see [5]) can guarantee that the maximum height of an RB-tree of n
nodes is h_max ≤ 2·log2(n + 1).


The type declaration

According to the principles above, the specification part of the package d_set for
a balanced tree implementation is:

generic
   -- Exactly as in section 4.6.
package d_set is
   -- Exactly as in section 4.2 (mappings).
private
   type node;
   type pnode is access node;
   type color is (red, black);
   type node is
      record
         k:      key;
         x:      item;
         c:      color;
         lc, rc: pnode;
      end record;
   type set is
      record
         root: pnode;
      end record;
end d_set;
Empty, Get and Update
These operations are exactly the same as for simple binary search trees.
Put

The insertion of an element is quite simple. First, the element is inserted as in
a conventional (ie. non-balanced) binary search tree. The newly inserted node
is colored red. In this way, property P2 keeps being satisfied. Therefore, on
return from recursion we shall take care only of P1: if a red node has a red
parent, some reorganization of the tree is required.
As the tree was RB before the insertion of the new element, afterwards at
most two conflicting nodes may exist in the tree. The reorganization having to
be achieved one level above the two conflicting consecutive red nodes, there
are only four possible cases for the node (pointed to by p) where the reorganization
has to take place. The figure below illustrates these possible cases; in it,
as in the remaining figures of this section, white circles denote red nodes and
filled circles denote black nodes.
Indeed, there is a fifth possible case, not illustrated by the figure: the one in
which no conflicting red nodes exist, which does not require any reorganization.



(Figure: the four cases LL, LR, RL and RR, according to the positions of the two consecutive red nodes below p, each with subtrees A, B, C and D.)

From any of these situations, the subtree rooted at p must be reorganized as:

(Figure: the reorganized subtree: a red node on top of two black children, with the subtrees A, B, C and D hanging in order.)
As a result of this reorganization, the subtree rooted at p has become RB
and its black height is the same as before the reorganization. Therefore, the
property P2 keeps being satisfied for the full tree, and the property P1 is satisfied
from p downwards. However, as p was initially black and has now turned red,
it may come into conflict with its parent, which may be red.
Consequently, we may not discard having to make further reorganizations at
higher levels.
So, the algorithm for insertion has two parts: first, the conventional insertion
on a binary search tree, plus coloring the new node red, and second, on return
from the recursion, the due rebalancing operations. In turn, a rebalancing
operation is subdivided into two steps, namely the identification of the case
and the actual reorganization according to the case identified. On top of this
recursive process, the root is colored unconditionally black. It is only at this
point that the black height of the tree may increase. This happens when the
root, which was initially black, has turned red as a result of a rebalancing at
root level. The action on top is as simple as:
procedure put(s: in out set; k: in key; x: in item) is
   root: pnode renames s.root;
begin
   put(root, k, x); root.c:= black;
end put;
The conventional insertion on a binary search tree is:
procedure put(p: in out pnode; k: in key; x: in item) is
begin
   if p=null then
      p:= new node; p.all:= (k, x, red, null, null);
   else
      if    k<p.k then put(p.lc, k, x);
      elsif k>p.k then put(p.rc, k, x);
      else  raise already_exists;
      end if;
      if p.c=black then balance(p); end if;
   end if;
end put;


The rebalancing algorithm uses an enumeration type for recording the cases:

type bal_case is (none, ll, lr, rl, rr);

which is declared local to the body of the package d_set. It invokes the function
id_bal_case to identify the reorganization case, and then changes the links and
colors accordingly:
procedure balance(p: in out pnode) is
   a, b, c: pnode;
begin
   case id_bal_case(p) is
      when none =>
         null;
      when ll =>
         c:= p; b:= c.lc; a:= b.lc;           -- assign names
         c.lc:= b.rc;                         -- link low descendants
         b.lc:= a; b.rc:= c; p:= b;           -- arrange top
         a.c:= black; c.c:= black; b.c:= red; -- recolor the involved nodes
      when lr =>
         c:= p; a:= c.lc; b:= a.rc;
         c.lc:= b.rc; a.rc:= b.lc;
         b.lc:= a; b.rc:= c; p:= b;
         a.c:= black; c.c:= black; b.c:= red;
      when rl =>
         a:= p; c:= a.rc; b:= c.lc;
         c.lc:= b.rc; a.rc:= b.lc;
         b.lc:= a; b.rc:= c; p:= b;
         a.c:= black; c.c:= black; b.c:= red;
      when rr =>
         a:= p; b:= a.rc; c:= b.rc;
         a.rc:= b.lc;
         b.lc:= a; b.rc:= c; p:= b;
         a.c:= black; c.c:= black; b.c:= red;
   end case;
end balance;

The reader should realize that the last two lines of each branch are common
to all the cases except the none case, so they could have been moved out and
written only once. Nevertheless, we have preferred to leave them this way to help
the reader understand each case in isolation from the others.
The function id_bal_case is a little bit tedious, but it has no difficulty. The
drawback of the number of checks to be done per node (six in the worst case)
is reduced to three in the iterative version of the algorithm, because the stack
provides the parent's and grandparent's data of the conflicting node.
As the procedure put is simply recursive, obtaining its iterative equivalent
can easily be achieved through the techniques presented in chapter 3 of vol. III.
As binary search trees are linked downwards, the parent function cannot
be calculated and a stack is required. If so, each time a node becomes colored
red, identifying whether it has entered into conflict can be achieved by a single
check, as the pointer to its parent is just one position below the top of the stack,
and the identification of the reorganizing case (for the grandparent of the node
just colored red) requires only two additional checks if the information of whether
a node is a left child or not is stored in the stack too, as the information relative
to the grandparent is just two positions below the top of the stack.
Such an optimization is left as an exercise to the reader, and the recursive
version, which is far more readable, is presented here to help him understand
the main principles.
function id_bal_case(p: in pnode) return bal_case is
   a:  pnode;
   bc: bal_case;
begin
   bc:= none;
   if p.lc/=null and then p.lc.c=red then
      a:= p.lc;
      if    a.lc/=null and then a.lc.c=red then bc:= ll;
      elsif a.rc/=null and then a.rc.c=red then bc:= lr;
      end if;
   end if;
   if bc=none and then (p.rc/=null and then p.rc.c=red) then
      a:= p.rc;
      if    a.lc/=null and then a.lc.c=red then bc:= rl;
      elsif a.rc/=null and then a.rc.c=red then bc:= rr;
      end if;
   end if;
   return bc;
end id_bal_case;
Remove

Removing an element from an RB-tree is more complicated than inserting it,
due to the greater variety of cases to be considered. As for put, the
algorithm that removes an element from the tree starts with a conventional binary
search tree deletion and afterwards the tree is rebalanced, if required.
Indeed, the rebalancing affects only the shape of the tree. So, in the case
that the element to be removed has two children and is replaced by the leftmost
element from its right subtree, the rebalancing must start at the position of
this latter node, which is where the tree has changed its shape, and then proceed
backwards to the root. As a consequence, from the point of view of rebalancing,
we shall consider that the vanishing node has at most one non-empty subtree.
Let us now address the different possibilities.


Case 1. The node deleted is red. In this case, the removal keeps the properties
P1 and P2 unchanged, so no reorganization is required.

Case 2. The node deleted is black with a red child. In this case, the child
fills the hole created by its parent's removal. Then it is sufficient to color
it black to achieve that both P1 and P2 keep being satisfied.

Case 3. The node deleted is black with a black child. In this case, the
child fills the hole created by its parent's removal. However, in order to
keep P2 satisfied, the root of the new subtree must be colored double
black (ie. its black weight accounts for two). This is an unstable situation,
which implies that some rebalancing has to be done at higher levels. In the
algorithm, this fact is not recorded in the node but indicated
by an output parameter of the subprogram in charge of the removal.

When, as a result of a removal, the root of a subtree has become double
black, that node, henceforth called D, may appear in several different positions:

Subcase 3.1. The double black node is the root. In this case, D can become
simply black. The change applies to every path from the root to the leaves, and
therefore P2 is preserved. This is the only case in which the black height of the
tree decreases. In practice, this is achieved simply by not rebalancing the tree
at the root level.

Subcase 3.2. The double black node is not the root. In this case the tree has
to be rebalanced in some way so that P2 is preserved. As D is not the root, it is
guaranteed to have a parent P. Moreover, as D is double black and its sibling S
must have the same black height as D, S may not be empty, and therefore
the situation can be represented with P on top and the siblings D and S below,
SL and SR being the children of S.

Even though it is guaranteed that S is not empty, SL and SR may be empty,
and the reader should remember that, if so, they are considered black
by convention (property P4).


The analysis that follows assumes that D is a left child. When D is a right
child the discussion is symmetrical and is left as an exercise to the reader.
The cases that appear correspond to the possible colors of the nodes involved.
These possibilities are summarized in the table below, and the reader should
check that it covers all of them.

Colors of
   S    SR   SL   P      Subcase
   R    (B)  (B)  (B)    3.2.R
   B    R    any  any    3.2.BR
   B    B    R    any    3.2.BBR
   B    B    B    R      3.2.BBBR
   B    B    B    B      3.2.BBBB

Subcase 3.2.R. S is red. In this case P is black, because P1 establishes that
no red node has a red parent. For the same reason SL and SR are black too,
whether they are empty or not. In the rebalancing operation, S is promoted to
the place of P, which becomes red and takes D and SL as its children, while S
becomes black.
After this rearrangement, the number of black nodes in the path from D to
the root is the same as before the rebalancing. Therefore, D still has to be
considered double black. The advantage of the change is that now D has a
black sibling, and the resulting subtree can be further processed according
to the cases that are studied below.
Subcase 3.2.BR. S is black and SR is red; P can be either red or black. In this
transformation, S is promoted to the place of P, which becomes the left child
of S and takes D and SL as its children; SL keeps its color unchanged, whatever
it is, S gets the color that P had before the changes, and both P and SR become
black. In the resulting structure, the paths from the root to the leaves of D have
one more black node than before and therefore, for P2 to be satisfied, D no
longer needs to be double black.
The number of black nodes in the paths from the root to the leaves of both
SL and SR is the same as before the changes. Therefore, P2 is satisfied for all
the involved branches. Moreover, whatever the initial colors of P and SL are,
no red node in the resulting structure has a red parent, and hence P1 is
satisfied too.
Subcase 3.2.BBR. Both S and SR are black and SL is red. As empty trees
are considered black, the fact that SL is red implies that it is not empty. The
(possibly empty) children of SL are black, as P1 is satisfied. Therefore, it is
possible to restructure S by promoting SL to its place: SL becomes black,
keeping its old left child a and taking S as its new right child, while S becomes
red, keeping its old right child SR and taking SL's old right child b as its new
left child.

It is clear that in the new hierarchy P1 is satisfied, as no red node has a red
parent, and also that all paths from the root to the leaves of a, b and SR have
the same number of black nodes as before the reorganization, so P2 is preserved as
well.
Indeed, this transformation does nothing about the double black nature of D.
However, the new arrangement of nodes now satisfies the properties required by
subcase 3.2.BR, and that transformation shall be applied afterwards.
Subcase 3.2.BBBR. S, SR and SL are all black and P is red. As P1 is
satisfied, the fact that P is red implies that its (possibly empty) children are
black. In these circumstances, it is sufficient to switch the colors of P and S.

In this transformation no link is modified; only the colors of P and S are
switched. The subtrees SL and SR matter just because their black nature is
essential for the preservation of P1.
Nevertheless, the color switch is sufficient to increase by one the number of
black nodes in the paths from the root to the leaves of D, and therefore it is no
longer required that D accounts as double black to preserve P2. Instead, the
number of black nodes in the paths from the root to the leaves of S is the same
as before the switch. Therefore, the switch has released the double black nature
of D while preserving P1 and P2.
Subcase 3.2.BBBB. S, SR, SL and P are all black. In this case, it is sufficient
to color S red.

As in subcase 3.2.BBBR, no link is modified; just the color of S is
switched. The reader should realize that such a change can only be done provided
that both SL and SR are black, otherwise the property P1 would not be
preserved. The effect of the change is to decrease by one the number of black
nodes from the root to the leaves of S. As a result, this number is the same as
for the paths to the leaves of D. Therefore, the double black nature of D has to
be moved up to P, and further rearrangements at higher levels are to be expected
to solve the double black nature of this latter node.
As a result of the previous discussion, the algorithm for remove starts with
the conventional remove operation for binary search trees, and the rebalancing
is achieved on return from the recursive invocations. As the root plays a special
role, because it is the place where the black height increases (by coloring black
a root turned red by the rebalancing algorithm) or decreases (by ignoring the
double black nature the root may get as a result of a deletion), the recursive
algorithm is invoked from a special procedure on top. The recursive algorithm
returns an output parameter indicating whether the node provided as actual
parameter has turned double black as a result of the removal.
procedure remove(s: in out set; k: in key) is
   root:   pnode renames s.root;
   dblack: boolean;
begin
   remove(root, k, dblack); -- ignore dblack (subcase 3.1)
end remove;
The removal for the general case does the conventional removal on binary
search trees. If the node provided as the actual parameter has become double
black, the appropriate rebalancing is applied, depending on whether the node
is a left or a right child:

procedure remove(p: in out pnode; k: in key; dblack: out boolean) is
begin
   if p=null then
      raise does_not_exist;
   else
      if k<p.k then
         remove(p.lc, k, dblack);
         if dblack then left_dblack_bal(p, dblack); end if;
      elsif k>p.k then
         remove(p.rc, k, dblack);
         if dblack then right_dblack_bal(p, dblack); end if;
      else
         actual_remove(p, dblack);
      end if;
   end if;
end remove;
The actual removal has the same cases as for conventional binary search
trees. The cases in which the actual node to be removed has less than two
children are rather simple: the child, if any, covers the hole left by the removed
node. If the node to be removed has two children, the hole has to be covered by
the leftmost node of its right child. So, in the latter case, the point at which the
shape of the tree changes is not the place of the node containing the element to
be deleted but the place of the leftmost node of its right child.


For the simple cases, in addition to the actual removal, it has to be decided,
depending on the colors of the removed node and its replacing child, whether
the replacing node has to become double black to preserve P2. For the complex
case, in addition to the replacement by the leftmost node of its right child,
some rebalancing has to be considered. Although the crossing of cases may make
the algorithm look somewhat complex, after a careful look the reader will
discover that it is quite simple.
procedure actual_remove(p: in out pnode; dblack: out boolean) is
   -- Prec.: k=p.k
   plowest: pnode;
begin
   if p.lc=null and p.rc=null then
      if p.c=red then            -- p is red (case 1): freely remove
         dblack:= false;
      else                       -- p is black with no red child (case 3)
         dblack:= true;
      end if;
      p:= null;
   elsif p.lc=null and p.rc/=null then
      if p.c=red then            -- p is red: freely remove (case 1)
         p:= p.rc; dblack:= false;
      elsif p.rc.c=red then      -- p is black with a red child (case 2)
         p:= p.rc; p.c:= black; dblack:= false;
      else                       -- p is black with no red child (case 3)
         p:= p.rc; dblack:= true;
      end if;
   elsif p.lc/=null and p.rc=null then
      if p.c=red then            -- p is red: freely remove (case 1)
         p:= p.lc; dblack:= false;
      elsif p.lc.c=red then      -- p is black with a red child (case 2)
         p:= p.lc; p.c:= black; dblack:= false;
      else                       -- p is black with no red child (case 3)
         p:= p.lc; dblack:= true;
      end if;
   else -- p.lc/=null and p.rc/=null
      rem_lowest(p.rc, plowest, dblack);
      plowest.lc:= p.lc; plowest.rc:= p.rc; plowest.c:= p.c;
      p:= plowest;
      if dblack then right_dblack_bal(p, dblack); end if;
   end if;
end actual_remove;

The procedure rem_lowest, which actually removes the leftmost node of a
subtree, is very similar to its equivalent for conventional binary search trees.
When the leftmost node is found, in addition to the removal, it has to be
decided whether the replacing node has to become double black or not, in a way
similar to the simple cases considered in the procedure above.
On return from the recursion, if the actual parameter has become double
black, some rebalancing has to be applied.
Consequently, the following algorithm results:


procedure rem_lowest
   (p: in out pnode; plowest: out pnode; dblack: out boolean) is
   -- Prec.: p/=null
begin
   if p.lc/=null then
      rem_lowest(p.lc, plowest, dblack);
      if dblack then left_dblack_bal(p, dblack); end if;
   else
      plowest:= p;
      if p.c=red then
         -- p is red: freely remove (case 1)
         p:= p.rc; dblack:= false;
      elsif p.rc/=null and then p.rc.c=red then
         -- p is black with a red child (case 2)
         p:= p.rc; p.c:= black; dblack:= false;
      else
         -- p is black with no red child (case 3)
         p:= p.rc; dblack:= true;
      end if;
   end if;
end rem_lowest;

The two procedures to rebalance the tree when a node has become double
black are symmetrical, and here only left_dblack_bal is described. The other one
is left as an exercise to the reader.
The procedure left_dblack_bal invokes the function id_case to identify the
particular case it has to deal with, which is described by the enumeration type
left_db_case, declared in the body of the package d_set as follows:

type left_db_case is (R, BR, BBR, BBBR, BBBB);

The names describe the subcases of case 3.2 using the same convention as in the
text above, ie. the colors of S, SR, SL and P respectively. Then, it applies the
transformations that correspond to each case. The resulting algorithm is:

procedure left_dblack_bal(p: in out pnode; dblack: out boolean) is
   idc: left_db_case;
   n, s, sl, sr, a, b: pnode;
begin
   n:= p.lc; s:= p.rc;
   idc:= id_case(n, p, s);
   dblack:= false;
   case idc is
      when R =>
         -- restructure p so that it can be reduced to either BR, BBR or BBBR
         a:= s.lc; s.lc:= p; p.rc:= a;
         p.c:= red; s.c:= black;
         p:= s;
         -- then apply either BR, BBR or BBBR
         left_dblack_bal(p.lc, dblack);
         -- no need to check dblack because it is known to be false
      when BR =>
         a:= s.lc; s.lc:= p; p.rc:= a; sr:= s.rc;
         s.c:= p.c; p.c:= black; sr.c:= black;
         p:= s;
      when BBR =>
         -- restructure s so that it can be reduced to case BR
         sl:= s.lc; b:= sl.rc; s.lc:= b; sl.rc:= s;
         s.c:= red; sl.c:= black;
         s:= sl;
         -- then apply BR
         a:= s.lc; s.lc:= p; p.rc:= a; sr:= s.rc;
         s.c:= p.c; p.c:= black; sr.c:= black;
         p:= s;
      when BBBR => -- subcase 3.2.BBBR
         p.c:= black; s.c:= red;
      when BBBB => -- subcase 3.2.BBBB
         s.c:= red; dblack:= true;
   end case;
end left_dblack_bal;
The algorithm for id_case is trivial and it is left as an exercise to the reader.

4.9 B+ trees

Overview
Retrieving data from balanced binary trees is very efficient, as it requires about log2 n key comparisons. So, finding a key in a set having 10,000 keys requires about 14 key comparisons. A detailed analysis can prove that in the very worst case it will never require more than 19 in AVL trees and 28 in RB trees (about 1.44·log2 n and 2·log2 n respectively).
However, if the set is to be stored on a disk, the technique for implementing sets must be different. A physical access to a disk requires that the disk arm is moved to the proper track, then waiting for the proper sector to pass under the arm's head and finally transferring all the data contained in that sector. The size of a sector is a characteristic of the disk and can not be changed. Usual sizes range from 1 Kbyte to 8 Kbyte. In this section we will use the word page as a synonym of disk sector.
Indeed, the time of comparing a key is insignificant compared with the sum of the seek time plus the rotational delay plus the transfer time (an access to a fast disk may take about 10^-2 s, whereas a key comparison on a fast processor may take about 10^-6 s). Consequently, the aim of a technique for implementing mappings upon secondary storage must be to minimize the number of physical accesses to the disk, not the actual key comparisons.
The family of data structures known as B-trees is intended to achieve this
goal. This section describes in detail the most widespread variant of B-trees,
ie. the B+ trees. The other variants of the family are summarized at the end of
the section.
A B+ tree has two kinds of nodes: the leaves and the interior nodes. The leaves contain the items and their corresponding keys, whereas the interior nodes contain only keys and pointers. Interior nodes have a similar purpose to the interior nodes of binary search trees, ie. to redirect the search towards the appropriate subtree, and look like:

[Figure: an interior node holding n keys interleaved with n+1 pointers to subtrees T0..Tn]

The elements stored in each of the subtrees satisfy:

∀(k,x) ∈ T0 : k < k1
∀(k,x) ∈ Ti : ki ≤ k < ki+1   (1 ≤ i < n)
∀(k,x) ∈ Tn : kn ≤ k
As the reader can see, there is one more pointer than keys. It is also important to realize that key equality is accepted by the subtree "at the right" of the corresponding separator key.
The leaf nodes look as follows:

[Figure: a leaf node holding n elements, each a couple (key, item)]

Each node of the tree is located in a different disk page. So, in principle, each access to a node requires a physical access to the disk³.
The search principle of B+ trees is similar to that of binary search trees but, as each node has many more subtrees, B+ trees are much flatter and the operations to insert, remove and retrieve require a much lower number of steps, ie. of physical accesses to the disk. Indeed, the more items that fit into a leaf node, the lower is the number of leafs that are required. Consequently, the sizes of both leafs and interior nodes should be adjusted to best fit the size of a disk page. Moreover, as we shall see below, the algorithms that manage the structure guarantee that each node, but the root, is filled at least to one half of its capacity.
Even though the search principle of B+ trees is similar to that of binary search trees, the way they grow and shrink is completely different. Binary search trees grow at the leaves. Instead, B+ trees grow in height at the root: when no more elements (either items or pointers) can be allocated in the root, a new sibling is created for the old root and a new root is allocated on top of them. Likewise, B+ trees decrease in height at the root: when, as a result of a deletion, the root has only one child, that only child becomes the new root. This way of growing and shrinking in height has as a result that all the leaves are always at the same depth. In other words, B+ trees are always completely balanced.
For the sake of keeping a unified notation with the rest of the book, the algorithms below describe the management of nodes as if they were allocated in dynamic memory. In any practical implementation, the reader should replace any operation of the type p:= new node(leaf); or p:= new node(interior); in the algorithms below by a request to the operating system for a new disk page and its formatting to keep either an interior node or a leaf.

³Operating systems and database management systems use buffering techniques that keep the most recently accessed pages in primary memory, but for the present we may ignore such kind of improvement.

4.9. B+ TREES

101

The type declaration


According to the previous description, the representation of a B+ tree is:

generic
   -- Exactly as in section 4.6.
package d_set is
   -- Exactly as in section 4.2 (mappings).
private
   type node;
   type pnode is access node;
   type nodetype is (leaf, interior);
   max_leaf: constant natural:= 52;
   max_int:  constant natural:= 340;
   type elem is
      record
         k: key;
         x: item;
      end record;
   type t_pointers is array(natural range 0..max_int)  of pnode;
   type t_keys     is array(natural range 1..max_int)  of key;
   type t_elems    is array(natural range 1..max_leaf) of elem;
   type node(t: nodetype) is
      record
         n: natural;
         case t is
            when leaf =>
               te: t_elems;
            when interior =>
               tk: t_keys;
               tp: t_pointers;
         end case;
      end record;
   type set is
      record
         root: pnode;
      end record;
end d_set;

Indeed, the constants max_leaf and max_int must be adjusted to the sizes of disk pages, keys and items. A completely general purpose package should receive the disk page size as an additional generic parameter, and compute the appropriate values for these constants by formulae that make use of the page size plus the sizes of the types pnode, key and item. The latter can be obtained by the attribute size, which provides the size in bits of any type. A general purpose package should also receive the procedures to request a new disk page from the operating system and to release it back when it is no longer used. However, at this stage these technicalities would serve only to distract the reader from the essential aspects of the data structure under study. Anyway, the values proposed are typical, assuming a page of 4 Kbytes, a key of 8 bytes, a pointer of 4 bytes and an item of 70 bytes.
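For instance, a minimal sketch of such formulae, assuming a hypothetical additional generic parameter page_size expressed in bytes (record layout overheads are neglected; with the sizes quoted above these expressions yield precisely 340 and 52):

max_int: constant natural:=
   (page_size*8 - natural'size - pnode'size) / (key'size + pnode'size);
   -- an interior node must fit the counter n, max_int keys
   -- and max_int+1 pointers into one page ('size yields bits)
max_leaf: constant natural:=
   (page_size*8 - natural'size) / (key'size + item'size);
   -- a leaf must fit the counter n and max_leaf couples (key, item)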

Empty
The representation of the empty set in B+ trees is as simple as:

procedure empty(s: out set) is
   root: pnode renames s.root;
begin
   root:= null;
end empty;
Put
As stated earlier, the root has an exceptional nature in B+ trees because it is the only node that is allowed to be filled under one half of its capacity and because it is at the root level that the tree grows and decreases in height. Consequently, the insertion of a new element is achieved by means of two procedures: put, which deals with the root and may make the tree grow in height, and put0, which deals with arbitrary nodes and may make the tree grow in width.
The algorithm for put is:

procedure put(s: in out set; k: in key; x: in item) is
   p:   pnode renames s.root;
   ps:  pnode; -- sibling of p
   kps: key;   -- smallest key in elements pending from ps
   q:   pnode; -- root on top of p (if the tree increases its height)
begin
   if p=null then
      p:= new node(leaf);
      p.n:= 1; p.te(p.n):= (k, x);
   else
      put0(p, k, x, ps, kps);
      if ps/=null then -- the tree increases in height
         q:= new node(interior);
         q.n:= 1; q.tp(0):= p; q.tp(1):= ps; q.tk(1):= kps;
         p:= q;
      end if;
   end if;
end put;
If the set is initially empty, the insertion of a new element requires the creation of a new node, which will be the root of the tree. Otherwise, the element is inserted in the tree through the procedure put0. It may happen that, as a result of the insertion made by put0, there is not sufficient space in the former root to allocate the required elements or pointers. If so, put0 creates a new right sibling for the node provided, in which case, on return, the value of ps (an abbreviation for p's sibling) is different from null and kps is the lowest key of the elements currently allocated in the subtree rooted at ps. In this situation, the tree must grow in height, by the creation of a new root on top of the former one. The following figures illustrate this growth. The initial situation is:

[Figure: p pointing to the full root node]

After the insertion of the new element in this tree, the tree is reorganized. Consequently, the contents of the node pointed to by p may be updated and occasionally split by a new sibling. If so, the situation becomes:

[Figure: p pointing to the old root and ps to its new right sibling]


After the creation of the new root on top of the previous one and updating p to point to the new root, the situation finally becomes:

[Figure: the new root on top of p and ps, with p updated to point to it]

The procedure put0 inserts the element (k,x) into the subtree rooted at p. As a result, it may happen that there is not enough space in p to allocate either the element itself or the pointers to newly allocated pages at lower levels. If so, the elements of p plus any new contents are redistributed between p and a new right sibling of p, returned as the parameter ps. In this case, on return, ps/=null and kps is the key of the lowest element in the subtree rooted at ps. If no sibling has been created, on return ps=null and the value of kps is meaningless.
The code for put0 is:

procedure put0
   (p: in pnode; k: in key; x: in item; ps: out pnode; kps: out key) is
   q:   pnode;   -- child selected
   iq:  natural; -- index of the child selected (ie. q = p.tp(iq))
   qs:  pnode;   -- new immediate right sibling of q, if required
   kqs: key;     -- key of the lowest element pending from qs
begin
   ps:= null;
   case p.t is
      when leaf =>
         if p.n < max_leaf then
            insert_in_leaf(p, k, x); -- may raise already_exists
         else -- p.n = max_leaf: the tree increases in width at leaf level
            if is_in(p, k) then raise already_exists; end if;
            ps:= new node(leaf);
            distribute_in_leafs(p, ps, k, x, kps);
         end if;
      when interior =>
         iq:= select_child(p, k); q:= p.tp(iq);
         put0(q, k, x, qs, kqs);
         if qs/=null then
            if p.n < max_int then
               insert_in_interior(p, qs, iq, kqs);
            else -- p.n = max_int: the tree increases in width at interior level
               ps:= new node(interior);
               distribute_in_interior(p, ps, q, qs, iq, kqs, kps);
            end if;
         end if;
   end case;
end put0;
Let us consider separately the cases that the node pointed to by p is a leaf or an interior node. If it is a leaf and there is still enough space to allocate the new element in the node, it is inserted in that node, as in a set implemented by an ordered array. The procedure insert_in_leaf is in charge of the details. But if the leaf is already full, the elements must be redistributed between the current node and a new right sibling. The procedure distribute_in_leafs is in charge of the details and, on return, it provides, as an output parameter, the key of the lowest element stored in the new right sibling. Indeed, after this redistribution, neither p nor ps is occupied under one half of its potential capacity. Obviously, if an element with the same key already existed in the set, the node must remain unchanged and the appropriate exception has to be raised.
If the node addressed by p is an interior node, things are slightly more complicated. First, the appropriate child has to be selected. This child is:

tp(0) if k < k1
tp(i) if ki ≤ k < ki+1   (1 ≤ i < n)
tp(n) if kn ≤ k
The function select_child is in charge of doing it. Second, the element is inserted recursively into the selected child. As a result, a new right sibling for the selected child may (or may not) be generated. If so, the situation is illustrated by the figure below:

[Figure: the child q and its new right sibling qs, whose lowest key is kqs]
If there is room enough to allocate the new couple of key and pointer in the node pointed to by p, the insertion is achieved by shifting one position right all the couples (key, pointer) starting from position iq+1, leading to:

[Figure: p after inserting the couple (kqs, qs) immediately at the right of q]


The procedure insert_in_interior is in charge of the details. However, it may occur that the node pointed to by p is completely full and there is no possibility to allocate the couple kqs and qs in it. In this case, a new right sibling for p must be created, and the pointers and keys currently in p, plus the couple kqs and qs, have to be distributed between p and its new right sibling ps. It is clear that after such a distribution each node will be filled at least to one half of its capacity. The following figures illustrate the transformation. On return from the recursive call to put0, we have:

[Figure: p full, and the new couple (kqs, qs) still to be allocated]
As in this case there is no more room in p to allocate the couple (qs, kqs), a new right sibling has to be created and the existing couples plus the new one must be evenly distributed between the two siblings.
The key of the lowest element stored in the subtree rooted at ps.tp(0) (in the illustrating figures, kc) has to be promoted up, ie. passed as the out parameter kps, because it will be needed at the upper level of recursion to complete the appropriate structure:

[Figure: the couples evenly distributed between p and ps, with kps promoted up]

Indeed, depending on the value of iq, the couple (qs, kqs) shall go either to p, or to ps, or even become split: qs at position zero of ps and kqs promoted up, in which case kps=kqs. The procedure distribute_in_interior is in charge of these details.
The pending auxiliary subprograms are described later. At this stage it is
more important to focus on the essential aspects of the main operations.
Get and Update
The algorithms for retrieving and updating are quite simple. They just have to follow the path towards the appropriate leaf, which is directed by the keys stored in the interior nodes, and then look for the searched element in that leaf. The procedure get invokes an internal procedure that receives a pointer type as a parameter and actually does the job:


procedure get(s: in set; k: in key; x: out item) is
   p: pnode renames s.root;
begin
   get0(p, k, x);
end get;
where

procedure get0(p: in pnode; k: in key; x: out item) is
   q:  pnode;   -- child selected
   iq: natural; -- index of the child selected (ie. q = p.tp(iq))
begin
   case p.t is
      when leaf =>
         get_from_leaf(p, k, x);
      when interior =>
         iq:= select_child(p, k); q:= p.tp(iq);
         get0(q, k, x);
   end case;
end get0;
The procedure get_from_leaf makes a simple search on the (sorted) array of elements stored in that leaf. The details are discussed below.
The procedure update is exactly the same as get, but replacing get_from_leaf by update_leaf, with exactly the same parameters, although in the latter case the mode of x is input.
Remove
The operation to remove an element from a B+ tree is divided into two procedures: remove, which deals with the exceptional nature of the root and may make the tree shrink in height, and remove0, which deals with arbitrary nodes and may make the tree shrink only in width. The algorithm for remove is:
procedure remove(s: in out set; k: in key) is
   p: pnode renames s.root;
   kp: key;             -- irrelevant at root level
   kpu, unbal: boolean; -- irrelevant at root level
begin
   if p=null then raise does_not_exist; end if;
   remove0(p, k, kp, kpu, unbal);
   if p.n=0 then -- the tree decreases in height
      case p.t is
         when leaf     => p:= null;
         when interior => p:= p.tp(0);
      end case;
   end if;
end remove;
If the tree is empty, it is obvious that the element can not be removed. Otherwise, the elimination is delegated to the procedure remove0. If, as a result, p.n=0 and the root node is also a leaf, that leaf is the only node of the tree and moreover it contains no element. Therefore the node can be released. This case is so simple that it does not require any illustrating figure.
If the root node is an interior node and p.n=0, this means that only p.tp(0) contains a significant pointer. If so, the node addressed by this pointer can become the new root, and the former root can be released. In either case, the tree's height decreases one level. The figure below illustrates the transformation:

[Figure: the only child of the old root becoming the new root]

The algorithm for the procedure remove0 is:

procedure remove0
   (p: in pnode; k: in key; kp: out key; kpu, unbalp: out boolean) is
   q:      pnode;   -- child selected
   iq:     natural; -- index of the child selected (ie. q = p.tp(iq))
   kqu:    boolean; -- tells if kq has been updated
   kq:     key;     -- key of the lowest element in q, if kqu is true
   unbalq: boolean; -- tells if q is below one half of its capacity
   pls, prs: pnode; -- left and right siblings to distribute elements
   kprs:   key;     -- key associated to prs.tp(0)
begin
   kpu:= false; unbalp:= false;
   case p.t is
      when leaf =>
         remove_from_leaf(p, k, kp, kpu);
         unbalp:= p.n <= max_leaf/2;
      when interior =>
         iq:= select_child(p, k); q:= p.tp(iq);
         remove0(q, k, kq, kqu, unbalq);
         if kqu then
            if iq>0 then p.tk(iq):= kq; else kpu:= true; kp:= kq; end if;
         end if;
         if unbalq then -- redistr. between q and an immed. sibling
            if iq=p.n then iq:= iq-1; end if; -- if no right one: left one
            pls:= p.tp(iq); prs:= p.tp(iq+1); kprs:= p.tk(iq+1);
            distribute(pls, prs, kprs);
            if prs.n>0 then -- there are still two siblings
               p.tk(iq+1):= kprs;
            else -- prs.n=0: all the former elements in (pls, prs) are in pls
               remove_child(p, iq+1); -- the tree decreases in width
               unbalp:= p.n < max_int/2;
            end if;
         end if;
   end case;
end remove0;
The procedure remove0 receives as input parameters k, the key of the element to be removed, and p, a pointer to the root of the subtree where the element


is to be removed from. It provides as output parameters a boolean unbalp (abbreviation of unbalanced p), which is true if and only if, as a result of the removal, the occupation of the node p is below one half of its capacity. For interior nodes, the capacity is measured in the number of pointers, which is one more than the number of keys. Additionally, it returns a second boolean, kpu (abbreviation of kp updated), which tells if, as a result of the removal, the key of the lowest element stored in the subtree rooted at p has been updated. If true, the output parameter kp is the key of the new lowest element stored in the subtree rooted at p.
The case that p is a leaf is rather simple. It removes the couple (k,x) from the sorted array in the node. Indeed, the procedure remove_from_leaf, which is in charge of the details, may raise the exception does_not_exist if no such couple is found. If, after the removal, the node has become filled under one half of its capacity, the appropriate output parameter must be set.
The case that p is an interior node is more complex. First, the child that may contain the element to be removed has to be selected. The function select_child is in charge of the details. Then, the removal proceeds recursively upon the selected child q. On return, two adjustments are required. On one hand, if the lowest key of the elements now stored in the subtree rooted at q is different than before the removal, the new key must be placed at the appropriate switch position. If the position of the selected child q in p is greater than 0, this position is in the node p itself. The figure below illustrates the situation:

[Figure: the new lowest key kq replacing p.tk(iq)]

However, if iq=0, then kq must be propagated upwards because now kq is also the lowest key in the subtree rooted at p. Therefore, kq must be copied on kp and kpu set to true:

[Figure: kq propagated upwards as kp when iq=0]

On the other hand, if the removal has left the node q filled below one half of its capacity, its elements have to be redistributed with those of one of its immediate siblings. In principle, the algorithm arbitrarily selects the right one, unless there is no such right one. In either case, pls and prs indicate the left and right siblings whose elements will be distributed. The procedure distribute is in charge of redistributing evenly the elements currently in pls and prs and of returning the lowest key in prs after the redistribution. However, if the sibling selected was also at one half of its capacity, all the elements from both siblings can be placed in a single node. If so, the procedure distribute puts all of them in the left sibling and, on return, prs.n=0.
If, on return from distribute, two siblings still exist, the switching key must be placed at the proper position in p:


[Figure: pls and prs after the redistribution, with the switching key k's placed in p]

where k's is the key returned by the procedure distribute as kprs.


Instead, if prs has become empty, all the keys and pointers at the right of prs must be shifted left to fill the gap and prs can be released. On return from distribute, we have:

[Figure: p with an empty child (marked X) where prs used to be]

where the subtree marked as empty (ie. X) has a single node that contains no pointers. After filling the gap, the situation is:

[Figure: p after shifting left the keys and pointers at the right of the gap]

This is the way the tree shrinks in width. The procedure remove_child is in charge of the details. Indeed, after such a shift, p may become filled under one half of its capacity and this fact must be propagated upwards through the parameter unbalp. The reader should realize that leafs are considered to be below one half of their capacity if n ≤ max_leaf/2, whereas for interior nodes the condition is n < max_int/2. This is because the number of pointers in interior nodes is one more than the number of keys, ie. up to max_int+1 in total.
The auxiliary operation select_child
This operation is used by put0, get0, update0 and remove0 to switch towards the appropriate child. It receives a pointer p to an interior node plus a key k, and returns the index of the child of p that has any chance to contain an item with key k, or else of the child where a new element with key k must be inserted. The algorithm, which uses binary search, is:
function select_child(p: in pnode; k: in key) return natural is
   n:  natural renames p.n;
   tk: t_keys  renames p.tk;
   i, j, m: natural;
begin
   i:= 1; j:= n;
   while i<=j loop
      m:= (i+j)/2;
      if tk(m)>k then j:= m-1; else i:= m+1; end if;
   end loop;
   return j;
end select_child;


The auxiliary operation is_in
This operation is used by put0. It receives a pointer p to a leaf plus a key k, and returns true if there is an element with key k stored in p. It makes a binary search too:
function is_in(p: in pnode; k: in key) return boolean is
   n:  natural renames p.n;
   te: t_elems renames p.te;
   i, j, m: natural;
   found: boolean;
begin
   i:= 1; j:= n; found:= false;
   while i<=j and not found loop
      m:= (i+j)/2;
      if    te(m).k > k then j:= m-1;
      elsif te(m).k < k then i:= m+1;
      else  found:= true;
      end if;
   end loop;
   return found;
end is_in;

The auxiliary operation insert_in_leaf
This operation is used by put0. It receives a pointer p to a leaf which is not full, a key k plus its associated item x, and inserts them in the node pointed to by p. It may raise already_exists. The algorithm is:

procedure insert_in_leaf(p: in pnode; k: in key; x: in item) is
   n:  natural renames p.n;
   te: t_elems renames p.te;
   i, j, m: natural;
   found: boolean;
begin
   i:= 1; j:= n; found:= false;
   while i<=j and not found loop
      m:= (i+j)/2;
      if    te(m).k > k then j:= m-1;
      elsif te(m).k < k then i:= m+1;
      else  found:= true;
      end if;
   end loop;
   if found then raise already_exists; end if;
   for i in reverse j+1..n loop te(i+1):= te(i); end loop;
   te(j+1):= (k, x);
   n:= n+1;
end insert_in_leaf;


The auxiliary operation distribute_in_leafs
This operation is used by put0 when the leaf where a new element has to be inserted is completely full and, moreover, no element in that leaf has a key equal to that of the element to be inserted. It receives a pointer p to a leaf which is full, a pointer ps to an empty leaf that is to become the new right sibling of p, plus the element to be added, ie. (k,x). It distributes evenly the elements currently in p plus the couple (k,x) into p and ps. Moreover, it returns the output parameter kps, which is the key of the lowest element allocated in ps.
procedure distribute_in_leafs
   (p, ps: in pnode; k: in key; x: in item; kps: out key) is
   ip:  natural; -- index over components of p
   ips: natural; -- index over components of ps
begin
   ps.n:= (p.n+1)/2;
   ip:= p.n; ips:= ps.n;
   while ips>0 and then p.te(ip).k>k loop
      ps.te(ips):= p.te(ip);
      ips:= ips-1; ip:= ip-1;
   end loop;
   if ips>0 then ps.te(ips):= (k, x); ips:= ips-1; end if;
   while ips>0 loop
      ps.te(ips):= p.te(ip);
      ips:= ips-1; ip:= ip-1;
   end loop;
   kps:= ps.te(1).k;
   if k<kps then -- (k, x) hasn't been inserted in ps
      while ip>0 and then p.te(ip).k>k loop
         p.te(ip+1):= p.te(ip);
         ip:= ip-1;
      end loop;
      p.te(ip+1):= (k, x);
   end if;
   p.n:= p.n + 1 - ps.n;
end distribute_in_leafs;
The auxiliary operation insert_in_interior
This operation is used by put0 when a new couple of a key and a pointer has to be inserted in an interior node that is not completely full yet. It receives as parameters p (a pointer to the interior node where the insertion has to be done), iq (the position of the child selected for the insertion), and the couple to be inserted, made up of a pointer to a subtree qs and its associated key kqs. This means that, as a result of inserting an element into a child q of p -being q = p.tp(iq)-, a right sibling qs for q has been created and that the key of the lowest element in the subtree rooted at qs is kqs. The procedure insert_in_interior inserts the couple kqs and qs in p immediately at the right of q, shifting the higher couples one position right. It is guaranteed that there is room enough for the insertion.


procedure insert_in_interior(p, qs: in pnode; iq: in natural; kqs: in key) is
   n:  natural    renames p.n;
   tk: t_keys     renames p.tk;
   tp: t_pointers renames p.tp;
begin
   for i in reverse iq+1..n loop
      tk(i+1):= tk(i); tp(i+1):= tp(i);
   end loop;
   tk(iq+1):= kqs; tp(iq+1):= qs;
   n:= n+1;
end insert_in_interior;
The auxiliary operation distribute_in_interior
This operation is used by put0 when a new couple of a key and a pointer has to be inserted in an interior node p that is completely full and, consequently, the existing couples in p plus the new couple must be distributed evenly between p and a new immediate right sibling of p.
It receives as parameters two pointers p and ps: respectively, the node that is full and an empty interior node to become its immediate right sibling. Also iq, the position in p of the child q that has got the new sibling, that new sibling qs, and its associated key kqs.
The procedure distribute_in_interior distributes evenly the couples that are in p plus the new couple (kqs,qs) into p and ps. On return, the output parameter kps indicates the key of the lowest element stored in the subtree rooted at ps.
Usually, (kqs,qs) will be placed immediately at the right of q, but it may happen that q becomes the last pointer in p, in which case qs will be the pointer placed at position zero in ps. If so, kps=kqs. The algorithm considers three cases, depending on whether the new value of p.n is lower than, equal to, or greater than iq.
procedure distribute_in_interior
   (p, ps, q, qs: in pnode; iq: in natural; kqs: in key; kps: out key) is
   ip:  natural; -- index over components of p
   ips: natural; -- index over components of ps
begin
   ps.n:= max_int/2; p.n:= max_int - ps.n;
   if p.n<iq then -- both q and qs must go to ps
      ips:= ps.n; ip:= max_int;
      while ip>iq loop
         ps.tk(ips):= p.tk(ip); ps.tp(ips):= p.tp(ip);
         ips:= ips-1; ip:= ip-1;
      end loop;
      ps.tk(ips):= kqs; ps.tp(ips):= qs; ips:= ips-1;
      while ips>0 loop
         ps.tk(ips):= p.tk(ip); ps.tp(ips):= p.tp(ip);
         ips:= ips-1; ip:= ip-1;
      end loop;
      kps:= p.tk(ip); ps.tp(0):= p.tp(ip);
   elsif p.n=iq then -- q keeps in p, qs goes to ps.tp(0)
      ips:= ps.n; ip:= max_int;
      while ip>iq loop
         ps.tk(ips):= p.tk(ip); ps.tp(ips):= p.tp(ip);
         ips:= ips-1; ip:= ip-1;
      end loop;
      kps:= kqs; ps.tp(0):= qs;
   else -- p.n>iq: both q and qs keep in p
      ips:= ps.n; ip:= max_int;
      while ips>0 loop
         ps.tk(ips):= p.tk(ip); ps.tp(ips):= p.tp(ip);
         ips:= ips-1; ip:= ip-1;
      end loop;
      kps:= p.tk(ip); ps.tp(0):= p.tp(ip); ip:= ip-1;
      while ip>iq loop
         p.tk(ip+1):= p.tk(ip); p.tp(ip+1):= p.tp(ip);
         ip:= ip-1;
      end loop;
      p.tk(ip+1):= kqs; p.tp(ip+1):= qs;
   end if;
end distribute_in_interior;
The auxiliary operation get_from_leaf
This operation is used by get0 to retrieve the item associated to the given key from a leaf node.
procedure get_from_leaf(p: in pnode; k: in key; x: out item) is
   n:  natural renames p.n;
   te: t_elems renames p.te;
   i, j, m: natural;
   found: boolean;
begin
   i:= 1; j:= n; found:= false;
   while i<=j and not found loop
      m:= (i+j)/2;
      if    te(m).k > k then j:= m-1;
      elsif te(m).k < k then i:= m+1;
      else  found:= true;
      end if;
   end loop;
   if not found then raise does_not_exist; end if;
   x:= te(m).x;
end get_from_leaf;
The auxiliary operation update_leaf
This operation is used by update0 to replace the item currently associated to the given key with the newly provided one. The algorithm is exactly the same as for get_from_leaf, with the exception that x is an input parameter and that the last sentence is te(m).x:= x; instead of x:= te(m).x.
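Following that description, a sketch of update_leaf would be:

procedure update_leaf(p: in pnode; k: in key; x: in item) is
   n:  natural renames p.n;
   te: t_elems renames p.te;
   i, j, m: natural;
   found: boolean;
begin
   i:= 1; j:= n; found:= false;
   while i<=j and not found loop
      m:= (i+j)/2;
      if    te(m).k > k then j:= m-1;
      elsif te(m).k < k then i:= m+1;
      else  found:= true;
      end if;
   end loop;
   if not found then raise does_not_exist; end if;
   te(m).x:= x; -- the only difference from get_from_leaf
end update_leaf;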


The auxiliary operation remove_from_leaf
This operation is used by remove0 to delete an element from a leaf. It receives as parameters a pointer p to a leaf node and a key k. The output parameter kpu (for kp updated) indicates if the key of the lowest element in node p after the deletion is different than before; if so, kp is that new lowest key.
procedure remove_from_leaf
   (p: in pnode; k: in key; kp: out key; kpu: out boolean) is
   n:  natural renames p.n;
   te: t_elems renames p.te;
   i, j, m: natural;
   found: boolean;
begin
   i:= 1; j:= n; found:= false;
   while i<=j and not found loop
      m:= (i+j)/2;
      if    te(m).k > k then j:= m-1;
      elsif te(m).k < k then i:= m+1;
      else  found:= true;
      end if;
   end loop;
   if not found then raise does_not_exist; end if;
   for i in m..n-1 loop te(i):= te(i+1); end loop;
   n:= n-1;
   kp:= te(1).k; kpu:= (m=1);
end remove_from_leaf;

The auxiliary operation distribute
This operation is used by remove0 when, as a result of the removal of an element from a child of a node, this child becomes occupied below one half of its capacity. Then, its elements have to be redistributed with those of one of its immediate siblings. If all the elements fit into a single node, then one of the siblings is left empty and will be released by remove0. Otherwise, the complete set is evenly distributed between the two siblings.
It receives as parameters the pointers to two adjacent siblings, pls and prs being the left and right ones respectively. If the current contents of the two nodes can be placed into a single one, then all of them are moved to pls, and prs is left empty. Otherwise, the contents of both nodes are evenly distributed between the two nodes. In the latter case, the output parameter kprs is the key of the lowest element in the subtree rooted at prs after the distribution.
The algorithm considers separately whether the distribution is between interior nodes or between leaf nodes.

procedure distribute(pls, prs: in pnode; kprs: in out key) is
begin
   case pls.t is
      when leaf     => distribute_leaf    (pls, prs, kprs);
      when interior => distribute_interior(pls, prs, kprs);
   end case;
end distribute;


The procedures distribute_leaf and distribute_interior consider three cases: whether all the elements can be put into a single node, whether the left sibling has a greater number of elements, or whether the right sibling has a greater number of elements. In the first case, all the elements in prs are moved to pls. As all of them have keys greater than those of the elements in pls, they can simply be appended to those stored in pls. In the second case, some elements are moved from pls to prs. To do so, the elements in prs have to be previously shifted right as many positions as elements to be moved, and afterwards the elements from pls are actually moved to prs. In the third case, some elements are moved from prs to pls. To do so, the elements moved are appended to those in pls and afterwards the elements remaining in prs are shifted left as many positions as elements have been moved to pls.

procedure distribute_leaf(pls, prs: in pnode; kprs: in out key) is
   nt: natural;     -- total number of elements
   ne: natural;     -- number of elements to move
   il, jr: natural; -- indexes over pls and prs respectively
begin
   nt:= pls.n + prs.n;
   if nt <= max_leaf then -- all the elements from prs can move to pls
      il:= pls.n;
      for ir in 1..prs.n loop
         il:= il+1; pls.te(il):= prs.te(ir);
      end loop;
      pls.n:= nt; prs.n:= 0;
   elsif pls.n < prs.n then -- (and nt > max_leaf): move from r to l
      ne:= (prs.n-pls.n)/2; -- num. of elements to move
      il:= pls.n; -- move first ne elements from prs to pls
      for ir in 1..ne loop
         il:= il+1; pls.te(il):= prs.te(ir);
      end loop;
      pls.n:= il;
      jr:= 0; -- move the remaining elems in prs
      for ir in ne+1..prs.n loop -- to the initial positions of prs
         jr:= jr+1; prs.te(jr):= prs.te(ir);
      end loop;
      prs.n:= jr;
   else -- nt > max_leaf and pls.n > prs.n : move from l to r
      ne:= (pls.n-prs.n)/2; -- num. of elements to move
      for ir in reverse 1..prs.n loop -- move current elems in prs
         prs.te(ir+ne):= prs.te(ir);  -- ne positions right
      end loop;
      prs.n:= prs.n + ne;
      il:= pls.n; -- move last ne elems from pls to prs
      for ir in reverse 1..ne loop
         prs.te(ir):= pls.te(il); il:= il-1;
      end loop;
      pls.n:= il;
   end if;
   kprs:= prs.te(1).k;
end distribute_leaf;


procedure distribute_interior(pls, prs: in pnode; kprs: in out key) is
   nt: natural;     -- total number of children
   ne: natural;     -- number of elements to move
   il, jr: natural; -- indexes over pls and prs respectively
begin
   nt:= pls.n + prs.n + 2; -- total number of children
   if nt <= max_int + 1 then -- all the children from prs can move to pls
      il:= pls.n + 1;
      pls.tp(il):= prs.tp(0); pls.tk(il):= kprs;
      for ir in 1..prs.n loop
         il:= il+1;
         pls.tp(il):= prs.tp(ir); pls.tk(il):= prs.tk(ir);
      end loop;
      pls.n:= il; prs.n:= 0;
   elsif pls.n < prs.n then -- (and nt > max_int+1): move from r to l
      ne:= (prs.n-pls.n)/2; -- num. of elements to move
      -- move the first ne elements from prs to pls
      il:= pls.n+1;
      pls.tp(il):= prs.tp(0); pls.tk(il):= kprs;
      for ir in 1..ne-1 loop
         il:= il+1;
         pls.tp(il):= prs.tp(ir); pls.tk(il):= prs.tk(ir);
      end loop;
      pls.n:= il;
      -- move the remaining elems in prs to the initial positions of prs
      jr:= 0;
      prs.tp(jr):= prs.tp(ne); kprs:= prs.tk(ne);
      for ir in ne+1..prs.n loop
         jr:= jr+1; prs.tp(jr):= prs.tp(ir); prs.tk(jr):= prs.tk(ir);
      end loop;
      prs.n:= jr;
   else -- nt > max_int+1 and pls.n > prs.n : move from l to r
      ne:= (pls.n-prs.n)/2; -- num. of elements to move
      -- move current elems in prs ne positions right
      for ir in reverse 1..prs.n loop
         prs.tp(ir+ne):= prs.tp(ir); prs.tk(ir+ne):= prs.tk(ir);
      end loop;
      prs.tp(ne):= prs.tp(0); prs.tk(ne):= kprs;
      prs.n:= prs.n + ne;
      -- move last ne elems from pls to prs
      il:= pls.n;
      for ir in reverse 1..ne-1 loop
         prs.tp(ir):= pls.tp(il); prs.tk(ir):= pls.tk(il);
         il:= il-1;
      end loop;
      prs.tp(0):= pls.tp(il); kprs:= pls.tk(il);
      il:= il-1;
      pls.n:= il;
   end if;
end distribute_interior;


The auxiliary operation remove_child
This operation is used by remove0 when a child of an interior node has become empty because all of its elements have been moved to its left sibling by a distribute operation. If so, all the keys and pointers at the right of the empty child have to be shifted left to fill the gap. It receives as parameters a pointer p to an interior node and iq, the position in p of the child that has become empty. This position is never zero.

procedure remove_child(p: in pnode; iq: in positive) is
   n:  natural    renames p.n;
   tk: t_keys     renames p.tk;
   tp: t_pointers renames p.tp;
begin
   for i in iq..n-1 loop
      tk(i):= tk(i+1); tp(i):= tp(i+1);
   end loop;
   n:= n-1;
end remove_child;
The cost of the operations
It is rather obvious that the execution time of all the operations but empty is O(log n), and that the operation empty is O(1). However, it is important to realize that B+ trees are actually very flat. Let us consider, for illustrating purposes, a data set of n = 6,000,000 items, having data sizes for keys, data and pointers that lead to the realistic values max_leaf=52 and max_int=340.
In the best case, there would be n' = ⌈n/52⌉ = 115,385 leafs. The root would keep 341 pointers and the nodes at the intermediate level would keep 341 pointers too. Therefore, only one intermediate level is required because 341·341 = 116,281 > n'. Consequently, an item is found in 3 disk accesses: one for the root, one for the intermediate level and one for the leaf.
In the worst case, the leafs would be filled to only one half of their capacity. So, the number of leafs in the worst case is n'' = ⌈n/26⌉ = 230,770. Moreover, the root may contain only two pointers and the nodes at the intermediate levels must contain at least 171 pointers. Therefore, one more intermediate level is required because

2·171·171 = 58,482 < n''   but   2·171·171·171 = 10,000,422 > n''

So, if a third intermediate level existed, some of the nodes in the intermediate levels would have to contain less than 171 pointers, which may not happen. Consequently, an item is found in 4 disk accesses: one for the root, two for the intermediate levels and one for the leaf. In a fast device, these four accesses may cost about 40 milliseconds.
Indeed, the operations for insertion and deletion can cost a bit more because some siblings may become involved too along the path. In the very worst case this implies at most twice as many nodes. Moreover, the updated nodes have to be written back to the disk. Therefore, in the very worst case, in which all the nodes in the path from the root have to be merged or distributed with a sibling, the total cost of an insertion or a removal can be up to four times the cost of a retrieval.


Variants of B+ trees: the family of B trees
In their origins, B trees were inspired by binary search trees and had a single type of node, because the internal nodes kept items too. B+ trees were an improvement over plain B trees because the fact that interior nodes only keep keys and pointers allows them to allocate a higher number of pointers, which makes the trees become much flatter. The principle of B* trees is the same as that of B+ trees, but the nodes (except the root) are filled at least to 2/3 of their maximum capacity. To achieve this goal, the operations of distribution involve three siblings instead of two.
The 2-3 trees are an adaptation of the principles of B+ trees for primary storage applications: the leaves have a single item and the interior nodes have at least one key and at most two (ie. two or three pointers respectively). The general B+ tree algorithms adapted to these particular constraints lead to a structure that is roughly equivalent to a perfectly balanced binary search tree, where a retrieval requires making one key comparison along a path of log2 n nodes and, in the worst of the cases, two key comparisons along a shorter path of log3 n nodes, although the difference is not great because 2·log3 n = 1.26·log2 n. Indeed, the space of memory reserved for keys in interior nodes is an additional space consumption compared to binary search trees.

4.10 Tries


Overview
For alphanumeric keys or, in general, for keys that have the form of arrays having their components in a small range (ie. the same kind of keys that allow the Binsort algorithm studied in section 5.7 of vol. III to be applied), there is the possibility to organize the set according to the key components. For instance, the set {blind, blow, blue, tooth, train, training, traverse, tree} can be represented by a tree where all the keys having the same initial letter pend from the same subtree of the root; then each of those subtrees is organized likewise according to the second component of the key, and so on. Indeed, each leaf corresponds to a key in the set, but some interior nodes may represent keys as well, if a key is a proper prefix of another one. To make the appropriate distinction, an additional child must be appended to each node that actually represents a key, associated to a special symbol that acts as an end mark, as in the figure below:

[Figure: a trie storing the example set; branches are labelled by letters and the end mark $ identifies the nodes that represent complete keys]


Such a structure is called a trie, which is a neologism that results from the contraction of tree and retrieval. To avoid confusion with other trees, it is usually pronounced as though it rhymed with pie. It is clear that finding whether a key is stored in a trie requires at most as many steps as there are symbols in the key. As the key lengths are usually bounded and, in any case, they are independent from the number of keys stored, we can expect that the execution time for searches, as well as for most of the other operations, is O(1).
The description of tries below assumes that they implement pure sets (ie. not mappings) and also that the keys are strings of symbols. This is reasonable because it simplifies the description of the algorithms and also because the most frequent application of tries is spell checkers. However, at the end of the section, the ways to extend the technique to implement mappings and to apply it to more complex types of keys are discussed.

The type declaration
Each node of a trie is itself a mapping from a symbol to a pointer. In the usual case of alphanumeric keys, the symbols are characters, often in a small range such as digits or lower case letters. In those cases, the easiest implementation of the mapping is by arrays indexed directly by those symbols.
It is essential to distinguish whether a node has a child labelled by $, which means that the node represents an actual key value, or it does not have such a child, which means that that node corresponds only to a proper prefix of other keys. However, children labelled by $ are leafs and, as the leafs of a trie contain no information at all, it would be a wasteful choice to use memory space to allocate them. A better convention is to associate the symbol $ with a null value if no such child exists, and to set a pointer to the node itself otherwise.
Indeed, the generic part of the package must specify that the key can not be an arbitrary type but an array whose components are of a discrete type. According to these principles, the appropriate declaration is:
generic
   type key_component is (<>);  -- any discrete type
   type key_index is range <>;  -- some range of integers
   type key is array(key_index) of key_component;
package d_set is
   -- Exactly as in section 4.2 (sets).
private
   type node;
   type pnode is access node;
   type node is array(key_component) of pnode;
   type set is
      record
         root: pnode;
      end record;
end d_set;


The keys
When variable length keys are stored in a fixed length array, some kind of end mark is required. As for the radix sort, we shall assume that the special symbol that works as end mark is key_component'first. We shall assume also that the keys are well constructed, ie. that such a mark is always present in the provided keys.
To improve the readability of the algorithms, a few constants are declared in the body of d_set that stand as shortcuts for their wordy equivalents:

mk:    constant key_component:= key_component'first; -- end mark
lastc: constant key_component:= key_component'last;
i0:    constant key_index    := key_index'first;
Empty
The empty set is represented by a tree that has only a root node. Such a tree represents the acceptance of the empty string as a prefix. However, as the empty string is not a key, the root has no child labelled $. Consequently:

procedure empty(s: out set) is
   root: pnode renames s.root;
begin
   root:= new node;
   root.all:= (others => null);
end empty;

Is in
The search for a key in a trie is a traversal of a tree's branch, going down the tree from the root along the corresponding path, switching according to the corresponding symbols in the key. Indeed, if for a given symbol in the key no corresponding branch is found, the traversal is interrupted. If the end mark of the key is reached, it must be checked that the corresponding node actually has a child labelled $, ie. that the pointer that corresponds to $ points to the node itself:

function is_in(s: in set; k: in key) return boolean is
   root: pnode renames s.root;
   p: pnode;
   i: key_index;
   c: key_component; -- a copy of k(i)
begin
   p:= root; i:= i0; c:= k(i);
   while c/=mk and p.all(c)/=null loop
      p:= p.all(c); i:= i+1; c:= k(i);
   end loop;
   return c=mk and p.all(mk)=p;
end is_in;


Put
The insertion algorithm is very much like the search algorithm. However, in this case, when no child exists for the current symbol in the key, a new child is created for that symbol and the traversal goes on up to the end of the key. The node that corresponds to the last significant symbol in the key is pointed to itself to indicate that that node actually represents a key and therefore has, conceptually, a child labelled $:

procedure put(s: in out set; k: in key) is
   root: pnode renames s.root;
   p, r: pnode;
   i:    key_index;
   c:    key_component; -- a copy of k(i)
begin
   p:= root; i:= i0; c:= k(i);
   while c/=mk loop
      if p.all(c)=null then -- no child exists for c: create that child
         r:= new node;
         r.all:= (others => null);
         p.all(c):= r;
      end if;
      p:= p.all(c); i:= i+1; c:= k(i);
   end loop;
   p.all(mk):= p; -- if p.all(mk)=p previously, the key already existed
exception
   when storage_error => raise space_overflow;
end put;
Remove
The simplest approach to remove is to reproduce the search algorithm and, in the case of a successful search, to replace with null, in the last reached node, the pointer that identifies that that node actually corresponds to a key:

procedure remove(s: in out set; k: in key) is -- Preliminary version
   root: pnode renames s.root;
   p: pnode;
   i: key_index;
   c: key_component; -- a copy of k(i)
begin
   p:= root; i:= i0; c:= k(i);
   while c/=mk and p.all(c)/=null loop
      p:= p.all(c); i:= i+1; c:= k(i);
   end loop;
   if c=mk and p.all(mk)=p then -- the element exists
      p.all(mk):= null;         -- actual removal
   end if;
end remove;

This version is correct, to the extent that the deleted key will not be found any longer. Nevertheless, it does not release the nodes that may become superfluous. For instance, if this algorithm is applied to delete the key reverse in the example shown at the beginning of the section, the child labelled $ of the node associated to the prefix reverse would actually disappear, but the branch associated to the suffix verse would remain unnecessarily attached to the structure.
Consequently, the previous version must be improved by keeping a pointer to the last node in the traversal that does not have a unique descendant, and also the symbol that was used to switch down from that node. If the key to be deleted is actually found in the set, and it is not a prefix of another key, then the full branch can be disconnected from the structure and its nodes released. The root is somewhat exceptional because, if the spurious branch pends from the root, it can be removed even if it is its only branch. The improved algorithm follows:
procedure remove(s: in out set; k: in key) is
   root: pnode renames s.root;
   p, r: pnode;
   i:    key_index;
   c, cr: key_component;
begin
   p:= root; i:= i0; c:= k(i); r:= p; cr:= c;
   while c/=mk and p.all(c)/=null loop
      if not unique_desc(p) then r:= p; cr:= c; end if;
      p:= p.all(c); i:= i+1; c:= k(i);
   end loop;
   if c=mk and p.all(mk)=p then -- the element exists
      if unique_desc(p)         -- is not a prefix of a longer key
      then r.all(cr):= null;    -- unlink spurious branch, if any
      else p.all(mk):= null;    -- remove only the $ branch
      end if;
   end if;
end remove;

where

function unique_desc(p: in pnode) return boolean is
   c: key_component;
   nd: integer; -- number of descendants
begin
   c:= mk; nd:= 0;
   while c/=lastc and nd<2 loop
      if p.all(c)/=null then nd:= nd+1; end if;
      c:= key_component'succ(c);
   end loop;
   if p.all(c)/=null then nd:= nd+1; end if; -- check the case c=lastc
   return nd<2;
end unique_desc;

Is empty
As the empty set is represented by a single node (ie. the root) with no descendants, checking whether the set is empty is very similar to the algorithm for unique_desc, applied to the root. It is left as an exercise to the reader.

The cost of the operations
It is clear that the operations for inserting, searching or deleting a key make at most as many steps as there are symbols in the key, and that at each step they make operations that take constant time. Consequently, the execution time of these operations is independent from the number of keys stored and is therefore O(1) in all cases.
Extending the basic model: mappings
If data items have to be associated to the keys, it suffices to replace the pointers labelled with $ by a pointer of a different type that addresses the data associated to that key, as sketched below.
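A possible adaptation of the private part (a sketch, assuming an additional generic parameter "type item is private;"; here the presence of the data pointer plays the role of the child labelled $):

type node;
type pnode is access node;
type pitem is access item; -- data associated to a key
type children is array(key_component) of pnode;
type node is
   record
      c: children; -- one branch per symbol, as before
      x: pitem;    -- /= null if and only if the node represents a key
   end record;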

Extending the basic model: symbols
As noticed earlier, each node of a trie is a mapping from symbols to pointers. The description above has implemented such a mapping by arrays indexed by symbols, which is the simplest and most readable approach. However, there are some drawbacks in this approach: for instance, the nodes near the root are reasonably full of significant pointers, whereas the nodes near the leafs contain very few of them, and consequently the use of memory space is far from efficient. Arrays, or even linked lists, of couples symbol-pointer should be considered as reasonable alternatives, as in the sketch below.
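For instance, a node might keep a small sorted array of couples instead of one component per symbol (a sketch: max_branch is a hypothetical capacity, and a realistic implementation would let this collection grow, eg. by using a linked list instead):

type node;
type pnode is access node;
type couple is
   record
      c: key_component;
      p: pnode;
   end record;
max_branch: constant natural:= 8; -- hypothetical per-node capacity
type couples is array(1..max_branch) of couple;
type node is
   record
      n:  natural; -- number of significant couples
      tc: couples; -- sorted by symbol, searched by binary search
   end record;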
These alternatives also allow a wider range of key components to be used. For instance, programs that play chess have to decide the best move for a given configuration. In the middle of a game this is achieved by intensive computing, but at the beginning the moves are very well studied and recorded in the books as openings. Tries would be a very suitable option to store openings, with the branches labelled by moves and the nodes stored in a direct file. Such a structure would require a single reading operation upon that file per move to get a set of reasonable moves for that configuration. Indeed, a move is a type too complex to be used as an index for an array, so a different alternative should be considered to implement the mapping from moves to pointers. An array of couples move-pointer may be a reasonable choice.

4.11 Hashing

4.11.1 Introduction

Any implementation of a mapping converts a key value into the position where the associated data is stored. A possible approach is to compute that position from the value of the key itself. For instance, if the keys are strings and the actual data are stored in an array indexed from 0 to b-1, a possible approach might be to do some operation with the values of the characters in the string that finally leads to a number in the range [0, b-1].
There are many ways to achieve this goal, from the simplest to the most sophisticated. A simple approach, which is far from the best but may serve as a


first approximation to understand the technique, might be the function below. Better approaches will be discussed later.

function hash(k: in string) return natural is
   s, p: natural;
begin
   s:= 0;
   for i in k'range loop
      p:= character'pos(k(i));
      s:= (s+p) mod b;
   end loop;
   return s;
end hash;
Indeed, as the cardinality of the possible key values is far greater than the number of values that are expected to be stored, ie. b, such a hash function can never map different keys always to different positions. This means that different key values can be mapped to the same position, and these collisions shall be managed in some way. There are two approaches to deal with this problem, which are called open and closed hashing respectively.

4.11.2 Open hashing

Overview
In open hashing, the synonyms, ie. the elements with keys having the same value of hash(k), are organized in a linked list. The figure below illustrates the structure:

[Figure: a dispersion table dt, indexed from 0 to b-1, each component pointing to a linked list of synonyms]

The hash function hash takes the key k and converts it into a natural number in the range [0, b-1], which is used as an index into a dispersion table dt. Each component of dt is a pointer to a list of elements having keys k' such that hash(k') = hash(k).
A good hashing function should distribute the possible key values evenly among the components of dt. If so, we may expect that, after storing n different elements, all the lists of synonyms have lengths that are approximately n/b. The value α = n/b is called the load factor of the hash table. If we choose a value for b that makes α = 1, the average length of these lists will be 1, so we may expect the search and modification algorithms to be very fast. A detailed probabilistic analysis is developed below.


The type declaration
According to these principles, the declaration of the type becomes:

generic
   type key  is private;
   type item is private;
   with function hash(k: in key; b: in positive) return natural;
   size: positive; -- please supply a prime number
package d_set is
   -- Exactly as in section 4.2 (mappings).
private
   b: constant natural:= size;
   type node;
   type pnode is access node;
   type node is
      record
         k:    key;
         x:    item;
         next: pnode;
      end record;
   type dispersion_table is array(natural range 0..b-1) of pnode;
   type set is
      record
         dt: dispersion_table;
      end record;
end d_set;
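As an illustration, a hypothetical instantiation for a mapping from naturals to floats might be as follows (the names and the trivial hash function are merely illustrative):

function hash_nat(k: in natural; b: in positive) return natural is
begin
   return k mod b; -- good enough when the keys are evenly spread
end hash_nat;

package nat_to_float is new d_set
   (key => natural, item => float, hash => hash_nat, size => 1009);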
Empty
The empty set is represented by a dispersion table where all the lists are empty. Setting them empty is as simple as:

procedure empty(s: out set) is
   dt: dispersion_table renames s.dt;
begin
   for i in 0..b-1 loop
      dt(i):= null;
   end loop;
end empty;
Put
The insertion of an element in the set is achieved by applying the hash function to the key to obtain a position in the dispersion table, from which a pointer is got. Then, the list that starts from that pointer has to be traversed to discard the storage of duplicated keys. If no element with the same key exists in the list, a new block has to be requested to allocate the new element. Then the new block has to be inserted in the list. Indeed, the insertion place that makes the algorithm easiest is the beginning of the list. The detailed algorithm is:

procedure put(s: in out set; k: in key; x: in item) is
   dt: dispersion_table renames s.dt;
   i: natural;
   p: pnode;
begin
   i:= hash(k, b); p:= dt(i);
   while p/=null and then p.k/=k loop
      p:= p.next;
   end loop;
   if p/=null then raise already_exists; end if;
   p:= new node;              -- create cell
   p.all:= (k, x, null);      -- fill cell
   p.next:= dt(i); dt(i):= p; -- insert at the beginning of the i-th list
exception
   when storage_error => raise space_overflow;
end put;

Get and update
Likewise, retrieving the data associated to a key is achieved by applying the hash function to the key to obtain a position in the dispersion table, from which a pointer is got. Then, an element with a key equal to the provided one has to be searched for in the list that starts from that pointer:
procedure get(s: in set; k: in key; x: out item) is
   dt: dispersion_table renames s.dt;
   i: natural;
   p: pnode;
begin
   i:= hash(k, b); p:= dt(i);
   while p/=null and then p.k/=k loop
      p:= p.next;
   end loop;
   if p=null then raise does_not_exist; end if;
   x:= p.x;
end get;

The algorithm for update is exactly the same but replacing x:= p.x; with
p.x:= x;
Remove
The same principle applies to the removal. The hash function has to be applied to the key to obtain a position in the dispersion table. From this position a pointer is got to the beginning of the only list that may contain an element with that key. Then, an element with a key equal to the provided one has to be searched for in the list that starts from that pointer and, in the case of success, the element is actually removed from the list. Indeed, the removal requires that a pointer to the previous element in the list is kept along the traversal.
procedure remove(s: in out set; k: in key) is
   dt: dispersion_table renames s.dt;
   i:     natural;
   p, pp: pnode;
begin
   i:= hash(k, b); p:= dt(i); pp:= null;
   while p/=null and then p.k/=k loop
      pp:= p; p:= p.next;
   end loop;
   if p=null then raise does_not_exist; end if;
   if pp=null then dt(i):= p.next; else pp.next:= p.next; end if;
end remove;
Cost analysis
After the insertion of n elements, they are distributed in b lists, so the average length of the lists is n/b. As we have assumed that the size of the dispersion table is chosen to match the number of elements that are expected to be stored, the average length of the lists is 1. Indeed, if the hash function does not perform well, this technique has a worst case in which all the elements go to the same list. If so, this technique would perform as badly as an unsorted linked list. Instead, if the hash function is good, we may expect that the lists will in general be short. Therefore, the algorithms for insertion, deletion and retrieval, which make as many steps as there are elements in these lists, are efficient too. However, it is required to make the concept of "short" more precise. In other words, we have to establish the probabilistic bounds for these lengths and, in particular, the average length of the non-empty lists.
In the study of these bounds, we shall assume that the hash function is so good that each key is equally likely to hash to any of the positions of the dispersion table. Later we shall study the algorithms to obtain hash functions that achieve this goal reasonably well.
Let us consider one of the lists. The probability that this list is empty is the
probability that along the n insertions, the selected list has always been another
one. Therefore:

$$p_0 = \left(\frac{b-1}{b}\right)^n$$

As the total number of lists is b, the total number of non empty lists is $b(1-p_0)$ and the average length of the non empty lists is

$$L = \frac{n}{b(1-p_0)}$$

This length is a growing function of n, because $(b-1)/b < 1$ implies that $((b-1)/b)^n$ is decreasing. However, if the size of the dispersion table is adjusted to make its load factor $\alpha = 1$ (ie. $b = n$), the growth of L is bounded and for large sets we have that

$$\lim_{n\to\infty} \frac{n}{n\left(1-\left(\frac{n-1}{n}\right)^n\right)} = \frac{1}{1-1/e} = \frac{e}{e-1} \approx 1.582$$

As this is the average length of the non empty lists, and all these lists have
at least one element, we may expect that very few of them will have more than
two.
Another way to realize how short these lists are is to consider the probability that a given list has length i. For i = 0 we have established it above. The probability that after n insertions a given list has length exactly one is the probability that this list has been selected once and has not been selected for the remaining n-1 insertions. However, the actual selection can occur in any one of the n insertions. Consequently,

$$p_1 = n\left(\frac{1}{b}\right)\left(\frac{b-1}{b}\right)^{n-1}$$

In general, the probability that after n insertions a given list has length i is the probability that it has been selected i times and has not been selected the remaining n-i. However, the actual selections can be any i among the total n, ie. $C_n^i$:

$$p_i = \binom{n}{i}\left(\frac{1}{b}\right)^i\left(\frac{b-1}{b}\right)^{n-i}$$

Indeed, the probability that a list has length $l \le i$ is $P_i = \sum_{j=0}^{i} p_j$ and the probability that a list has length $l > i$ is $\bar{P}_i = 1 - P_i$.
The following figures show how small the probability of having long lists is, no matter how many items are stored in the set:

    b = n       $\bar{P}_3$     $\bar{P}_4$     $\bar{P}_5$
    10          1.279%          0.164%          0.015%
    100         1.837%          0.343%          0.053%
    1000        1.891%          0.362%          0.057%
    10,000      1.914%          0.382%          0.076%

The conclusion of this analysis is that the operations for insertion, retrieval
and deletion do not require more than a limited number of steps, which is
independent of the number of elements stored. Therefore, these operations can
be considered to be O(1).

4.11.3 Closed hashing

Overview

In closed hashing all the elements are stored in the dispersion table itself. In case of a collision, an alternative position is selected in the same table, by a mechanism called rehashing.


The type declaration


A closed hashing implementation for a set requires only a single dispersion table. Each component of the table must store a key and its associated data, plus a mark that tells whether that position is actually occupied or not.
Indeed, for closed hashing the number of elements stored can not exceed the number of components of the dispersion table, so $\alpha = n/b \le 1$. However, as we shall see in the section devoted to the cost analysis of closed hashing, good performance requires that $\alpha < 0.8$. For this reason, the exception of capacity exceeded shall be raised when this limit is reached. To make this test easy, an integer component keeps the number of elements currently in the set.
According to these principles, the type declaration is:
generic
   -- Exactly as for open hashing (see section 4.11.2).
package d_set is
   -- Exactly as in section 4.2 (mappings).
private
   b:      constant natural:= size;
   max_ne: constant natural:= size*8/10;   -- capacity limit: load factor 0.8

   type cell_state is (free, used);

   type cell is
      record
         k:  key;
         x:  item;
         st: cell_state;
      end record;

   type dispersion_table is array(natural range 0..b-1) of cell;

   type set is
      record
         dt: dispersion_table;
         ne: natural;   -- number of elements currently stored
      end record;
end d_set;
Empty
To represent the empty set, all the components of the dispersion table must be marked as free and the number of elements set to zero:
procedure empty(s: out set) is
   d:  dispersion_table renames s.dt;
   ne: natural renames s.ne;
begin
   for i in 0..b-1 loop d(i).st:= free; end loop;
   ne:= 0;
end empty;


Put (linear rehashing)


The insertion of an element in the set is achieved by applying the hash function to the key to obtain a position in the dispersion table. If that component is free, the key and its data are copied in that place and the place is marked as used. Otherwise an alternative position must be selected. The simplest way to do it is to search for the first unused component that is found after the selected position (circularly). This technique is called linear rehashing. It is not the best approach to solve the problem, but its simplicity makes it the most appropriate to understand the basic principle of the technique. Later we shall consider a better approach. The detailed algorithm is:
procedure put(s: in out set; k: in key; x: in item) is
   d:  dispersion_table renames s.dt;
   ne: natural renames s.ne;
   i0, i: natural;   -- initial and current position
   na: natural;      -- number of attempts
begin
   i0:= hash(k, b); na:= 0; i:= i0;
   while d(i).st=used and then d(i).k/=k loop
      na:= na+1; i:= (i0+na) mod b;
   end loop;
   if d(i).st=used then raise already_exists; end if;
   if ne=max_ne then raise space_overflow; end if;
   d(i).st:= used;            -- mark cell as used
   d(i).k:= k; d(i).x:= x;    -- fill cell
   ne:= ne+1;                 -- count cell
end put;

Get and update (linear rehashing)


Retrieval and updating of elements in the set follow the same principles:
procedure get(s: in set; k: in key; x: out item) is
   d: dispersion_table renames s.dt;
   i0, i: natural;   -- initial and current position
   na: natural;      -- number of attempts
begin
   i0:= hash(k, b); na:= 0; i:= i0;
   while d(i).st=used and then d(i).k/=k loop
      na:= na+1; i:= (i0+na) mod b;
   end loop;
   if d(i).st=free then raise does_not_exist; end if;
   x:= d(i).x;
end get;
Like in any other implementation, update is exactly the same as get but
replacing x:= d(i).x; by d(i).x:= x;.

Put (quadratic rehashing)


Linear rehashing has the inconvenience that lists may become clustered if the hash function maps two sets of keys to two positions in the dispersion table that are near to each other. In this case, the two lists overlap and the number of steps of the operations for storing or retrieving an element can become twice as many as would otherwise be required.
Quadratic rehashing avoids this drawback by making jumps that seldom collide when the rehashing starts from different positions:
procedure put(s: in out set; k: in key; x: in item) is
   d:  dispersion_table renames s.dt;
   ne: natural renames s.ne;
   i0, i: natural;   -- initial and current position
   na: natural;      -- number of attempts
begin
   i0:= hash(k, b); i:= i0; na:= 0;
   while d(i).st=used and then d(i).k/=k loop
      na:= na+1; i:= (i0+na*na) mod b;
   end loop;
   if d(i).st=used then raise already_exists; end if;
   if ne=max_ne then raise space_overflow; end if;
   d(i).st:= used;            -- mark cell as used
   d(i).k:= k; d(i).x:= x;    -- fill cell
   ne:= ne+1;                 -- count cell
end put;

Get and update (quadratic rehashing)


Likewise, if quadratic rehashing is applied, the algorithms for retrieval and update become:
procedure get(s: in set; k: in key; x: out item) is
   d: dispersion_table renames s.dt;
   i0, i: natural;   -- initial and current position
   na: natural;      -- number of attempts
begin
   i0:= hash(k, b); i:= i0; na:= 0;
   while d(i).st=used and then d(i).k/=k loop
      na:= na+1; i:= (i0+na*na) mod b;
   end loop;
   if d(i).st=free then raise does_not_exist; end if;
   x:= d(i).x;
end get;
Analysis of quadratic rehashing
Linear rehashing naturally traverses all the components of the dispersion table. Instead, it is less clear that quadratic rehashing can reach any component in the table. Actually, it does not. Nevertheless, if b is a prime number, it can be guaranteed that at least one half of the components of the table are scanned without visiting the same component twice. This is because, for a component to be reached twice, it should happen that two different attempts $n_1$ and $n_2$ ($0 \le n_1 < n_2 < b$) satisfy that

$$(i_0 + n_1^2) \bmod b = (i_0 + n_2^2) \bmod b$$

or, equivalently,

$$0 = (n_2^2 - n_1^2) \bmod b = \left((n_2 + n_1)(n_2 - n_1)\right) \bmod b$$


As $0 < n_2 - n_1 < b$ and b is prime, this implies that $n_2 + n_1$ must be a multiple of b, and therefore that $n_2 + n_1 \ge b$. In turn, this implies that $n_2 > b/2$ because $n_1 < n_2$.
As the lists in closed hashing are not expected to be that long, as we shall see below, quadratic rehashing can be considered satisfactory enough to solve rehashing, because it provides an almost equiprobable distribution among the not yet reached positions in the array and it requires little computational effort.
Remove

Removal in closed hashing is cumbersome. We can not simply mark the component containing the deleted key as free, because that component may not be the last one in its list. If so, such a mark would make the components of the list that come after the deleted one become inaccessible. Therefore, if removal is required, the mark should have a third possible state, namely deleted.
Then, the algorithms for insertion and retrieval should skip the components marked as deleted, because only the components marked as free actually identify the end of the list. Indeed, the algorithm for insertion, after checking that no key equal to that to be inserted is in the set, must reuse one of the components marked as deleted to allocate the new element, if any has been found. A sketch of this mechanism is given below.
Although this mechanism is possible, in principle, it has the drawback of making the lists grow longer than would be required in the absence of removals (for the same number of elements finally stored). Alternatively, the elements of the list after the deleted element might be shifted down to fill the gap. However, that algorithm is not simple, because only the elements with a key that is mapped to the same initial position have to be shifted. If not, they belong to other lists and must keep being in the same place.
To sum up, for applications that require frequent insertions and deletions, the right choice should be open hashing. Instead, when only insertions are foreseen, closed hashing can be simpler.
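As an illustration of the mechanism described above, the following is a possible remove for closed hashing with linear rehashing. It is a sketch only, not part of the package given earlier: it assumes that cell_state has been extended to (free, used, deleted) and that put and get have been adapted to step over deleted cells:

procedure remove(s: in out set; k: in key) is
   d:  dispersion_table renames s.dt;
   ne: natural renames s.ne;
   i0, i: natural;   -- initial and current position
   na: natural;      -- number of attempts
begin
   i0:= hash(k, b); na:= 0; i:= i0;
   -- only a free cell marks the end of a list; deleted cells are stepped over
   while d(i).st/=free and then (d(i).st=deleted or else d(i).k/=k) loop
      na:= na+1; i:= (i0+na) mod b;
   end loop;
   if d(i).st=free then raise does_not_exist; end if;
   d(i).st:= deleted;   -- the cell can not be marked free (see above)
   ne:= ne-1;
end remove;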
Cost analysis

As for open hashing, we shall assume that any key is equally likely to hash to
any of the b elements of the dispersion table. Moreover, we shall assume that
the rehashing function selects any one of the components not yet examined with
uniform probability.
Let $p_i$ and $q_i$ be:

$p_i$: probability of examining exactly i components in a successful insertion (or an unsuccessful retrieval) before finding a free component.

$q_i$: probability of examining at least i components in a successful insertion (or an unsuccessful retrieval) before finding a free component.


The average number of steps S in any of these operations is

$$S = 1 + \sum_{i=1}^{\infty} i\,p_i$$

However, on one hand, since $p_i = q_i - q_{i+1}$, we have that

$$\sum_{i=1}^{\infty} i\,p_i = \sum_{i=1}^{\infty} i\,q_i - \sum_{i=1}^{\infty} i\,q_{i+1} = \sum_{i=1}^{\infty} i\,q_i - \sum_{i=2}^{\infty} (i-1)\,q_i = \sum_{i=1}^{\infty} q_i$$

On the other hand, $q_1$ is the probability that the first component examined is occupied. Therefore $q_1 = n/b$, as there are n occupied components in the table. Likewise, $q_2$ is the probability that the first component examined is occupied and the second one is occupied too. Under the assumption that the rehashing function selects (equiprobably) one component among the $b-1$ not yet examined, as $n-1$ of these components are occupied, $q_2 = \frac{n}{b}\cdot\frac{n-1}{b-1}$ and, in general:

$$q_i = \frac{n}{b}\cdot\frac{n-1}{b-1}\cdots\frac{n-i+1}{b-i+1} < \left(\frac{n}{b}\right)^i = \alpha^i \qquad (\text{because } n < b)$$

Therefore

$$S = 1 + \sum_{i=1}^{\infty} q_i < \sum_{i=0}^{\infty} \alpha^i = \frac{1}{1-\alpha} \qquad (\text{because } \alpha < 1)$$

As $\alpha$ gets close to 1, this upper bound for the number of steps of the operations of insertion and retrieval grows extremely fast. Nevertheless, for $\alpha < 0.8$ this upper bound is rather small: $S < 5$. Consequently, if the algorithms guarantee that the load factor does not exceed 80%, we can guarantee that the number of steps is bounded by a small factor that is independent of the number of elements stored, and therefore that the operations for insertion and retrieval are O(1).
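As a quick numeric check of how this bound degrades with the load factor:

$$\alpha = 0.5 \Rightarrow S < 2, \qquad \alpha = 0.8 \Rightarrow S < 5, \qquad \alpha = 0.9 \Rightarrow S < 10, \qquad \alpha = 0.99 \Rightarrow S < 100$$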
An application remark

Closed hashing is very well suited to improve the implementation of expressions by cursor-based trees proposed in the solution to exercise 1 (section 3.5.10). In that implementation, building a new expression from its components was O(n), because checking if there existed a node equal to the new one was achieved by a linear search upon the created nodes. If the array that stores the nodes of the expressions is organized as a dispersion table of closed hashing, the performance of these operations is sped up to O(1). The reader is referred to section 7.2, where the solution to that exercise is revisited, for the details.


4.11.4 External hashing: static

The technique of hashing can be applied also to sets that have to be kept in secondary storage. As for B+ trees, in this case the goal is to minimize the physical accesses to disk that are required by the operations that consult or modify the set. Consequently, the dispersion table must be sized to fill a disk page and the lists of synonyms must link disk pages too, where each page must contain as many elements as possible, as the figure below shows:
[Figure: a dispersion table dt, indexed 0..b-1, whose components point to linked lists of data pages.]

To guarantee that the disk pages are reasonably filled, b must be set to $2n/M_L$, $M_L$ being the maximum number of elements that fit into a cell of a data list. If so, we may expect that on average the data cells are filled to about one half of their capacity, which implies that the data elements are spread through about b cells and, consequently, that the average length of the non empty lists is still $e/(e-1) \approx 1.58$.
Indeed, if the required size of the dispersion table is so large that it can not fit into a single disk page, additional disk pages can be used. Nevertheless, even for single disk page dispersion tables, the set sizes can be rather large. For instance, if we take the same sizes of disk pages, key type and item type of the example that illustrated how flat B+ trees are (see section 4.9), we had that $M_L = 50$. The disk page size considered (2KB) can keep a dispersion table of up to 512 pointers of 32 bits, $M_V = 509$ (the first prime number immediately below 512), which gives an upper bound for the size of the represented sets (to keep reasonable execution times) of $U = M_L M_V = 25{,}450$.

4.11.5 External hashing: extendible

Overview

Any of the variants of hashing considered hitherto assumed that the expected size of the sets to be represented is known beforehand and that, therefore, the size of the dispersion table can be established accordingly at compile time. However, in practice, this size can seldom be predicted, and the increase (or decrease) in size of the dispersion table has to be done at run time. According to the algorithms considered hitherto, a change in the size of the dispersion table implies a complete reorganization of the structure, in which new hash positions have to be computed for all the keys currently stored and new links have to be established accordingly. The cost of those changes is clearly unaffordable if it can be the result of any insertion or deletion, especially in the case of large sets.

Extendible hashing is a technique that allows the dispersion table to grow and shrink in a cost effective way. The technique may be adapted to primary storage too, although it has been developed for the implementation of indexes in database systems, and therefore having secondary storage in mind.

In this approach, the hash image of a key, ie. the result provided by the hash function, is a natural number with a reasonably large number of bits (ie. 32 or 64), in which each bit has a uniform probability to be either 0 or 1. From this image, only the least significant d bits will be taken into account and the other ones will be discarded. The value of d will be adjusted at run time, depending on the size of the dispersion table, which will grow and shrink at run time too.
hash image:

0100101101010010100001011000110101110110101101011010010111011101
                                                        |-- d --|

These d least significant bits are used to index the dispersion table. Each component of the dispersion table points to a disk page that contains the elements having keys with the same d least significant bits. It is important to remark that they point to a single page, ie. not to a page list. For d = 2, the structure consists of a table with four entries (00, 01, 10 and 11), each pointing to a single data page.

Indeed, the elements stored in the first data page have keys with a hash
image that ends with 00, those in the second page end with 01 and so on. Such
an organization allows a cost effective duplication in size of the dispersion table,
which can be achieved by simply copying the contents of the current table into
the new half part of the extended table:
[Figure: the dispersion table extended from four entries (00..11) to eight (000..111); each entry of the new half (100..111) points to the same data page as the corresponding entry of the old half.]

So extended, the dispersion table still addresses the right pages because, whether the new bit considered from the hash image of our key is 0 or 1, we get the pointer to the page that contains the element with that key. The advantage of having extended the dispersion table is that, in the case that a new element has to be added to a page that has been found completely full, we can split that page into two and redistribute its elements according to the newly considered bit. As the number of elements that fit into a page is expected to be somewhat large (for the realistic sizes considered in the example of section 4.9, each page could contain about 50 data elements) and the hash function is expected to generate bits with equal probabilities, we can expect that there will be room enough to allocate the new element after splitting the page into two.


In the example of the figure above, if a new element has to be added into the third page, which is full, we can allocate a new page, move there all the elements with keys having their hash image ending with 110, keep in the current page those that end with 010, and insert the new element into the page that corresponds to its key:

[Figure: after the split, entry 010 keeps the page with the keys whose hash image ends with 010, entry 110 points to the newly allocated page with those ending with 110, and the new element is inserted in the page corresponding to its key.]

Indeed, any attempt to add a new element to a completely full page requires splitting the page. However, not every page split requires extending the dispersion table. For instance, the addition of a new element into the first page requires splitting that page too, because it is also full. However, the dispersion table does not need to be extended again, because the split page groups the keys on the basis of the two last bits whereas the dispersion table keeps pointers on the basis of the three last bits.
To distinguish between the two situations, a new field, called the depth, is associated with both the data pages and the dispersion table. The depth of the dispersion table is called the global depth ($d_g$) and is the number of bits taken into account to index the table. Indeed, the size of the dispersion table is $b = 2^{d_g}$. The depth of a data page is called the local depth ($d_l$) and is the number of bits taken into account to group the elements in that page. In other words, the keys of all the elements stored in a page with local depth $d_l$ have their hash images with the last $d_l$ bits identical.
Given a data page with local depth $d_l$, the insertion of a new element in that page does not require a split of the page unless the page is found completely full. If so, a split occurs, and the new local depth of both the newly allocated page and the original page becomes $d_l' = d_l + 1$. If $d_l' > d_g$ then the dispersion table has to be extended and its new depth becomes $d_g' = d_g + 1$.

If the number of elements that fit into a page is reasonably large and the hash function is reasonably good, for all pages $d_l$ will differ at most by one from $d_g$. However, in general $d_l \le d_g$, and the number of components of the dispersion table that point to a given page is $2^{d_g - d_l}$. For instance, with $d_g = 5$, a page with local depth $d_l = 3$ is pointed to from $2^2 = 4$ positions of the table.

So, even though it may be unlikely, it is possible that very few keys have had a hash image ending with 00 compared with the other possible values. If so, we might reach the situation illustrated in the figure below, where $B_{00}$ represents a data page that contains elements having keys with their hash image ending with 00 (indeed, no other page may contain keys with their hash image ending with 00). The numbers enclosed by square brackets represent the attached depths.


In this situation, if a new insertion forces $B_{00}$ to be split, its local depth will be increased by one. Then, the old page and the new one shall be pointed to from the dispersion table accordingly.

To remove an element, first it must be looked for. So, the hash function has to be applied to the key to provide the hash image, and then the lowest $d_g$ bits are selected from that image to obtain an index to the dispersion table, from which we will get the pointer to the page containing the searched element, provided that the element is in the set. If so, we shall remove it from that page and then attempt to merge that page with another one, to release disk pages as long as possible.
Let us assume that $d_g = 6$ and that the lower $d_g$ bits of the hash image of k, the key to be deleted, are 010101. Using this value as an index to the dispersion table, we will get a pointer to the page containing the element with key k, if this element is in the set. Let us assume that it is, and that the local depth of this page is $d_l = 4$. Let us indicate that page as $B_{0101}[4]$. As $d_l < d_g$, several components of the dispersion table will point to that page. To be precise, these components are:
000101, 010101, 100101 and 110101 (ie. all the positions whose lower four bits are 0101).

If possible, the page $B_{0101}[4]$ should be merged with its split image, ie. the page $B_{1101}[4]$, to make the page $B_{101}[3]$. The merging will be possible only if $B_{1101}[4]$ does exist. It does not if it has been further split into $B_{01101}[5]$ and $B_{11101}[5]$. If so, the merging with $B_{0101}[4]$ is not possible until the latter two pages are merged in turn. In addition to the fact that the split image exists, the second requirement for merging is that the elements in both images fit in a single page.
If $B_{0101}[4]$ has a split image $B_{1101}[4]$, that split image will be pointed to from the following positions in the dispersion table:

001101, 011101, 101101 and 111101.

So, we may select 001101 and get the page pointed to from that position in the dispersion table. If the local depth of that page is also 4, the split image does exist. If the elements in $B_{0101}[4]$ and $B_{1101}[4]$ fit in a single page, then the elements of $B_{1101}[4]$ must be copied into $B_{0101}[4]$ to make $B_{101}[3]$, and the disk page that contained $B_{1101}[4]$ can be released.
Then, all the positions in the dispersion table that pointed to $B_{1101}[4]$ must be updated to point to $B_{101}[3]$. Indeed, after such a merging, it must be checked whether $B_{101}[3]$ may still be merged with its split image $B_{001}[3]$, and so on.
On completion of merging, it must be checked if the second half of the dispersion table is equal to the first half. If so, $d_g$ may be reduced by one (ie. the dispersion table is reduced by one half). Then it must be checked if the new second half of the table is equal to the first one, and so on.
Indeed, it may happen that the dispersion table grows so much that a single disk page is insufficient to keep it. For instance, if the page size is 2KB and the pointer size is 32 bits, a disk page can contain at most 512 pointers (which are addressed by 9 bit indexes). If $d_g$ becomes greater than this amount, the dispersion table must be organized as a tree. Then the highest bits of the hash image are used to select a page and the lower bits are used to select a component in that page. For instance, if $d_g = 11$ and the lowest 11 bits of the hash image of key k are 01101111010, those 11 bits are subdivided into the highest two, to select a page, and the lowest 9, to select a component in that page:

[Figure: the root table, indexed by the two highest bits (01), points to a secondary dispersion page; in that page, the component indexed by 101111010 points to the data page $B_{1101111010}[10]$.]

To sum up, extendible hashing is a very efficient way to store data in secondary storage. When the dispersion table fits in a single disk page, any data access requires exactly two disk accesses: one for the dispersion table and one for the data page. For very large sets, the retrieval of the dispersion table can take a few additional disk accesses.


The type declaration


According to these general principles, the representation of the type becomes as shown in the package below. The generic parameters have two slight differences with respect to the previous approaches: no size is provided and the hash function has no bounding parameter.
generic
   type key is private;
   type item is private;
   with function "="(k1, k2: in key) return boolean;
   with function hash(k: in key) return natural;
package d_set is
   -- Exactly as in section 4.2 (mappings).
private
   type node;
   type pnode is access node;

   type nodetype is (disp_node, data_node);

   bdi_disp: constant natural:= 10;           -- bin. digits to index table in disp page
   max_disp: constant natural:= 2**bdi_disp;  -- length of table in disp page
   max_data: constant natural:= 52;           -- length of table in data page

   type elem is
      record
         k: key;
         x: item;
      end record;

   type t_disp  is array(natural range 0..max_disp-1) of pnode;
   type t_elems is array(natural range 1..max_data)   of elem;

   type node(t: nodetype) is
      record
         case t is
            when disp_node =>
               td: t_disp;
            when data_node =>
               dl: natural;    -- local depth
               ne: natural;    -- number of elements
               te: t_elems;    -- table of elements
         end case;
      end record;

   type set is
      record
         root: pnode;
         dg:   natural;   -- global depth
      end record;
end d_set;


As in the case of B+ trees, the lengths of the arrays should be derived from the sizes of the involved types (pointers, keys, items) and the size of the physical disk pages. To avoid that these technicalities may distract the reader from the essential points, we have directly set some realistic values, assuming a disk page size of 4KB, a pointer size of 4B, a key size of 8B and data items of 70B.
As we have done with B+ trees too, to keep a unified notation with the rest of the book, the algorithms below describe the management of the nodes as if they were allocated in dynamic memory. In any practical implementation, the reader should replace any operation of the type p:= new node in the algorithms below by a request to the operating system for a new disk page.
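As an illustration, a client could instantiate the package as follows. This is a sketch only: the names demo_set, name and hash_name are hypothetical, it assumes d_set is compiled as a library unit, and any hash function with the declared profile would do:

with d_set;
procedure demo_set is
   subtype name is string(1..8);
   -- a crude weighted additive hash, kept small to avoid overflow
   function hash_name(k: in name) return natural is
      h: natural:= 0;
   begin
      for i in k'range loop
         h:= (h*31 + character'pos(k(i))) mod 2**20;
      end loop;
      return h;
   end hash_name;
   package name_sets is new d_set(key => name, item => integer,
                                  "=" => "=", hash => hash_name);
   s: name_sets.set;
   x: integer;
begin
   name_sets.empty(s);
   name_sets.put(s, "pressure", 1013);
   name_sets.get(s, "pressure", x);   -- x becomes 1013
end demo_set;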
Empty
The empty set is represented by a null pointer:
procedure empty(s: out set) is
begin
   s:= (null, 0);
end empty;
Put (single-page dispersion table)
We shall assume initially that the dispersion table is allocated in a single disk page. Once this simple version is understood, we will describe how to deal with a dispersion table expanding through several pages.
The operation put is divided in two levels. The top level deals with the exceptional cases in which the set is either completely empty or consists of a single data page. The procedure normal_put deals with the usual case in which the root node is a dispersion table.
procedure put(s: in out set; k: in key; x: in item) is
   root: pnode renames s.root;
   dg: natural renames s.dg;   -- global depth
   p: pnode;
begin
   if root=null then
      root:= new node(data_node);
      root.ne:= 0; root.dl:= 0;
      insert_data(root, k, x);
   elsif root.t=disp_node then
      normal_put(s, k, x);
   elsif is_in(root, k) then   -- (root.t=data_node)
      raise already_exists;
   elsif root.ne<max_data then
      insert_data(root, k, x);
   else   -- create the dispersion table
      p:= new node(disp_node);
      p.td(0):= root; root:= p; dg:= 0;
      normal_put(s, k, x);
   end if;
end put;


If the elements of the set can be placed in a single page, there is no need to build a dispersion table on top. When a single page is insufficient, a new page is requested to allocate the dispersion table. That table is initially created with a single entry, the current data page, and the new element causing its creation is then inserted according to the algorithm for the normal case, which is as follows:
procedure normal_put(s: in out set; k: in key; x: in item) is
   root: pnode renames s.root;
   dg: natural renames s.dg;   -- global depth
   p, ps: pnode;    -- data page and its split image
   hi: natural;     -- hash image of k
   hs: natural;     -- hash bits of higher sibling's image after split
   completed: boolean;
begin
   hi:= hash(k); completed:= false;
   while not completed loop
      p:= get_data_page(root, dg, hi);
      if is_in(p, k) then raise already_exists; end if;
      if p.ne<max_data then
         insert_data(p, k, x);
         completed:= true;
      else
         ps:= new node(data_node); split(p, ps, hs);
         if p.dl>dg then extend(root, dg); end if;
         insert_disp(root, ps, dg, hs);
      end if;
   end loop;
end normal_put;

Initially, the hash image of the key is computed, and then the pointer to the data page where the new element is to be inserted is obtained from the dispersion table in the root node, according to the lowest $d_g$ bits of the hash image. If there is room enough in that page to allocate the new element, it is placed there and that completes the operation. If the selected page is full, that page must be split, ie. the elements contained in that page have to be distributed into two pages (the original p and its sibling ps), according to an additional bit of the hash images of their keys (hitherto they were allocated according to the $d_l$ lower bits of their hash images, and now they will be allocated according to the $d_l + 1$ lower bits). Indeed, the local depth of the two new pages is one more than the local depth of the original page. If the latter value is greater than the global depth, then the global depth must be increased by one too, and the dispersion table has to be extended to twice its size, by copying the pointers currently in the table into the expanded area. Whether the table has been expanded or not, the pointer to the created sibling page ps has to be allocated into the dispersion table, at the positions corresponding to the lower $d_l$ bits of the keys currently allocated in it, which are provided by hs.

After splitting the page, we may expect that there will be room enough to allocate the element to be inserted. But it may happen that the split leaves all the elements in one of the pages and that the hash image of the element to be inserted corresponds to the full page. Therefore, the insertion of the new element must be deferred to the next step of the iteration, because it might be required to split the full page again. However, in practice, this is extremely unlikely. If the hash function is good, the probability that a page containing n elements is split into an empty page and a full page is $(1/2)^{n-1}$, and the probability that the new element can not be allocated after the first split is $(1/2)^n$. For the realistic sizes of our type declaration, this probability is $2.22\cdot10^{-16}$.
The operation that selects from the dispersion table the pointer to the appropriate data page, according to the hash image of a given key, is:

function get_data_page
   (root: in pnode; dg: in natural; hi: in natural) return pnode is
   hd: natural;
   p: pnode;
begin
   hd:= hi mod (2**dg); p:= root.td(hd);
   return p;
end get_data_page;
The algorithm for splitting a page into two is as follows. The new local depth of the two pages is one more than the old one, because now the elements will be distributed according to one more bit. The number of elements of the sibling page is initially zero. Then, the hash images of the current keys are computed again and, if the most significant bit (among those that correspond to the new local depth) is one, the element is moved from p to ps. Finally, the lower $d_l$ bits that are common to the hash images of the keys that are currently in ps are set into hs, which is returned as an out parameter.
procedure split
   (p: in pnode; ps: out pnode; hs: out natural) is
   hi: natural;      -- hash image of inspected keys
   hd: natural;      -- lower p.dl bits of hi
   s, s2: natural;   -- divisors to select lower bits of hi (s2=s/2)
   i: natural;       -- index to elements of p
begin
   p.dl:= p.dl+1; ps.dl:= p.dl; ps.ne:= 0;
   s:= 2**p.dl; s2:= s/2;
   i:= 0;
   while i<p.ne loop
      i:= i+1;
      hi:= hash(p.te(i).k); hd:= hi mod s;
      if hd>=s2 then
         ps.ne:= ps.ne+1; ps.te(ps.ne):= p.te(i);
         p.te(i):= p.te(p.ne); p.ne:= p.ne-1;
         i:= i-1;
      end if;
   end loop;
   hs:= hd; if hs<s2 then hs:= hs+s2; end if;
end split;

The operation that extends the dispersion table is:


procedure extend(root: in pnode; dg: in out natural) is
   td: t_disp renames root.td;
   s: natural;   -- size of td before extending it
begin
   s:= 2**dg;
   for i in 0..s-1 loop td(s+i):= td(i); end loop;
   dg:= dg+1;
end extend;
Finally, the procedure insert_disp inserts, into a dispersion table having global depth dg, the pointer p addressing a data page having local depth p.dl and containing keys whose hash images have their lower p.dl bits equal to hs. The number of positions where that pointer has to be assigned is $2^{d_g - d_l}$. The first of those positions is at hs (ie. hs extended on the left with $d_g - d_l$ zeroes). The next positions are those that correspond to hs extended on the left with any possible combination of $d_g - d_l$ bits, which are at distances equal to $2^{d_l}$:
procedure insert_disp
   (root, p: in pnode; dg: in natural; hs: in natural) is
   s:   natural;   -- length of step
   hps: natural;   -- positions to assign p
   nr:  natural;   -- number of pointer assignments
begin
   s:= 2**p.dl; hps:= hs mod s; nr:= 2**(dg-p.dl);
   for i in 1..nr loop
      root.td(hps):= p; hps:= hps+s;
   end loop;
end insert_disp;

Get and update (single-page dispersion table)


The procedure get deals with the special cases in which no dispersion table exists, whereas the procedure normal_get deals with the case in which it does:
procedure get(s: in set; k: in key; x: out item) is
   root: pnode renames s.root;
   dg: natural renames s.dg;   -- global depth
   found: boolean;
   i: natural;   -- position of the element with key k in root, if found
begin
   if root=null then
      raise does_not_exist;
   elsif root.t=data_node then
      search(root, k, found, i);
      if not found then raise does_not_exist; end if;
      get_data(root, i, x);
   else   -- root.t=disp_node
      normal_get(root, dg, k, x);
   end if;
end get;


procedure normal_get(root: pnode; dg: in natural; k: in key; x: out item) is
   p: pnode;        -- data page
   hi: natural;     -- hash image of k
   found: boolean;
   i: natural;      -- position in p of the element with key k, if found
begin
   hi:= hash(k);
   p:= get_data_page(root, dg, hi);
   search(p, k, found, i);
   if not found then raise does_not_exist; end if;
   get_data(p, i, x);
end normal_get;

The procedure update is exactly the same as get, but replacing get_data by update_data.
Remove (single-page dispersion table)
The procedure remove deals with the special cases in which no dispersion table exists, whereas the procedure normal_remove deals with the case in which it does. The former is rather simple:
procedure remove(s: in out set; k: in key) is
   root: pnode renames s.root;
   dg: natural renames s.dg;   -- global depth
   found: boolean;
   i: natural;   -- position of the element with key k in root, if found
begin
   if root=null then
      raise does_not_exist;
   elsif root.t=data_node then
      search(root, k, found, i);
      if not found then raise does_not_exist; end if;
      remove_data(root, i);
      if root.ne=0 then root:= null; end if;
   else   -- root.t=disp_node
      normal_remove(s, k);
      if dg=0 then root:= root.td(0); end if;
   end if;
end remove;
Instead, normal_remove requires a more detailed explanation. The first part is straightforward: it looks for the element to remove, in a way identical to the procedure normal_get, and deletes it from its data page, provided that it has been found.
If the element has been found and actually removed, the second part of normal_remove attempts to merge the updated page with its split image, if possible. This possibility depends on two facts: first, that such a split image does exist (ie. the original sibling has not been further split), and second, that there is room enough in a single page to place the elements currently stored in the two pages. Such merging must proceed as long as possible. As a result of previous remove operations, it may happen that some pages have become even empty, but there has not been the chance to merge them because their split image did not exist. Consequently, it may happen that, as a result of merging two pages, the new page has a split image that is almost empty, or even completely empty, and now there is the chance to merge it. Whenever a merge operation is completed, the pointer to the merged page must be placed at the required positions in the dispersion table, which is achieved by the procedure insert_disp described above.
If some pages have been merged, the third step of normal_remove checks if the dispersion table can be reduced to one half. This is possible if the contents of the second half are equal to the contents of the first half.
According to these considerations, the algorithm for normal_remove is:
procedure normal_remove(s: in out set; k: in key) is
   root: pnode renames s.root;
   dg: natural renames s.dg;   -- global depth
   p, ps: pnode;    -- data page and its split image
   hi: natural;     -- hash image of k
   adjusted: boolean;
   found: boolean;
   i: natural;
begin
   hi:= hash(k);
   p:= get_data_page(root, dg, hi);
   search(p, k, found, i);
   if not found then raise does_not_exist; end if;
   remove_data(p, i);

   adjusted:= false;
   ps:= split_image(root, p, hi);
   while p.dl>=1 and ps.dl=p.dl and ps.ne+p.ne<=max_data loop
      merge(p, ps);
      insert_disp(root, p, dg, hi);
      adjusted:= true;
      ps:= split_image(root, p, hi);
   end loop;
   if adjusted then
      while dg>0 and then equal_halves(root, dg) loop
         dg:= dg-1;
      end loop;
   end if;
end normal_remove;
The function split_image looks in the dispersion table for a pointer to the split image of a given page p. However, the pointer provided is actually the split image of p only if its local depth is equal to that of p; otherwise it means that it has been further split.
The page whose split image we look for is either the page from which we have deleted an element or the result of a previous merge operation. In both cases, we know the lower bits of the hash images of the keys that it contains. If hi is the hash image of the deleted element, it means that the page corresponds to all keys having the lower $d_l$ bits of their hash images equal to those of hi. Therefore, the split image is the page associated to keys having their hash images with the lower $d_l - 1$ bits equal to those in hi but the $d_l$-th bit opposite to that of hi. If that page has the same local depth and there is room enough to allocate the elements currently in both pages into a single one, the new page will have a local depth one less, as the common lower bits of the hash images of the keys contained in the page that results from the merge operation are one less than in the merged pages.
Consequently, the resulting algorithm is:
function split_image(root, p: in pnode; hi: in natural) return pnode is
   td: t_disp renames root.td;
   s: natural;
   hd: natural;
begin
   s:= 2**p.dl; hd:= hi mod s; s:= s/2;
   if hd>=s then hd:= hd-s; else hd:= hd+s; end if;
   return td(hd);
end split_image;
The operation merge just adds to p the elements in ps. The counter in ps is not adjusted, because that page will immediately be released.

procedure merge(p, ps: in pnode) is
   i: natural;
begin
   i:= ps.ne;
   while i>0 loop
      p.ne:= p.ne+1; p.te(p.ne):= ps.te(i);
      i:= i-1;
   end loop;
   p.dl:= p.dl-1;
end merge;
The function that checks if the two halves of the dispersion table are equal is straightforward too:

function equal_halves(root: in pnode; dg: in natural) return boolean is
   td: t_disp renames root.td;
   s: natural;   -- half the current size of the dispersion table
   i: natural;
begin
   s:= 2**(dg-1); i:= 0;
   while i<s and then td(i)=td(i+s) loop i:= i+1; end loop;
   return i=s;
end equal_halves;
The management of data pages
The operations to insert, remove, search and update elements in data pages are so simple that they are self-explanatory. The use of pragma inline would be highly recommendable for them.


procedure insert_data(p: in pnode; k: in key; x: in item) is
   ne: natural renames p.ne;
   te: t_elems renames p.te;
begin
   ne:= ne+1; te(ne):= (k, x);
end insert_data;

procedure search
   (p: in pnode; k: in key; found: out boolean; i: out natural) is
   ne: natural renames p.ne;
   te: t_elems renames p.te;
begin
   i:= 0; found:= false;
   while i<ne and not found loop
      i:= i+1;
      if te(i).k = k then found:= true; end if;
   end loop;
end search;

procedure get_data(p: in pnode; i: in natural; x: out item) is
   te: t_elems renames p.te;
begin
   x:= te(i).x;
end get_data;

procedure remove_data(p: in pnode; i: in natural) is
   ne: natural renames p.ne;
   te: t_elems renames p.te;
begin
   te(i):= te(ne); ne:= ne-1;
end remove_data;

The subprograms is_in and update_data are left as exercises to the reader.
Unbounded dispersion table
For large sets, the size of a disk page may become insufficient to keep the dispersion table. Therefore, as described in the overview, the dispersion table is organized as a hierarchy of dispersion pages. Then the higher order bits of a hash image serve to index a table that contains pointers to another page, which contains another table that is indexed by the lower bits of the hash image. Indeed, if required, such a hierarchy can expand to more than two levels.
This improvement requires no change in the type declaration, and the procedure empty is also the same.
Put (unbounded dispersion table)
The procedures put and normal_put are exactly the same as above. Only the subprograms get_data_page, extend and insert_disp must be reconsidered.
Regarding get_data_page, now the lower $d_g$ bits of the hash image have to be decomposed in bit groups, each group having as many bits as required to index the table that fits into a disk page. In our type declaration, this amount is indicated by the constant bdi_disp (for binary digits to index a table in a dispersion page). Indeed, the highest order of these groups may have less than bdi_disp bits and is used to index the root table, to get a pointer to a dispersion page that will be indexed by the second bit group, and so on. The next figure illustrates such a decomposition of a hash image:

[Figure: the lowest $d_g$ bits of the hash image hi are split into a first group, which indexes the table in the root page, followed by groups of bdi_disp bits, which index the tables in secondary dispersion pages.]

Initially, the number of steps to follow is computed. This number is equal to the number of groups that index tables in secondary pages. It is computed as (dg-1)/bdi_disp because, in the case that dg is a multiple of bdi_disp, the natural division dg/bdi_disp would account the bits for the root as bits for secondary pages, which would lead to one step too many. Indeed, the special case dg=0 (ie. the dispersion table contains only one pointer) is dealt with separately. This situation occurs only in a transitory intermediate state of the insertion of an element causing the dispersion table to be created.
After these considerations, the rest of the algorithm is a trivial decomposition of the hash image into the appropriate bit groups that serve to select dispersion pages until the pointer to the proper data page is obtained.
function get_data_page
   (root: in pnode; dg: in natural; hi: in natural) return pnode is
   p: pnode;
   dv, d, r, s, st: natural;
begin
   if dg=0 then st:= 0; else st:= (dg-1)/bdi_disp; end if;   -- number of steps
   s:= 2**dg; r:= hi mod s; dv:= max_disp**st; p:= root;
   for i in 1..st loop
      d:= r/dv; r:= r mod dv; dv:= dv/max_disp;
      p:= p.td(d);
   end loop;
   p:= p.td(r);
   return p;
end get_data_page;
In the simple version, the procedure extend just had to double the size of the dispersion table, by copying the pointers currently in the table to the new half. The same principle applies when the dispersion table is organized in several pages. Indeed, if the root is completely filled, the increase of the dispersion table forces expanding it to a sibling dispersion page and building a new root on top of both. Whether a new root has been created or not, the second half of the new table has to be filled with a copy of the first half. If these contents are pointers to data pages, those pointers have to be copied. However, if they are pointers to other dispersion pages, new dispersion pages have to be created, with a copy of their siblings. The procedure copy is in charge of doing it recursively. To distinguish the two cases, the procedure copy receives a parameter st (for steps) indicating its level in the tree of pages that implement the dispersion table. The resulting algorithm is:
procedure extend(root: in out pnode; dg: in out natural) is
   p: pnode;
   s, sp: natural;   -- current sizes of td and of the root page's table
   st: natural;      -- height of dispersion tree
begin
   st:= dg/bdi_disp;   -- number of steps
   if dg>0 and dg mod bdi_disp = 0 then   -- disp. tree gets one level higher
      p:= new node(disp_node); p.td(0):= root; root:= p;
   end if;
   s:= 2**dg; sp:= s/(max_disp**st);
   for i in 0..sp-1 loop
      copy(root.td(i), st, p);
      root.td(sp+i):= p;
   end loop;
   dg:= dg+1;
end extend;

where
procedure copy(p: in pnode; st: in natural; cp: out pnode) is
   q: pnode;
begin
   if st=0 then   -- the pointed node is a data node
      cp:= p;
   else
      cp:= new node(disp_node);
      for i in 0..max_disp-1 loop
         copy(p.td(i), st-1, q);
         cp.td(i):= q;
      end loop;
   end if;
end copy;
The logical positions where a pointer to a data page has to be inserted in the dispersion table are the same as for the simple case. However, the conversion of a logical position into a physical one is more complex, and it is achieved by the procedure insert_disp_single.

procedure insert_disp(root, ps: in pnode; dg: in natural; hs: in natural) is
   s:   natural;   -- length of step
   hps: natural;   -- positions to assign ps
   nr:  natural;   -- number of pointer assignments
begin
   s:= 2**ps.dl; hps:= hs mod s; nr:= 2**(dg-ps.dl);
   for i in 1..nr loop
      insert_disp_single(root, ps, dg, hps);
      hps:= hps+s;
   end loop;
end insert_disp;


The conversion of the logical position indicated by hps into the proper component of the proper dispersion page is achieved by the same decomposition algorithm used by get_data_page, discussed above. Indeed, in this case the pointer is set into the table instead of retrieved from it:
procedure insert_disp_single
   (root, ps: in pnode; dg: in natural; hps: in natural) is
   p: pnode;
   dv, d, r, s, st: natural;
begin
   s:= 2**dg;
   if dg=0 then st:= 0; else st:= (dg-1)/bdi_disp; end if;   -- number of steps
   r:= hps mod s; dv:= max_disp**st; p:= root;
   for i in 1..st loop
      d:= r/dv; r:= r mod dv; dv:= dv/max_disp;
      p:= p.td(d);
   end loop;
   p.td(r):= ps;
end insert_disp_single;

Get and update (unbounded dispersion table)

These procedures are exactly the same as in the bounded case, except that normal_get and normal_update invoke the new version of the function get_data_page.
Remove (unbounded dispersion table)
The differences of the algorithm for deleting an element with respect to the bounded case are very few. The only difference in the procedure remove itself is that the statement that eliminates the dispersion table after the call to normal_remove (if on return $d_g = 0$) is no longer required, because such a shrink in the height of the dispersion tree is now part of the general case:
procedure remove(s: in out set; k: in key) is
   root: pnode renames s.root;
   dg: natural renames s.dg;   -- global depth
   found: boolean;
   i: natural;
begin
   if root=null then
      raise does_not_exist;
   elsif root.t=data_node then
      search(root, k, found, i);
      if not found then raise does_not_exist; end if;
      remove_data(root, i);
      if root.ne=0 then root:= null; end if;
   else   -- root.t=disp_node
      normal_remove(s, k);
   end if;
end remove;


Indeed, the operation that shrinks the height of the page tree that implements the dispersion table is now included in the procedure normal_remove. The decrease in height occurs when the decrease of the global depth leaves the size of the table at the root node equal to one; in other words, whenever $d_g \bmod bdi\_disp = 1$ at the moment of reducing the table by one half of its size:
procedure normal_remove(s: in out set; k: in key) is
   root: pnode renames s.root;
   dg: natural renames s.dg;   -- global depth
   p, ps: pnode;    -- data page and its split image
   hi: natural;     -- hash image of k
   adjusted: boolean;
   found: boolean;
   i: natural;
begin
   hi:= hash(k);
   p:= get_data_page(root, dg, hi);
   search(p, k, found, i);
   if not found then raise does_not_exist; end if;
   remove_data(p, i);

   adjusted:= false;
   ps:= split_image(root, p, dg, hi);
   while p.dl>=1 and ps.dl=p.dl and ps.ne+p.ne<=max_data loop
      merge(p, ps);
      insert_disp(root, p, dg, hi);
      adjusted:= true;
      ps:= split_image(root, p, dg, hi);
   end loop;
   if adjusted then
      while dg>0 and then equal_halves(root, dg) loop
         if dg mod bdi_disp = 1 then root:= root.td(0); end if;
         dg:= dg-1;
      end loop;
   end if;
end normal_remove;

Besides these small differences, and the fact that the invocations of the subprograms get_data_page and insert_disp correspond to the new versions discussed for the unbounded version of put, the subprograms to get the split image of a page and to check if the two halves of the dispersion table are equal have some differences too.
The procedure to get the split image of a page is essentially the same as for the bounded case. However, in the unbounded case, the bits that correspond to the split image (hd) can not be used directly as an index to the dispersion table. Instead, these bits indicate a logical position, and its conversion into a proper position of a proper dispersion page has to be made through the function get_data_page. As this latter function requires the global depth as an input parameter, the function to get the split image requires this input parameter as well. Consequently, the algorithm that results for the function split_image is:


function split_image
   (root, p: in pnode; dg: in natural; hi: in natural) return pnode is
   s: natural;
   hd: natural;
   ps: pnode;
begin
   s:= 2**p.dl; hd:= hi mod s; s:= s/2;
   if hd>=s then hd:= hd-s; else hd:= hd+s; end if;
   ps:= get_data_page(root, dg, hd);
   return ps;
end split_image;
To check if the two halves of the dispersion table are equal, the two halves of the table in the root page must be compared. Two components that point to a data page are equal if they point to the same page, but two components that point to a dispersion page are equal if all the components of the pointed pages are themselves equal. The function equal is in charge of checking it recursively. It receives a parameter indicating the level in the dispersion tree, to distinguish between the two cases, like the procedure copy discussed above. Consequently:
function equal_halves(root: in pnode; dg: in natural) return boolean is
   td: t_disp renames root.td;
   dr: natural;   -- binary digits to index the root's dispersion table
   st: natural;   -- number of steps
   s: natural;    -- half the current size of the root's dispersion table
   i: natural;
begin
   st:= dg/bdi_disp; dr:= dg mod bdi_disp;
   if dr = 0 then st:= st-1; dr:= bdi_disp; end if;
   s:= 2**(dr-1); i:= 0;
   while i<s and then equal(td(i), td(i+s), st) loop i:= i+1; end loop;
   return i=s;
end equal_halves;
where

function equal(p, ps: in pnode; st: in natural) return boolean is
   eq: boolean;
   i: natural;
begin
   if st=0 then
      eq:= (p=ps);
   else
      i:= 0;
      while i<max_disp and then equal(p.td(i), ps.td(i), st-1) loop
         i:= i+1;
      end loop;
      eq:= (i=max_disp);
   end if;
   return eq;
end equal;


Cost analysis of extendible hashing

For the bounded case, it is obvious that the operation empty is O(1) and that the operation to get an element requires only two disk accesses: one to get the dispersion table and another to get the data. The operation to update requires an additional access to write the page back to the disk. Finally, the operations to put and remove require at most five disk accesses, because the involved pages are at most three (the dispersion table and the two split images), of which two must be read first and later written back. So we may consider that any of these operations is O(1).
For an unbounded dispersion table, the pages involved in a retrieval operation are the data page plus the pages in the path from the root to that data page. The height of the dispersion tree grows with the logarithm of the number of data pages, the base of this logarithm being the number of pointers that fit into a disk page. However, this number is so large that it makes the tree flat enough to consider its height to be bounded by a constant in practice.
For instance, for the same sizes considered for the example of B+ trees (100,000 elements to be stored, disk pages of 4KB, pointers of 4B and a maximum of 52 data elements per disk page), the dispersion table would consist of a root node containing 4 pointers and a single level of secondary dispersion pages. Therefore, a retrieval operation would involve three disk pages. This height would be increased by one only if the number of elements to be stored were greater than $26\cdot10^6$.

4.11.6 Hash functions

The concept of a good hash function

The impressive efficiency of hashing techniques for implementing sets critically depends on the quality of the hash function, ie. its ability to distribute the keys uniformly over the image range. If keys are drawn independently from a set of possible values U and P(k) is the probability that key k is selected, the condition of uniform distribution can be expressed as

$$\forall i: 0 \le i < b: \sum_{k \,:\, h(k)=i} P(k) = \frac{1}{b}$$

However, this statement is somewhat metaphysical, because the probabilities P(k) are usually unknown. Actually, for any hash function h, even the most sophisticated, there is always the possibility to maliciously select a set T of at least |U|/b keys that satisfy $\forall k_1, k_2 \in T: h(k_1) = h(k_2)$.
Even though this fact might be discouraging, in practice there are some techniques that have proven to perform well after having been applied to large amounts of realistic data. This section discusses first two techniques that might appear to be satisfactory but are not, to show their subtle pitfalls. Then a better algorithm is provided.

Simple additive hashing

The example of a hash function that was used to introduce the general principles of hashing techniques in section 4.11.1 consisted in adding the bytes that represent the key and then applying the modulus to get a result in the proper range:


function hash(k: in string; b: in positive) return natural is
   s, p: natural;
begin
   s:= 0;
   for i in k'range loop
      p:= character'pos(k(i));
      s:= (s+p) mod b;
   end loop;
   return s;
end hash;
This function has a drawback. Let us assume that this hash function is used in the implementation of the names table of a compiler, which converts the names found into consecutive naturals (see exercise 2 in section 7.3). The names that programmers generate tend to have some similarities if their meanings are related. So, let us suppose that a program that computes the intensities in different sections of a circuit uses the identifiers I0, I1, ..., I99. It is clear that the function above will hash the keys I19, I28, I37, ..., I91 to the same slot, because the sum of the character codes is the same in each string. Actually, these 100 identifiers will be mapped to only 27 different hash images, and in certain cases up to 9 different keys hash to the same slot. It is clear that such a distribution is far from being uniform.
Weighted additive hashing
A way to avoid this drawback is to assign a different weight to each byte to be added. The most natural weighting consists in considering the full binary representation of the key as a natural number, and so $h(k) = k \bmod b$. According to this principle, the new hash function becomes:
function hash(s: in string; b: in positive) return natural is
   m: constant natural:= character'pos(character'last)+1;
   p, h: natural;
begin
   h:= 0;
   for i in s'range loop
      p:= character'pos(s(i));
      h:= (h*m+p) mod b;
   end loop;
   return h;
end hash;

If the standard 8-bit character set is used, the value of the constant m will be 256. However, if an extended character set is used instead, this base could be larger than 256.
This approach is, in general, better than simple addition. Nevertheless, if b is a power of 2 it turns much worse, because then the hash image depends only on the lowest binary digits of the original key. In particular, if b = 256, all the names ending with the same letter would hash to the same slot.
Therefore, if such an approach is used, it is most recommendable that b is a prime number, so that the modulus forces an effective division and the result depends on all the bits of the key. In any case, values of b that are a power of two should be completely discarded.
Quadratic hashing

A better approach to achieve that the hash image of a key depends on all the
bits representing the key is to square it and take the middle digits. In other
words: h(k) = ⌊k²/c⌋ mod b, where c is an integer such that b·c² ≈ k².
The implementation of the square is the same algorithm that schoolchildren use
to multiply, but using base 256 instead of 10. The figure below illustrates the
distribution of the digits. The underbraced digits indicate the digits selected to
apply the modulus b.

lk

r-----Jkl----+-j

====
x==r--,
==
==
==
==
==
k2========
k
k

~r

JcJ

:I

The algorithm below implements such a product and selection of digits:


function hash(s: in string; b: in positive) return natural is
   n: constant natural:= s'last;
   m: constant natural:= character'pos(character'last)+1;
   c: natural;
   k, l: integer;
   a: array(1..n)   of natural;
   r: array(1..2*n) of natural;
begin
   for i in s'range loop a(i):= character'pos(s(i)); end loop;
   for i in 1..2*n loop r(i):= 0; end loop;
   k:= 2*n+1;
   for i in reverse 1..n loop
      k:= k-1; l:= k; c:= 0;
      for j in reverse 1..n loop
         c:= c + r(l) + a(i)*a(j);
         r(l):= c mod m; c:= c/m;
         l:= l-1;
      end loop;
      r(l):= c;
   end loop;
   c:= (r(n)*m + r(n+1)) mod b;
   return c;
end hash;


Multiplicative hashing

Instead of multiplying the key by itself, there is the possibility of multiplying it
by a fixed weighting factor. In [14] it is suggested to use as weighting factor
as many significant digits as the key has from (√5 − 1)/2 = 0.6180339887...,
because those digits have the same properties as randomly generated numbers.
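The sketch below illustrates this idea in its usual floating point form. It is our own illustration, not the book's code, and it assumes the key has already been condensed into a natural number k, for instance by the weighted additive scheme above without the final modulus:

function hash(k: in natural; b: in positive) return natural is
   a: constant:= 0.6180339887;        -- (sqrt(5)-1)/2, the suggested factor
   f: float;
begin
   f:= float(k)*a;                    -- weight the key
   f:= f - float'floor(f);            -- keep the fractional part only
   return natural(float(b)*f) mod b;  -- scale the fraction to 0..b-1
end hash;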
Hash facilities in the Ada language

Hashing techniques are so important that the predefined language environment
supplies a hash function. The type Hash_Type, to represent hash images, is
declared in the package Ada.Containers as a modular type (ie. an unsigned
integer with modular arithmetic) with a range that is implementation defined,
usually depending on the facilities provided by the underlying hardware, as
for integer and floating point arithmetic.
The function hash is declared as

with Ada.Containers;
function Ada.Strings.Hash(Key: String) return Containers.Hash_Type;

The result can be constrained to the proper range by applying the convenient
modulus to the provided result.
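For instance, a possible way to reduce the predefined hash to b slots (the function name slot and the value of b below are ours, only for illustration):

with Ada.Containers;
with Ada.Strings.Hash;
function slot(key: string) return natural is
   b: constant:= 1009;  -- number of slots; a prime, as recommended above
   use type Ada.Containers.Hash_Type;
begin
   return natural(Ada.Strings.Hash(key) mod b);
end slot;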

4.11.7  Constraints on keys

In the case that the keys are not strings of characters, it is usual to take
the set of bytes of their binary representation as strings. For fixed length
types, this goal can be achieved using the appropriate instance of the generic
function Ada.Unchecked_Conversion, which is declared as:
generic
   type Source(<>) is limited private;
   type Target(<>) is limited private;
function Ada.Unchecked_Conversion(S: Source) return Target;

where the symbols (<>) indicate that unconstrained types are allowed as actual
parameters for the instantiation.
For Source, we shall use the key type to which a hash function has to be
applied, and for Target we shall use a string type having the same size in bytes
as our key type. The instance of this generic function will then return a string
having the same bytes as the key provided as parameter. Such a string can
be passed as a parameter to any of the hash functions described above.
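As an illustration, a possible instantiation for a hypothetical fixed-length key type t_point (all the names below are ours):

with Ada.Unchecked_Conversion;
...
type t_point is record x, y: integer; end record;
subtype point_bytes is string(1 .. t_point'size/character'size);
function to_bytes is new Ada.Unchecked_Conversion
   (Source => t_point, Target => point_bytes);
-- for a value p of type t_point, hash(to_bytes(p), b) applies any of
-- the previous hash functions to the bytes of its representation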
However, we can apply hashing only to types having a unique binary representation. Let us assume, for example, that we have a set S whose elements
are sets in turn, ie. x ∈ S means that x ⊂ A for some A. If the subsets of
A are represented by an array of boolean components, the hash technique can
be applied to implement S, because two arrays with different component values
represent different subsets of A. Instead, if the subsets of A are represented by
arrays of elements, the subset {a, b, c} can be represented in several ways, for
instance the two in the figure below:

[Figure: two array representations of the subset {a, b, c}, both holding 3 elements: (a, b, c) and (c, a, b).]

It is clear that a hash function will map these two representations of the
same set to different slots, making the technique unusable. Even if the sets
are ordered, there still exists the difficulty of the non-significant trailing components. In these cases, specifically tailored algorithms have to be applied to
discard the non-relevant components.

4.12  Iterators

4.12.1  Introduction

In addition to the operations studied so far, some applications require making
traversals over the elements currently contained in the set. In other words,
some algorithms that use sets should be of the form that conceptually would be
described as

for x ∈ A loop
end loop;

An iterator is a data type that allows such a retrieval, keeping the algorithm
that uses the sets independent of the technique used to implement them. Such
a type works in a way similar to file types, ie. it acts as a kind of index into a
larger structure and has operations to point to the first element in the set, to
advance to the next element, to check if no more elements exist and to get the
pointed element.
If the traversal of the elements in the set is required, the type set must be
extended with the following declarations:
generic
   -- As it corresponds to each kind of implementation
package d_set is
   -- Exactly as in section 4.2 (mappings).
   type iterator is private;
   bad_use: exception;
   procedure first(s: in set; it: out iterator);
   procedure next(s: in set; it: in out iterator);
   function is_valid(it: in iterator) return boolean;
   procedure get(s: in set; it: in iterator; k: out key; x: out item);
private
   -- As it corresponds to each kind of implementation,
   -- plus the corresponding complete declaration for iterator
end d_set;


If the parameter it refers to the last element of the set, on return from next it
has a non valid value. Likewise, if the set is empty, the procedure first returns a
non valid value of iterator too. The procedures next and get raise the exception
bad_use if the actual parameter for it has a non valid value.
With these operations available, the traversal of the set becomes
procedure traversal(s: in set) is
   k: key; x: item;
   it: iterator;
begin
   first(s, it);
   while is_valid(it) loop
      get(s, it, k, x);
      -- process k and x here
      next(s, it);
   end loop;
end traversal;
and the search for an element in the set satisfying some property P is
procedure search(s: in set) is
   k: key; x: item; found: boolean;
   it: iterator;
begin
   first(s, it); found:= false;
   while is_valid(it) and not found loop
      get(s, it, k, x);
      if P(k, x) then found:= true; else next(s, it); end if;
   end loop;
   if found then success(k, x); else failure; end if;
end search;
For keys having a total order defined, the retrieval of the keys should be in
ascending order inasmuch as possible. As we shall see, most of the implementations make such an ordered retrieval possible. Ordered retrieval has several
kinds of advantages. In particular, the implementation of the operations of
union, intersection and difference, discussed in section 4.13, can be more efficient.
If pure sets are required, instead of mappings, the declarations are exactly
the same, except for the fact that the procedure get does not have the out parameter
of type item; the details are left as an exercise for the reader, and the affected
profile is sketched below.
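-- for pure sets, get simply loses the item parameter (our sketch of
-- the only change in the specification):
procedure get(s: in set; it: in iterator; k: out key);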
The rest of the section addresses the way the iterators have to be implemented according to the way the sets are.

4.12.2  Arrays indexed by keys

For set implementations based on keys that belong to a discrete type that is used
to index an array (see sections 4.3.1 and 4.3.2), the iterators can be implemented
straightforwardly by the key itself. If that key type is provided as a generic parameter,
it is not possible to extend it with an extra value to identify a non valid iterator.
Therefore, under such a restriction, a boolean field has to be added to indicate
the iterator's validity. Consequently, the type declaration should be:

type iterator is
   record
      k:     key;
      valid: boolean;
   end record;

The operations first and next have to provide the lowest and the next keys
recorded respectively, if any. The function is_valid just has to check the corresponding field in the parameter it. Retrieving the data pointed by the iterator
is immediate too. These simple algorithms are:
procedure first(s: in set; it: out iterator) is
   e:     existence renames s.e;
   k:     key       renames it.k;
   valid: boolean   renames it.valid;
begin
   k:= key'first;
   while not e(k) and k<key'last loop
      k:= key'succ(k);
   end loop;
   valid:= e(k);
end first;
procedure next(s: in set; it: in out iterator) is
   e:     existence renames s.e;
   k:     key       renames it.k;
   valid: boolean   renames it.valid;
begin
   if not valid then raise bad_use; end if;
   if k<key'last then
      k:= key'succ(k);
      while not e(k) and k<key'last loop
         k:= key'succ(k);
      end loop;
      valid:= e(k);
   else
      valid:= false;
   end if;
end next;

function is_valid(it: in iterator) return boolean is
begin
   return it.valid;
end is_valid;
procedure get(s: in set; it: in iterator; k: out key; x: out item) is
   c:     contents renames s.c;
   valid: boolean  renames it.valid;
begin
   if not valid then raise bad_use; end if;
   k:= it.k; x:= c(k);
end get;

4.12.3  Arrays of elements

For sets implemented by arrays of elements (see section 4.4), the iterator can
be implemented as an index to the array. The special value 0 can be used to
represent a non valid iterator. The algorithms are so simple that they are left
as an exercise to the reader (a possible sketch follows). Indeed, the elements
will be retrieved in ascending order only if the keys have an order relationship
defined and the set is represented by a sorted array.
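A possible sketch, assuming the set is a record with a counter n and an array a(1..max) of cells with fields k and x (these field names are ours):

type iterator is
   record
      i: natural;  -- 0 denotes a non valid iterator
   end record;

procedure first(s: in set; it: out iterator) is
begin
   if s.n=0 then it.i:= 0; else it.i:= 1; end if;
end first;

procedure next(s: in set; it: in out iterator) is
begin
   if it.i=0 then raise bad_use; end if;
   if it.i<s.n then it.i:= it.i+1; else it.i:= 0; end if;
end next;

function is_valid(it: in iterator) return boolean is
begin
   return it.i/=0;
end is_valid;

procedure get(s: in set; it: in iterator; k: out key; x: out item) is
begin
   if it.i=0 then raise bad_use; end if;
   k:= s.a(it.i).k; x:= s.a(it.i).x;
end get;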

4.12.4  Linked lists

If the sets are implemented by linked lists, the iterator can be just a pointer
to the current element, if any. The related algorithms are straightforward and
are reproduced below. Indeed, the elements will be retrieved in ascending order
only if the keys have an order relationship defined and the set is represented by
a sorted list.
type iterator is
   record
      p: pcell;
   end record;

procedure first(s: in set; it: out iterator) is
   first: pcell renames s.first;
   p:     pcell renames it.p;
begin
   p:= first;
end first;
procedure next(s: in set; it: in out iterator) is
   p: pcell renames it.p;
begin
   p:= p.next;
exception
   when constraint_error => raise bad_use;
end next;

function is_valid(it: in iterator) return boolean is
   p: pcell renames it.p;
begin
   return p/=null;
end is_valid;
procedure get(s: in set; it: in iterator; k: out key; x: out item) is
   p: pcell renames it.p;
begin
   k:= p.k; x:= p.x;
exception
   when constraint_error => raise bad_use;
end get;

4.12.5  Binary search trees: stack-based iterators

If a set is implemented as a binary search tree, the ordered retrieval of its elements corresponds to the inorder traversal of the tree. However, the recursive
approach to such a traversal is inappropriate for two reasons. First, we want
to keep the external view of the set independent from any particular implementation, so the external interface must remain unchanged with respect to any
other implementation. Second, some applications on sets require a coordinated
advance over several sets. For instance, the algorithms for computing the union
or the intersection of two sets can be best achieved by merging algorithms on
sorted sequences, as those studied for sequential files in sections 8.5 and 8.6
of vol. II and section 5.3 of vol. III. Such coordinated advancement over two
different sets can not be achieved on the basis of the recursive algorithm.
Therefore, the information provided by the iterator must be equivalent to
the information passed from one step to the next in the iterative version of the
inorder traversal of the tree. Binary search trees only have pointers from a parent to its children, so the operations parent and is_left can not be implemented.
Therefore, the information passed from one step to the next is precisely the stack
used in the third version of the iterative traversal described in section 3.7.5 of
vol. III, and the resulting type declaration is:

with dstack;
generic
   -- Exactly as in section 4.6.
package d_set is
   -- Exactly as in section 4.12.1.
private
   -- Exactly as in section 4.6.
   package dnodestack is new dstack(pnode);
   use dnodestack;
   type iterator is
      record
         st: stack;
      end record;
end d_set;

The body of the procedure first corresponds to the preparation phase previous to the main loop of the third version of the iterative traversal described
in section 3.7.5 of vol. III. If the tree is empty, the stack is left empty as well.
Otherwise, the first element to be extracted is looked for, ie. the lowest one. As
this element is in the leftmost node of the tree, this is achieved by traversing
its left spine. At each step, the parent node is pushed onto the stack, because that
node is the next one to be visited after the complete traversal of its left child.
Finally, the leftmost child is pushed onto the stack as well, because the algorithm
expects to find the element to be retrieved on top of the stack. Consequently,
the algorithm below follows:

procedure first(s: in set; it: out iterator) is
   root: pnode renames s.root;
   st:   stack renames it.st;
   p: pnode;
begin
   empty(st);
   if root/=null then
      p:= root;
      while p.lc/=null loop
         push(st, p); p:= p.lc;
      end loop;
      push(st, p);
   end if;
end first;

The procedure next must obtain the next element to be retrieved. So, first
the current element is popped from the stack. If the popped element has a right
child, the next element to be visited is the lowest one (ie. that in the leftmost
node) in its right subtree. This node is computed as in first, but starting at
the root of the right subtree of the popped element. Otherwise, ie. if the current
element has no right child, the next element to be visited is already on top of
the stack. Indeed, if the iterator is not valid, ie. the stack is empty, the attempt
to get its top will cause a dnodestack.bad_use exception to be raised. Such an
exception must be reconverted into d_set.bad_use.
procedure next(s: in set; it: in out iterator) is
   st: stack renames it.st;
   p: pnode;
begin
   p:= top(st); pop(st);
   if p.rc/=null then
      p:= p.rc;
      while p.lc/=null loop
         push(st, p); p:= p.lc;
      end loop;
      push(st, p);
   end if;
exception
   when dnodestack.bad_use => raise d_set.bad_use;
end next;
Checking the validity of the iterator is as simple as

function is_valid(it: in iterator) return boolean is
   st: stack renames it.st;
begin
   return not is_empty(st);
end is_valid;
The retrieval of the element pointed by the iterator is simple as well:

procedure get(s: in set; it: in iterator; k: out key; x: out item) is
   st: stack renames it.st;
   p: pnode;
begin
   p:= top(st);
   k:= p.k; x:= p.x;
exception
   when dnodestack.bad_use => raise d_set.bad_use;
end get;

4.12.6  Binary search trees: thread-based iterators

Overview
The complexity of the iterators for binary search trees can be reduced by
improving the representation of binary search trees. In a binary tree, each node
contains two pointers and is pointed by a single one. Therefore, exactly one half
of the pointers in a non empty binary search tree are null. This is somewhat
wasteful, to the extent that a single bit is sufficient to tell if a pointer is null.
A better way to use this space of memory is to keep a boolean (in principle,
a single bit) to tell if a pointer is null (ie. it does not address a child) and,
if so, use the pointer field for the right child to indicate the next node to be
visited in the inorder traversal of the tree. Likewise, the pointer field for the
left child can be used to point to the previous one. These latter pointers are
called threads. Threads allow the iterator type to be simply a pointer to a node.
For a node p, if p.rc is effectively a pointer to a right child, then the next node
to be visited is the leftmost one in this right child, which can be found by a
simple algorithm. Otherwise p.rc is a thread that links directly to the node to
be visited immediately after p.
Threaded trees can be built either with cursors or pointers, although the use
of cursors makes it easier to pack the pointer and the boolean, because the sign
bit can be used to make the distinction. In this case, pointers to children are
represented by positive values, whereas non positive values represent threads.
The type declaration

Binary search trees can be declared as:

generic
   -- Exactly as in section 4.6.
package d_set is
   -- Exactly as in section 4.12.1.
private
   max: constant integer:= 1000;
   type index is new integer range -max..max;
   type set      is record root: index:= 0; end record;
   type iterator is record p:    index;     end record;
end d_set;

As usual in cursor-based implementations, the declarations of the cells, the
memory space and the list of free cells are local to the body of the package:

package body d_set is
   type block is
      record
         k: key;
         x: item;
         lc, rc: index;  -- pointers to left and right children (if > 0)
      end record;
   type memory_space is array(index range 1..index'last) of block;
   ms:   memory_space;
   free: index;

   -- Bodies of the operations

begin
   prep_mem_space;
end d_set;

The management of memory space

The procedures for the management of memory space are rather simple, and
quite similar to those used in the case of stacks and queues (see sections 2.2.4
and 2.4.4):

procedure prep_mem_space is
begin
   for i in index range 1..index'last-1 loop ms(i).rc:= i+1; end loop;
   ms(index'last).rc:= 0;
   free:= 1;
end prep_mem_space;

function get_block return index is
   p: index;
begin
   if free=0 then raise space_overflow; end if;
   p:= free; free:= ms(free).rc;
   return p;
end get_block;
procedure release_block(p: in index) is
begin
   ms(p).rc:= free; free:= p;
end release_block;

procedure release_tree(p: in index) is
begin  -- Precondition: p>0
   if ms(p).lc > 0 then release_tree(ms(p).lc); end if;
   if ms(p).rc > 0 then release_tree(ms(p).rc); end if;
   release_block(p);
end release_tree;

The empty set

The empty set is represented by an empty tree:

procedure empty(s: out set) is
   root: index renames s.root;
begin
   if root>0 then release_tree(root); end if;
   root:= 0;
end empty;
The operation put

The operation to add a new element to the set requires a more accurate discussion. The basic principle is the same as the insertion operation for binary
search trees (see section 4.6). However, now the proper cursors to the previous
and next elements in the inorder traversal of the tree must be placed (changed
in sign) instead of null pointers. For this reason, the procedure put calls an
internally defined procedure put, which has two additional parameters, previn
and nextin, that are, respectively, the pointers to the nodes that have to be
visited immediately before and immediately after the subtree that has as root
the node indicated by the first parameter:

procedure put(s: in out set; k: in key; x: in item) is
   root: index renames s.root;
begin
   put(root, k, x, 0, 0);
end put;

where

procedure put
   (p: in out index; k: in key; x: in item; previn, nextin: in index) is
begin
   if p <= 0 then
      p:= get_block;
      ms(p):= (k, x, -previn, -nextin);
   else
      if    k < ms(p).k then put(ms(p).lc, k, x, previn, p);
      elsif k > ms(p).k then put(ms(p).rc, k, x, p, nextin);
      else  raise already_exists;
      end if;
   end if;
end put;
The operations get and update

These operations follow the same principles as for binary search trees; the only
novelty in this case is that cursors are used instead of pointers. As usual, the
only difference between get and update is that the assignment in update is
ms(p).x:= x instead of x:= ms(p).x, and therefore only the algorithm for get is
reproduced here.

procedure get(s: in set; k: in key; x: out item) is
   root: index renames s.root;
begin
   get(root, k, x);
end get;

where

procedure get(p: in index; k: in key; x: out item) is
begin
   if p <= 0 then  -- a thread or the empty tree: the key is absent
      raise does_not_exist;
   else
      if    k < ms(p).k then get(ms(p).lc, k, x);
      elsif k > ms(p).k then get(ms(p).rc, k, x);
      else  x:= ms(p).x;
      end if;
   end if;
end get;
The operation remove

The removal of an element follows essentially the same steps as for normal binary
search trees: first the element to be removed has to be found, and then the actual
removal proceeds with the same subcases: no child, one child, two children.
However, some additional care is required to keep the threads to the
next and previous nodes in the inorder traversal of the tree. As in the case of
insertion, the main procedure remove defers its job to an auxiliary procedure
remove that has two additional parameters, namely the pointers to the previous
and next nodes in the inorder traversal of the tree that has as root the node
pointed by the first parameter:

procedure remove(s: in out set; k: in key) is
   root: index renames s.root;
begin
   remove(root, k, 0, 0);
end remove;
where

procedure remove(p: in out index; k: in key; previn, nextin: in index) is
begin
   if p <= 0 then
      raise does_not_exist;
   else
      if    k < ms(p).k then remove(ms(p).lc, k, previn, p);
      elsif k > ms(p).k then remove(ms(p).rc, k, p, nextin);
      else  actual_remove(p, previn, nextin);
      end if;
   end if;
end remove;

As the explicit release of the removed block is mandatory, a local pointer q
is initially addressed to the node to be removed, so that the node can be finally
released at the end of the procedure, no matter how the parameter p is updated
meanwhile. Then the different cases require an accurate discussion.

Case 1: the node to be removed has no child

In non threaded trees, the removal in this case is achieved by p:= null; and,
the mode of parameter p being in out, this null value will be placed in either
the lc field or the rc field of p's parent. However, in this case the situation
is slightly more complex, because now the value to be assigned is a pointer to
either the previous or the next node to be visited in the inorder traversal of the
tree, and this value is different if p is a right or a left child. Let us consider the
two subcases individually.
Case 1.a: the node to be removed is a left child

The figure below, in which the threads are indicated by dotted lines, illustrates
the situation:

[Figure: p is the left child of nextin; the left-child field of p holds a thread to the node visited immediately before p.]
The way to recognize that p is a left child is to check if it is actually the
left child of nextin. If so, the field for the left child of nextin, ie. p, has to be
replaced by the node that has to be visited immediately before it, which is
pointed by the thread in the field for the left child of p.

Case 1.b: the node to be removed is a right child

This case is symmetrical to the previous one, as the figure below shows:

[Figure: p as a right child; the right-child field of p holds a thread to the node visited immediately after p.]

Now, p is the field for the right child of its parent, and this field has to
be replaced by the node that has to be visited immediately after it, which is
pointed by the thread in the field for the right child of p.

Case 2: the node to be removed has only a left child

In non threaded trees, the only child of p replaces p either as left or right child
of p's parent. In the threaded case we shall do the same, plus an additional
adjustment of the threads, because hitherto the rightmost node in p's left child
has a thread pointing to p, as the figure below illustrates:

[Figure: the rightmost node of p's left subtree holds a thread pointing to p.]

As the node pointed by p is going to be removed, the thread in the field for
the right child of the rightmost node in p's child has to be updated with the
value nextin.

Case 3: the node to be removed has only a right child

This case is symmetrical to the previous one, and now it is the thread in the
field for the left child of the leftmost node in p's child that has to be updated
to previn:

[Figure: the leftmost node of p's right subtree holds a thread pointing to p.]

Case 4: the node to be removed has both left and right children

In non threaded trees, this removal is achieved by replacing the node to be removed
by the leftmost node of its right subtree. For threaded trees we shall do the same,
plus the adjustment of the threads. The figure below illustrates the situation:

[Figure: plowest, the leftmost node of p's right subtree, replaces p; the threads of the rightmost node of p's left subtree and of the leftmost node of p's right subtree must point to plowest.]

As the node pointed by p is to be replaced by plowest, the rightmost node
in the left subtree of p and the leftmost node in the right subtree of p have to
be updated so that their threads to their next and previous nodes, respectively,
point now to plowest.
The algorithm that results from these considerations is:
procedure actual_remove(p: in out index; previn, nextin: in index) is
   q, plowest: index;
begin
   q:= p;
   if ms(p).lc <= 0 and ms(p).rc <= 0 then     -- p has no child
      if nextin>0 and then p=ms(nextin).lc then  -- is_left(p)
         p:= ms(p).lc;
      else                                       -- is_right(p) or p is the root
         p:= ms(p).rc;
      end if;
   elsif ms(p).lc > 0 and ms(p).rc <= 0 then   -- p has only left child
      p:= ms(q).lc;
      set_next(p, nextin);
   elsif ms(p).rc > 0 and ms(p).lc <= 0 then   -- p has only right child
      p:= ms(q).rc;
      set_prev(p, previn);
   else                                        -- p has both left and right children
      rem_lowest(ms(q).rc, plowest);
      ms(plowest).lc:= ms(q).lc;
      ms(plowest).rc:= ms(q).rc;
      p:= plowest;
      set_next(ms(plowest).lc, plowest);
      set_prev(ms(plowest).rc, plowest);
   end if;
   release_block(q);
end actual_remove;

where

procedure rem_lowest(p: in out index; plowest: out index) is
begin  -- Precondition: p>0
   if ms(p).lc > 0 then rem_lowest(ms(p).lc, plowest);
   else plowest:= p; p:= ms(p).rc;
   end if;
end rem_lowest;

procedure set_prev(p, prev: in index) is
   t: index;
begin
   if p>0 then
      t:= p;
      while ms(t).lc>0 loop t:= ms(t).lc; end loop;
      ms(t).lc:= -prev;
   end if;
end set_prev;

procedure set_next(p, next: in index) is
   t: index;
begin
   if p>0 then
      t:= p;
      while ms(t).rc>0 loop t:= ms(t).rc; end loop;
      ms(t).rc:= -next;
   end if;
end set_next;
The operations for iteration

The threads make the implementation of the type iterator and its related operations quite simple. As stated earlier, the iterator can be implemented by a
single index. The operation first has to make it point to the first element to be
visited, which is in the leftmost node of the tree, if there is any:

procedure first(s: in set; it: out iterator) is
   root: index renames s.root;
   p:    index renames it.p;
begin
   p:= root;
   if p>0 then
      while ms(p).lc > 0 loop p:= ms(p).lc; end loop;
   end if;
end first;

Making the iterator point to the next node to be visited in the inorder
traversal of the tree is easy too. If the current node has a right child, then the
next element to be retrieved is in the leftmost node of that right child. Otherwise,
the thread found in the field for its right child points to the next node to be
visited directly. Indeed, if this thread is 0, the node pointed by the iterator is
the last one and has no successor.

Consequently, the algorithm becomes:

procedure next(s: in set; it: in out iterator) is
   p: index renames it.p;
begin
   if p=0 then raise bad_use; end if;
   if ms(p).rc > 0 then
      p:= ms(p).rc;
      while ms(p).lc > 0 loop p:= ms(p).lc; end loop;
   else
      p:= -ms(p).rc;
   end if;
end next;

Checking the validity of the iterator is immediate:

function is_valid(it: in iterator) return boolean is
   p: index renames it.p;
begin
   return p>0;
end is_valid;

And retrieving the data associated to the iterator is immediate as well:

procedure get(s: in set; it: in iterator; k: out key; x: out item) is
   p: index renames it.p;
begin
   k:= ms(p).k; x:= ms(p).x;
end get;

4.12.7  B+ trees

The iteration on B+ trees is most easily solved by organizing the leaf nodes in a
doubly linked list, which can be achieved quite easily. From the point of view of
the type declaration, the leaf pages have to be extended with the two additional
pointers to the previous and to the next page in the list:

type node(t: nodetype) is
   record
      n: natural;
      case t is
         when leaf =>
            -- Exactly as in section 4.9
            prev, next: pnode;
         when interior =>
            -- Exactly as in section 4.9
      end case;
   end record;
Maintaining the list requires tiny additions to the algorithms that insert
and delete elements in the set. Concerning the operation empty, there is no
difference at all.

In the procedure put, in the place where a leaf is created at root level,
ie. in the place where the sentence p:= new node(leaf); appears, it must now
be complemented by setting the two pointers to null:

p:= new node(leaf); p.prev:= null; p.next:= null;
In the procedure put0, in the place where the tree grows in width at leaf
level, the sentence ps:= new node(leaf); must now be complemented by the
chaining of the node pointed by ps at the right of the node pointed by p:

ps:= new node(leaf);
pnext:= p.next;
ps.next:= pnext; ps.prev:= p; p.next:= ps;
if pnext/=null then pnext.prev:= ps; end if;

where pnext is a local auxiliary variable of type pnode. The figure below may
help to understand the changes in the links:
[Figure: the leaf list before and after the insertion; the new page ps is chained between p and pnext.]

Concerning the deletions, the links have to be updated when the tree decreases in width, ie. in the procedure remove0, immediately before the call to
remove_child, the following sentences have to be added:

if pls.t=leaf then
   nextprs:= prs.next; pls.next:= nextprs;
   if nextprs/=null then nextprs.prev:= pls; end if;
end if;
Again, a figure may be helpful to understand the changes in the links:

[Figure: the leaf list before and after the deletion; the page prs is unlinked and pls points directly to nextprs.]

With these links established, the implementation of the operations for iteration is straightforward. The type iterator is a record of a pointer to a leaf page
plus an index to some position in that page. Therefore we have:

type iterator is
   record
      p: pnode;
      i: positive;
   end record;

The operation first must go to the leftmost leaf, if any, and then point to the
first element in that page (alternatively, the type set might have been extended
with a pointer to the leftmost leaf):

procedure first(s: in set; it: out iterator) is
   root: pnode    renames s.root;
   p:    pnode    renames it.p;
   i:    positive renames it.i;
begin
   p:= root; i:= 1;
   if root/=null then
      while p.t=interior loop p:= p.tp(0); end loop;
   end if;
end first;

The subprograms next, is_valid and get are immediate and self-explanatory
too:

procedure next(s: in set; it: in out iterator) is
   p: pnode    renames it.p;
   i: positive renames it.i;
begin
   if i<p.n then i:= i+1;
   else p:= p.next; i:= 1;
   end if;
exception
   when constraint_error => raise bad_use;
end next;

function is_valid(it: in iterator) return boolean is
   p: pnode renames it.p;
begin
   return p/=null;
end is_valid;

procedure get(s: in set; it: in iterator; k: out key; x: out item) is
   p: pnode    renames it.p;
   i: positive renames it.i;
begin
   k:= p.te(i).k; x:= p.te(i).x;
exception
   when constraint_error => raise bad_use;
end get;

4.12.8  Tries

An iterator for a trie has to keep the data required by an algorithm to step from
one leaf to the next. Indeed, these data are the pointers to the nodes in the
path from the root to the current node, plus the choice taken in each branch,
ie. the symbol used to choose the branch down, as the figure below shows.

[Figure: a trie storing several words ended by the mark $; the iterator keeps the path of nodes from the root together with the symbol chosen at each branch.]

In principle, a stack might be used to keep such data. However, the sequence
of the choices is precisely the key to be retrieved. So, the stack to be used should
have an additional operation to retrieve all the stacked symbols, and it is more
practical to implement it directly by an array and an index. Consequently,

type path is array(key_index) of pnode;
type iterator is
   record
      pth: path;
      k:   key;
      i:   key_index;
   end record;

The procedure first is in charge of searching the leftmost branch down to the first
recorded key, if any. This is achieved by means of the procedure firstbranch,
which looks for the first non null pointer in the node, if any.

procedure first(s: in set; it: out iterator) is
   root:  pnode     renames s.root;
   pth:   path      renames it.pth;
   k:     key       renames it.k;
   i:     key_index renames it.i;
   c:     key_component;
   p:     pnode;
   found: boolean;
begin
   p:= root; i:= i0;
   firstbranch(p, c, found);
   while found and c/=mk loop
      pth(i):= p; k(i):= c; i:= i+1;
      p:= p.all(c);
      firstbranch(p, c, found);
   end loop;
   pth(i):= p; k(i):= mk;
end first;

where

procedure firstbranch
   (p: in pnode; c: out key_component; found: out boolean) is
begin
   c:= mk; found:= (p.all(c)/=null);
   while c<lastc and not found loop
      c:= key_component'succ(c);
      found:= (p.all(c)/=null);
   end loop;
end firstbranch;
If s is the empty set, the procedure first provides as a result an iterator value
containing a key that is the empty string. If so, the iterator will be considered
as non valid.
The procedure next has to find the next recorded key in the trie, if any. It
has two phases. First, it goes upwards until it reaches a node that has a non null
pointer for a symbol greater than the one that was chosen for the current key, if
any. Then it goes down the leftmost branch of the found subtree. The reader
should realize that if the first phase actually finds a subtree to go downwards,
that subtree contains at least one recorded key, because the algorithm that
inserts new keys never creates spurious branches (ie. branches that do not end
in a node labelled with $) and the algorithm for remove described in section 4.10
prunes them out in the case that they appear.
procedure next(s: in set; it: in out iterator) is
   root:  pnode     renames s.root;
   pth:   path      renames it.pth;
   k:     key       renames it.k;
   i:     key_index renames it.i;
   c:     key_component;
   p:     pnode;
   found: boolean;
begin
   if k(i0)=mk then raise bad_use; end if;
   p:= pth(i); c:= k(i);
   nextbranch(p, c, found);
   while not found and i>1 loop
      i:= i-1; p:= pth(i); c:= k(i);
      nextbranch(p, c, found);
   end loop;
   while found and c/=mk loop
      pth(i):= p; k(i):= c; i:= i+1;
      p:= p.all(c);
      firstbranch(p, c, found);
   end loop;
   pth(i):= p; k(i):= mk;
end next;

The procedure nextbranch is identical to firstbranch, except that
the search starts from the symbol next to c, instead of the first possible value
of the type key_component:

procedure nextbranch
   (p: in pnode; c: in out key_component; found: out boolean) is
begin
   found:= false;
   while c<lastc and not found loop
      c:= key_component'succ(c);
      found:= (p.all(c)/=null);
   end loop;
end nextbranch;

As stated earlier, the non valid value for the type iterator is characterized by
a key representing the empty string. So, checking the validity of the iterator is

function is_valid(it: in iterator) return boolean is
   k: key renames it.k;
begin
   return k(i0)/=mk;
end is_valid;

Retrieving the data pointed by the iterator is straightforward too:

procedure get(s: in set; it: in iterator; k: out key) is
begin
   if it.k(i0)=mk then raise bad_use; end if;
   k:= it.k;
end get;

4.12.9  Hashing: open

The retrieval of the elements from a set implemented by open hashing is rather
easy, although it is clear that it can not be sorted. The type iterator is represented by an index that indicates a slot and a pointer that addresses the
current node in that slot:
type iterator is
   record
      i: natural;
      p: pnode;
   end record;

The first element is the first cell of the first non empty slot, if any:

procedure first(s: in set; it: out iterator) is
   dt: dispersion_table renames s.dt;
   i:  natural          renames it.i;
   p:  pnode            renames it.p;
begin
   i:= 0;
   while i<b-1 and dt(i)=null loop i:= i+1; end loop;
   p:= dt(i);
end first;

If s is empty, all the lists are found empty and p is assigned the value null.
The procedure next is straightforward too. If further elements exist in the
current slot, then the next one is selected. Otherwise, the next non empty slot
is searched and its first element selected.

procedure next(s: in set; it: in out iterator) is
   dt: dispersion_table renames s.dt;
   i:  natural          renames it.i;
   p:  pnode            renames it.p;
begin
   if p=null then raise bad_use; end if;
   p:= p.next;
   if p=null and i<b-1 then
      i:= i+1;
      while i<b-1 and dt(i)=null loop i:= i+1; end loop;
      p:= dt(i);
   end if;
end next;
The subprograms is_valid and get are simple and self-explanatory:

function is_valid(it: in iterator) return boolean is
   p: pnode renames it.p;
begin
   return p/=null;
end is_valid;

procedure get(s: in set; it: in iterator; k: out key; x: out item) is
   p: pnode renames it.p;
begin
   k:= p.k; x:= p.x;
end get;

4.12.10  Hashing: closed

Iteration in closed hashing is even simpler than in open hashing, and the algorithms are the same no matter what the rehashing function is. Indeed, the retrieval
can not be sorted either.
The type iterator can be a simple index to the dispersion table:

type iterator is
   record
      i: natural;
   end record;

The first element is the first non free component of the dispersion table,
if any. If s is the empty set, all the components of the dispersion table will
be found free and the value assigned to the iterator will be b, which will be
recognized as a non valid iterator value:

procedure first(s: in set; it: out iterator) is
   dt: dispersion_table renames s.dt;
   i:  natural          renames it.i;
begin
   i:= 0;
   while i<b and then dt(i).st=free loop i:= i+1; end loop;
end first;

The procedure next is identical to first, but starting at the position next to
the one currently pointed by the iterator:

procedure next(s: in set; it: in out iterator) is
   dt: dispersion_table renames s.dt;
   i:  natural          renames it.i;
begin
   if i=b then raise bad_use; end if;
   i:= i+1;
   while i<b and then dt(i).st=free loop i:= i+1; end loop;
end next;

The subprograms is_valid and get are simple and self-explanatory:

function is_valid(it: in iterator) return boolean is
   i: natural renames it.i;
begin
   return i<b;
end is_valid;

procedure get(s: in set; it: in iterator; k: out key; x: out item) is
   dt: dispersion_table renames s.dt;
   i:  natural          renames it.i;
begin
   if i=b then raise bad_use; end if;
   k:= dt(i).k; x:= dt(i).x;
end get;

4.12.11  Hashing: extendible

Like in B+ trees, in extendible hashing data pages are created by splitting
earlier ones and their removal is achieved by merging previous ones, with the
obvious exception of the first created data page. Consequently, the approach to
the iteration in extendible hashing can be analogous to iteration in B+ trees,
ie. based on organizing the data pages in a doubly linked list. As this retrieval
does not involve the dispersion table, the approach is the same no matter whether
the dispersion table is bounded or not. Indeed, in the case of B+ trees the retrieval
is sorted, whereas in extendible hashing it can not be.
Concerning the type declaration, the data pages have to be extended with
the two pointers required for the organization of the doubly linked list, and the
type set extended with a pointer to the first element in the list:

type node(t: nodetype) is
   record
      case t is
         when disp_node =>
            -- Exactly as in section 4.11.5
         when data_node =>
            -- Exactly as in section 4.11.5
            prev, next: pnode;
      end case;
   end record;

type set is
   record
      -- Exactly as in section 4.11.5
      first: pnode;
   end record;
As in the case of B+ trees, keeping the data pages in a linked list requires only
a few additions to the algorithms that build values of the type, ie. empty, put
and remove. For empty, the old assignment s:= (null, 0); has to be replaced by

s:= (null, 0, null);

because s.first, the field that points to the beginning of the list, also has to be
set to null.
Regarding the procedure put, immediately after the sentence that creates
a data page at the root level, ie. root:= new node(data_node); the following
statements have to be added:

first:= root;
root.prev:= null; root.next:= null;

where root and first are renamings of s.root and s.first respectively.
In the procedure normal_put, in the place where a page is split into two,
ie. immediately after the sentence ps:= new node(data_node); the following
sentences have to be added:

pnext:= p.next;
ps.next:= pnext; ps.prev:= p; p.next:= ps;
if pnext/=null then pnext.prev:= ps; end if;

where pnext is a local auxiliary variable of type pnode. These sentences are
exactly the same as for the linking of data pages in B+ trees, and the figures
that may help to understand them can be found in section 4.12.7.
As usual, the operation remove requires a more careful attention. For the
procedure remove itself, immediately after the sentence root:= null; the following sentence has to be added:

first:= null;

where first is a renaming of s.first.
Regarding the subordinate procedure normal_remove, when a page pointed
by p is merged with its split image, pointed by ps, the elements from the latter
are copied into the former and the latter is removed from the structure. By
construction, one is next to the other in the list. To be precise, the one that
corresponds to a 0 in the leading bit goes before the one that corresponds
to a 1. To make the appropriate removal from the linked list, it is important
that the page that goes first in the list plays the role of p and the page that goes
next plays the role of ps, so that the elements are copied into the page that goes
first and the page that goes next is removed. Therefore, immediately before
the sentence merge(p, ps); the following sentences have to be added:

swap(p, ps);
nextps:= ps.next; p.next:= nextps;
if nextps/=null then nextps.prev:= p; end if;

where nextps is a local auxiliary variable of type pnode and the procedure
swap is:

procedure swap(p, ps: in out pnode) is
   paux: pnode;
begin
   if p.next/=ps then
      paux:= p; p:= ps; ps:= paux;
   end if;
end swap;
With these links established, the operations to iterate on the set are quite
straightforward, although slightly different from the case of B+ trees, because
now it may happen (although it is very unlikely) that some data page is empty.
The declaration of the type iterator is, like for B+ trees, a pointer to a data
page plus an index to a component in that page:

type iterator is
   record
      p: pnode;
      i: natural;
   end record;

The procedure first has to look for the first non empty page. Indeed, if
s.first/=null, at least a non empty page exists.

procedure first(s: in set; it: out iterator) is
   p: pnode   renames it.p;
   i: natural renames it.i;
begin
   p:= s.first;
   while p/=null and then p.ne=0 loop p:= p.next; end loop;
   i:= 1;
end first;

If additional elements remain in the current page, the procedure next selects
the next element in the current page. Otherwise, it looks for the next non empty
page:

procedure next(s: in set; it: in out iterator) is
   p: pnode   renames it.p;
   i: natural renames it.i;
begin
   if p=null then raise bad_use; end if;
   i:= i+1;
   if i>p.ne then
      p:= p.next;
      while p/=null and then p.ne=0 loop p:= p.next; end loop;
      i:= 1;
   end if;
end next;

The subprograms is_valid and get are so simple that they are self-explanatory:

function is_valid(it: in iterator) return boolean is
   p: pnode renames it.p;
begin
   return p/=null;
end is_valid;

procedure get(s: in set; it: in iterator; k: out key; x: out item) is
   p: pnode   renames it.p;
   i: natural renames it.i;
begin
   k:= p.te(i).k; x:= p.te(i).x;
end get;

4.13  The impact of sorted retrieval

The most obvious advantage of the fact that the elements from a set can be
retrieved in order is the possibility of printing them out in a more attractive
and useful way.
Nevertheless, there are other important advantages, perhaps less obvious.
For instance, the fact that some sets can be retrieved in sorted order with respect to the same index gives a database management system the chance to
implement queries involving joins in a more efficient manner. It is out of the
scope of this book to describe the strategies for optimizing queries in database
systems; however, it is essential for a database designer to have a good understanding of how B+ trees and extendible hashing work, so that he can decide
which approach he shall request from the database system to implement a given index, according to the expected queries and their respective frequencies. Indeed,
hashing provides a faster access to individual elements, whereas B+ trees provide sorted retrieval, which can be the basis for more efficient implementations
of some kinds of queries.

4.14. SUMMARY OF IMPLEMENTATION TECHNIQUES

181

Sorted retrieval is also important regarding the operations of union, difference and intersection. If it is available, these operations can be implemented in
linear time by means of merging algorithms on sorted sequences such as those
studied for ordered sequential files in section 8.5 of vol. II. Otherwise, they have
to be implemented by traversing one of the sets and checking, for each element
retrieved from the first set, if it is in the second set. Indeed, the sets implemented by
means of boolean arrays are an exception, as for such an implementation these
operations can be best implemented by means of bit-based operations. The
details are left as an exercise to the reader.
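As an illustration of the merging approach, a possible sketch of the union of two sets delivered in ascending order by the iterators of section 4.12 (this code is ours; the operations empty and put are those of the chapter, and a "<" comparison on keys is assumed):

procedure union(a, b: in set; u: in out set) is
   ita, itb: iterator;
   ka, kb: key; xa, xb: item;
begin
   empty(u);
   first(a, ita); first(b, itb);
   while is_valid(ita) and is_valid(itb) loop
      get(a, ita, ka, xa); get(b, itb, kb, xb);
      if    ka < kb then put(u, ka, xa); next(a, ita);
      elsif kb < ka then put(u, kb, xb); next(b, itb);
      else  put(u, ka, xa); next(a, ita); next(b, itb);
      end if;
   end loop;
   while is_valid(ita) loop  -- remaining elements of a, if any
      get(a, ita, ka, xa); put(u, ka, xa); next(a, ita);
   end loop;
   while is_valid(itb) loop  -- remaining elements of b, if any
      get(b, itb, kb, xb); put(u, kb, xb); next(b, itb);
   end loop;
end union;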

4.14  A summary of the different implementation techniques

This long chapter has presented a rather complete repertoire of techniques for
implementing sets. A proficient programmer should be familiar with all of
them, so that he can choose the most appropriate one for the particular characteristics of his application. The choice should be made according to different
factors:

1. The size of the sets. For small sets the simplest techniques are preferable,
   and they can provide even lower execution times than the more sophisticated ones. Instead, for large sets, with the simplest techniques, such as
   arrays of elements or linked lists, the execution times of the operations
   become unacceptably high.

2. The keys are of a discrete type. For keys of discrete types, the implementations based on arrays indexed by the keys themselves are the best
   choice, unless the cardinality of the type exceeds by far the sizes of the sets
   that are expected to be managed.

3. The keys have a total order relationship defined. If not, techniques
   such as binary search trees and B+ trees can not be applied.

4. The values of the keys have a unique binary representation. If
   not, hashing techniques have to be discarded.

5. The keys are arrays of symbols. If so, the trie approach should be
   considered as a good choice.

6. The set has to be stored in secondary storage. If so, B+ trees
   and extendible hashing are the most appropriate techniques. In particular
   circumstances, other techniques can also be adapted to secondary storage.
   For instance, the nodes of a trie can be implemented as records in a direct
   file.

7. Sorted retrieval is required. The applications that do require the
   sorted retrieval of the elements stored in the set must discard efficient
   techniques such as hashing.

8. Remove is required. If so, closed hashing must be discarded.

9. The expected frequency of each operation. Indeed, the implementation technique should be chosen so that it provides the highest efficiency
   for the most frequent operations.

The table below summarizes the asymptotic execution times of the operations of the abstract data type set according to the different implementation
techniques:
                          empty    get      put      sorted      union
                                   update   remove   retrieval   intersection
                                                                 difference

Arr. of booleans            n        1        1         n            n
Arr. of elem. (unsort.)     1        n        n        -††           n²
Arr. of elem. (sorted)      1      log n      n         n            n
Linked list (unsort.)       1        n        n        -††           n²
Linked list (sorted)        1        n        n         n            n
Binary search trees         1     log n†   log n†       n            n
Bal. bin. search trees      1     log n    log n        n            n
B+ trees                    1     log n    log n        n            n
Tries                       1        1        1         n            n
Hashing                     n        1        1        -†††          n

†   These values are in average. There exist worst cases that make them O(n).
††  A sorting algorithm is needed, which can be applied in the same structure.
††† A sorting algorithm is needed, but it must be applied in a different structure.

In all cases, there is the obvious requirement that the keys have the test for
equality defined. Additionally, the different implementations have the following
constraints regarding the properties of the keys:

Arr. of booleans:          The type of the keys is discrete and the number of
                           elements that are expected to be stored in the sets
                           is about the range of this type.

Arr. of elem. (unsort.),
Linked list (unsort.):     No additional constraints.

Arr. of elem. (sorted),
Linked list (sorted),
Binary search trees,
Bal. bin. search trees,
B+ trees:                  There is a total order relationship defined to
                           compare the keys.

Tries:                     The keys are arrays whose elements belong to a
                           discrete type of a limited range.

Hashing:                   Either the keys have a specific hash function defined
                           or each key value has a unique binary representation.
