Data Structures
Chapter 1

BASIC MECHANISMS OF DYNAMIC MEMORY

1.1 Introduction
Hitherto, collections of data of the same type have been stored in arrays.
For instance, words have been stored as arrays of characters. The difficulty
introduced by this kind of representation is that some upper bound has to
be estimated in order to set the size of the array. In some circumstances, such
upper bounds are easy to estimate, but there are applications where the
collections to store can vary considerably in size from one execution of a program
to another. In these cases, the overestimation of the array sizes can lead to an
unnecessary waste of memory space.
Dynamic memory is a mechanism that allows us to construct data structures
that vary in size during the execution of the program. The essential principle of
dynamic memory is that we can request pieces of memory to allocate variables
at run time, and we get a pointer to the allocated memory space. A pointer is
some information that is equivalent to the address of the reserved block. Using
such an address we can either store values in or retrieve values from that space.
Usually, these dynamically allocated blocks of memory also contain pointers
to other blocks, and the whole set is then called a linked structure.
Indeed, programming with such pointers is very error prone. Algorithms
acting on pointers are harder to understand and prove correct than algorithms
having no pointers and, consequently, it is very easy to make serious mistakes
when they are used. Therefore, the construction of algorithms that make use of
pointers requires a lot of care.
The main goal in the design of Ada has been to provide the highest possible
reliability to software. In particular, its strong type system is a powerful tool
that allows the compiler to detect a wide range of inconsistencies, especially if
a proper programming style is used to take benefit from that system. Pointer
types are not an exception: even though the information they contain is more
or less equivalent to a memory address, their use is restricted to prevent the
programmer from introducing inconsistencies.
This chapter, and therefore this book, does not introduce all the ways in
which pointers can be used in Ada, but only the features that are required in
the remaining chapters of this volume.
1.2 Access types
In Ada, pointers are called access types because they provide access to other
objects. In this book we will use the two terms indistinctly.
Basic operations on access types
An access type is always bound to some data type. Access types are declared as follows:
type T is ... ;
type p_T is access T;
p: p_T;
These declarations state that p is a variable of type p_T, which means that it is
able to contain the location in memory where some datum of type T is stored.
The allocation of space to store a datum of type T is achieved by the operation
p:= new T;
This operation reserves memory space to allocate data of type T and sets the
address of this place - or some information equivalent - into the variable p. The
expression p.all denotes the block of memory addressed by p and can be
used as any variable of type T. Therefore, after the declarations
p, q: p_T;
x: T;
we can write the assignments
p.all:= x;
q:= p;
The first assignment copies the value of x to the block pointed to by p. The
second assignment copies the address contained in p to q. As a result, both p
and q point to the same block of memory.
A variable of an access type can also contain a special value, called null,
with the meaning that it points nowhere. This special value can be assigned to
them as in
p:= null;
If an access variable p is equal to null, the attempt to refer to p.all makes no
sense and will raise the exception constraint_error, exactly as when we refer to an
array component with an index which is out of bounds.
Besides the test for equality, which also applies to access types, these are the
only permitted operations on pointers. Any attempt to add two pointers, or
to add an integer value to an access value, is a type-inconsistent
expression and will be refused by the compiler.
Strict typing of access types
Pointers addressing different types of data are incompatible. For instance, after
the declarations
type T1 is ... ;
type T2 is ... ;
type p_T1 is access T1;
type p_T2 is access T2;
p1, q1: p_T1;
p2, q2: p_T2;
x1: T1;
x2: T2;
The following assignments are type inconsistent and will be refused at compile time:
p1:= p2;    -- It causes a compile-time error.
p2:= p1;    -- It causes a compile-time error.
q1:= p2;    -- It causes a compile-time error.
q2:= p1;    -- It causes a compile-time error.
Instead, the null value can be assigned to any variable of any access type1, as
in
p1:= null;
p2:= null;
In most applications, access types refer to blocks of some record type. For
instance:
type T is
record
a: Ta;
b: Tb;
end record;
type pT is access T;
p, q: pT;
x: T;
1 Actually, Ada allows declaring access types for which the null value is forbidden.
However, we will not make use of this feature in the scope of this book.
p:= new T;
q:= new T;
p.all.a:= x.a;
q.all.a:= p.all.a;
p.all:= x;
However, this notation is somewhat cumbersome unless the algorithms are very
short, and it can be abbreviated by omitting the qualifier all. Consequently, the
following assignments have the same meaning as those above:
p:= new T;
q:= new T;
p.a:= x.a;
q.a:= p.a;
p.all:= x;
In the last assignment, the qualifier all cannot be omitted because p and x are
of different types and x cannot be assigned to p. It is required to specify that
the destination is not p itself but the object addressed by p. When a record field
appears, there is no possible confusion.
Forward declarations
As we shall see soon, we shall require quite often that some of the fields of the
records allocated in dynamic memory are also access types, pointing to other
blocks of the same type. If so, a circular dependency appears in the declarations,
which must be solved by a forward declaration, as in
type cell;
type pcell is access cell;
type cell is
record
a: integer;
s: pcell;
end record;
The first declaration tells the compiler that cell is the name of a type. This
information is sufficient to check that the second declaration is consistent and
to process it. In particular, the size required for variables of type pcell is known.
Finally, the third declaration provides the full details of type cell and can be
properly processed because all the details for pcell are known.
The reader should note the strong similarity of such forward declarations
with the forward declarations required in the case of mutually recursive subprograms, which appeared in section 3.6.5 of vol. III.
Linked structures
The fact that blocks in dynamic memory can reference other blocks allows us
to build structures such as a simple chain of cells, each one pointing to the next.
These kinds of structures are called linked structures, and we shall make a systematic
use of them throughout the remaining chapters of this volume. The reader is
expected to learn there both their usefulness and the ability to build algorithms
that manage them.
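As an illustration, a chain of cells of the type declared above can be traversed by following the s links until null is reached. The following sketch (the name sum is ours, not from the text) adds up the a fields of a chain:

```ada
-- Hypothetical illustration: summing the integer fields of a chain of
-- cells, using the types cell and pcell declared above.
function sum(p: in pcell) return integer is
   r: pcell:= p;
   n: integer:= 0;
begin
   while r /= null loop   -- stop when the chain is exhausted
      n:= n + r.a;
      r:= r.s;            -- advance to the next cell
   end loop;
   return n;
end sum;
```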
Garbage collection
Blocks that have been allocated in dynamic memory can become inaccessible
if no access type variables point to them any longer. For instance, the blocks
in the linked structure above become inaccessible if null is assigned to p. The
run-time system is then free to reclaim the space they occupy and reuse it for
further allocations.
The allocation operator new can include the value to be assigned to the newly
created block. This value must be provided as a qualified expression:
p:= new cell'(7, null);
p:= new cell'(5, p);
which create the same linked structure as above.
Indeed, record values can be built using the named forms too, as in
p:= new cell'(a=> 7, s=> null);
p:= new cell'(a=> 5, s=> p);
which is less error prone in the case that there is more than one field of the
same type.
Access types and unconstrained types
Access types may also point to cells of an unconstrained type. If so, the constraint must be set on allocation, exactly in the same way as it must be set to
declare a variable (see sections 3.2, 3.6 and 5.5 of vol. II). For instance, after
the following declarations2
type p_str is access string;
p, q: p_str;
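A continuation of this example can be sketched under the stated rule (the exact statements are our assumption): the constraint is fixed at allocation time, either explicitly or through an initial value.

```ada
-- Hypothetical sketch: allocating blocks of the unconstrained type string.
p:= new string(1..10);       -- explicit index constraint
q:= new string'("hello");    -- constraint taken from the initial value
```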
2 The type string is declared in the package standard as an unconstrained array of characters
(see subsection 3.2.2 of vol. II).
Access types may also point to records with discriminants. For instance:
type cell;
type pcell is access cell;
type cell_type is (atom, unary, binary);
type unary_op is (minus, sin, cos, log, exp);
type binary_op is (add, sub, prod, quot);
type cell(t: cell_type) is
record
case t is
when atom =>
x: integer;
when unary =>
u_op: unary_op;
opnd: pcell;
when binary =>
b_op: binary_op;
left_opnd: pcell;
right_opnd: pcell;
end case;
end record;
p, q: pcell;
The allocation of blocks to keep values of type cell must necessarily provide the discriminant, as in
p:= new cell(unary);
However, any value of type pcell, no matter what the variant of the pointed
object is, can now be assigned to p.all.opnd, as in
p.all.opnd:= q;
Default values
In Ada, any variable of an access type has the initial value null. Nevertheless,
the algorithms presented in this volume do not take advantage of this feature,
because this text is intended to serve as a handbook of algorithms that can be
either used directly in Ada or translated to other programming languages that
do not have this feature. Therefore, whenever it is required that an access
variable has the value null, it will always be explicitly assigned.
Running out of memory space
If the run time system that manages the dynamic memory is unable to provide
the space of memory requested by the allocator new, because no more space is
available, then the exception storage_error is raised.
Pointers to subprograms
For the sake of completeness, the reader should recall that access types may
also point to subprograms, as described in section 7.5 of vol. II and used in
sections 7.6 to 7.9 of the same volume.
Safety of the access type rules in Ada
As shown, Ada imposes restrictive rules upon the use of access types. Although
the underlying information they contain is essentially addresses in memory, the
only way these values can be obtained is by means of the allocation operator
new. These values can later be copied, but never modified. Moreover, a pointer
to blocks of a type can never be assigned an address corresponding to a block
of a different type.
These restrictive rules provide the guarantee that bits in memory representing one type can never be misinterpreted as if they represented a different
type.
Any inconsistency is detected, most of them at compile time, which makes it
easy to fix the bug. A few kinds of inconsistencies, however, cannot be detected
until run time, in which case the exception constraint_error is raised. This
will happen after the attempt to get data referenced by a pointer having the
value null.
The other case is the attempt to assign blocks corresponding to different
sizes or discriminant values of unconstrained types, or to reference a field of a
record which is inconsistent with the value of the discriminant. However, these
kinds of inconsistencies are not specific to access types and they may happen with
conventional variables as well. For example, a procedure handling a variant
record directly, with no access types, may raise the same exception in exactly
the same circumstances.
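Such a procedure can be sketched as follows (our reconstruction, not the book's original listing), using the variant record cell declared earlier; referencing a field inconsistent with the discriminant raises constraint_error with no pointers involved:

```ada
-- Hypothetical sketch: constraint_error without access types.
procedure show_discriminant_check is
   c: cell(atom);   -- variant record from the declarations above
   i: integer;
begin
   i:= c.x;         -- legal: the field x exists in the atom variant
   -- Referencing c.u_op would fail the discriminant check at run time,
   -- because u_op exists only in the unary variant: constraint_error.
end show_discriminant_check;
```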
1.3 Cursors
Linked structures can also be built using indexes to arrays instead of access
types. For instance, after the following declarations
type index_cell is new integer range 0..100;
type cell is
record
a: integer;
s: index_cell;
end record;
type cell_pool is array(index_cell range 1..index_cell'last) of cell;
cp: cell_pool;
p: index_cell;
Suppose, for instance, that p = 30. Then cp(30) holds the first cell of the chain,
and cp(30).s gives the index of the component holding the next one, and so on
until a component whose s field is 0 marks the end of the chain.
In this kind of implementation of linked structures, the links are called cursors
instead of pointers. The reader should recall that a very simple use of a
linked structure using cursors was already introduced in the description of bin
sorting (see section 5.7 of vol. III).
When cursors are used, the memory space consists of components of arrays,
and the algorithms have to keep track of the components that are in use and those
which are available for further use. Usually, this is achieved by linking the free
components and indicating the beginning of this linked structure by an additional
index variable. Some care must be taken to remove a component from the pool
of free blocks each time a new one is required, as well as to reintroduce it into that
pool when the block becomes no longer accessible, because
for cursors these operations must be explicit.
Naturally, for each kind of block we require, a different array of memory
space shall be declared. A careful declaration of incompatible types allows the
compiler to detect most inconsistencies at compile time, making the development
faster.
As we shall see in chapter 2, the algorithms using access types and the algorithms using cursors are very much the same. The advantages and drawbacks
of each approach will be discussed there too.
Chapter 2

LINEAR STRUCTURES

2.1 Introduction
At this stage, the reader is already familiar with the use of stacks, which were often required to convert recursive algorithms into iterative ones (see sections 3.5
to 3.7 of vol. III), although stacks can be applied to other kinds of problems
as well. In section 6.2.5 of vol. III, an algebraic specification for stacks is provided. An algebraic specification for queues was also presented in section 6.2.6
of vol. III, although no algorithm from this book has made use of them hitherto.
Both stacks and queues are pools of data elements that have a common characteristic, namely that the order in which elements are retrieved and removed
depends exclusively on the order in which they are inserted. In the case of
stacks the criterion is last in, first out (usually indicated as LIFO), and in the
case of queues it is first in, first out (usually indicated as FIFO). Therefore, in both
cases the elements in the pool can be imagined as forming a linear chain where
the elements are inserted and retrieved at the extremes of the chain. In the
case of stacks, the extreme for retrieval is the same as the extreme for insertion,
whereas in the case of queues the extreme for retrieval is the opposite of the
extreme for insertion. Consequently, stacks and queues are classified as
linear data structures.
In chapter 3 of vol. III, stacks were implemented by means of an array plus an
integer, which is the most straightforward way. In this chapter we will also
consider the possibility of implementing stacks by means of access types and cursors,
and we will discuss the advantages and drawbacks of the different choices. Three
ways of implementing queues will be presented and analyzed too.
In computer science, the term list usually designates a linearly linked structure - i.e. a structure where each component has at most one successor - in
which insertions and retrievals can be done at arbitrary positions of the chain.
Lists are therefore also linear structures, and the appropriate place to study them
is beside stacks and queues.
However, in contrast with stacks and queues, lists are not abstract data
types but a technique that is extremely useful in the implementation of other
types. Any attempt to characterize the behavior of lists ignoring the implementation details (i.e. the links) actually describes the type multiset (i.e. a set where
more than one copy of the same element can be stored and retrieved independently). But to implement other types, multisets do not have the advantages
derived from the flexibility that lists have. To sum up, queues and stacks will be
treated as abstract data types, and different implementations will be presented
and compared. Instead, lists will be treated as a technique that can be used in
many circumstances, as we shall see in later chapters of this volume.
2.2 Stacks

2.2.1 Specification
A stack is a data pool where the elements are retrieved in the inverse order
they have been stored. The algebraic specification of the data type stack has
been provided in section 6.2.5 of vol. III. From an algebraic point of view, any
operation is a function, which makes both the algebraic specification of the type
and the formal proof of the algorithms that use it reasonably simple. However,
as stated in section 6.2.1 of vol. III, in imperative languages such as Ada, obvious
efficiency considerations force us to express most of the operations as procedures
instead of functions. Henceforth, instead of invoking a function push as
s:= push(s, x);
we shall achieve the same effect programming a procedure push with an in out
parameter of type stack:
push(s, x);
The resulting package specification is:
generic
type item is private;
package dstack is
pragma pure(dstack);
type stack is limited private;
bad_use:        exception;
space_overflow: exception;
procedure empty   (s: out stack);
procedure push    (s: in out stack; x: in item);
procedure pop     (s: in out stack);
function top      (s: in stack) return item;
function is_empty (s: in stack) return boolean;
private
-- Implementation dependent
end dstack;
2.2.2 Compact implementation
private
type index is new integer range 0..max;
type stack is
record
... -- an array of item holding the stored elements
n: index;
end record;
end dstack;
The operations for this implementation are rather obvious. The attempts
to access the array out of its range will raise the exception constraint_error.
According to the circumstances, such an exception has to be converted either
to bad_use or to space_overflow.
The resulting package body becomes:
2.2.3 Implementation by access types
In many circumstances it is not easy to foresee a safe upper bound for the size of
a stack. Consequently, we might be interested in providing an implementation
that is only bound by the memory space effectively available for the program.
This can be achieved by a linked structure in which the element on top of the
stack is the one pointed to by top. According to this implementation, the type
stack is represented just by this pointer and the resulting private part becomes:
generic
type item is private;
package dstack is
-- Exactly as in section 2.2.1
private
type cell;
type pcell is access cell;
type cell is
record
x:    item;
next: pcell;
end record;
type stack is
record
top: pcell;
end record;
end dstack;
The implementation of the operations is rather simple and an excellent introduction to the use of access types. They are presented here in increasing
order of difficulty.
An empty stack is represented by a chain containing no element, that is,
by the top pointer having the value null.
Pushing a new item requires allocating a new cell and linking it to the existing
structure. If the run time system does not find space to allocate the new cell,
a storage_error will be raised. If so, this exception must be converted into the
more explicit space_overflow1:
procedure push(s: in out stack; x: in item) is
top: pcell renames s.top;
r:   pcell;
begin
r:= new cell;
r.all:= (x, top);
top:= r;
exception
when storage_error => raise space_overflow;
end push;
The three steps can be followed easily: starting from some initial stack, after
r.all:= (x, top); the new cell contains x and points to the former top; after
top:= r; the new cell becomes the top of the stack.
The reader should also be aware that if initially top=null, then the same
operations lead, as expected, to a stack containing a single cell.
The elimination of the top element of the stack is rather straightforward:
1 Each instance of dstack will define its own instance of the exception. So, if dstack_T is
an instance of dstack for items of type T, a specific exception becomes declared and its name
is dstack_T.space_overflow.
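A minimal sketch of this operation, consistent with the push above and with the remarks that follow (our reconstruction, not necessarily the book's exact listing):

```ada
-- Sketch: remove the top element; an empty stack raises bad_use.
procedure pop(s: in out stack) is
   top: pcell renames s.top;
begin
   top:= top.next;   -- if top=null, this raises constraint_error
exception
   when constraint_error => raise bad_use;
end pop;
```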
The cell containing x, being now inaccessible, becomes garbage and the
space it occupies can be reused. If the stack initially contained a single item,
then the same operation leads to top=null.
Indeed, if the stack is initially empty, the attempt to consult top.next will
raise the exception constraint_error. This exception must be reconverted into
the more precise bad_use.
2.2.4 Implementation by cursors

Overview
The implementation provided in section 2.2.3 has the advantage that the size
of the stacks is only bounded by the memory that is available in the computer
for the running program. However, it has the hidden drawback of the work to
be done by the run-time system to detect inaccessible blocks and reorganize
the memory to reuse the released blocks2. This drawback is less important for
structures that only grow, or shrink occasionally, but it may become serious for
structures such as stacks, that grow and shrink continuously.
In some algorithms a bounded number of items move from one stack to
another. In such cases, it would be inappropriate to implement each stack
by an array and an index, because when one stack is full we know that the
others are empty and therefore such an implementation would be wasteful of
memory space. Instead, we may implement all the stacks as linked structures
with cursors on a common array. Any component of the array has a field for
the item and a field to point to the next component of the chain, if any. Any
stack is implemented by an index pointing to the component of the array where
the top of that stack is stored.
2 The algorithms for managing the dynamic memory are studied in chapter 8 and, as the
reader shall see, they are far from simple and occasionally may become time consuming.
The array used as support must be hidden in the package body. The available
components must be linked before any operation on stacks is executed and an
index variable named free shall point to the beginning of the linked structure
of available components.
Data representation
According to these considerations, the specification part of the package for this
implementation is:
generic
type item is private;
max: integer:= 100;
package dstack is
-- Exactly as in section 2.2.1 but with pragma pure removed.
private
type index is new integer range 0..max;
type stack is
record
top: index;
end record;
end dstack;
and the structure of the body is:
package body dstack is
type block is
record
x:    item;
next: index;
end record;
type mem_space is array(index range 1..index'last) of block;
ms: mem_space;
free: index;
... -- Bodies of the subprograms for stack operations and other auxiliary
    -- subprograms, to be detailed below.
begin
prep_mem_space;
end dstack;
procedure prep_mem_space is
begin
for i in index range 1..index'last-1 loop ms(i).next:= i+1; end loop;
ms(index'last).next:= 0;
free:= 1;
end prep_mem_space;
and its effect is to leave all the components of ms linked starting at the position
indicated by free.
We shall require two auxiliary operations: one to get a block from the list of
free blocks (i.e. the operation equivalent to the allocator new for access types),
and one to bring a block back to the pool of free blocks when it becomes
no longer accessible (an operation that is implicit in the case of access types,
but must be explicit in the case of cursors).
The first of these auxiliary subprograms is
function get_block return index is
r: index;
begin
if free=0 then raise space_overflow; end if;
r:= free; free:= ms(free).next; ms(r).next:= 0;
return r;
end get_block;
The algorithm is quite straightforward: it selects the block pointed to by free
and removes it from the linked chain of free blocks, in a similar way as the pop
operation when the stack was implemented by access types. If the pool of free
blocks is empty, it raises the exception space_overflow. Initially free points to
the first component of the chain; after r:= free; both r and free point to it, and
after free:= ms(free).next; the block is detached from the chain and kept in r.
The second auxiliary subprogram brings a released block back to the pool of
free blocks:
procedure release_block(r: in index) is
begin
ms(r).next:= free;
free:= r;
end release_block;
After ms(r).next:= free; the released block points to the former first free block,
and after free:= r; it becomes the new head of the pool.
In this first version, the procedure empty simply sets the top index of the
stack to 0. The procedure empty will be revisited below.
The operation top is straightforward as well. The only significant consideration is that in this implementation the attempt to get the top of an empty
stack has the effect of consulting the ms array at position 0, which is out of its
allowed range. Consequently, a constraint_error will be raised, which must be
converted into a more explicit exception.
function top(s: in stack) return item is
p: index renames s.top;
begin
return ms(p).x;
exception
when constraint_error => raise bad_use;
end top;
The operation push gets a free block by means of get_block, stores the item
into it and links it to the former top of the stack. And finally, after p:= r; the
new block becomes the top of the stack (regardless of r, which is a local variable).
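A sketch of this operation under the representation above (our reconstruction; the exception handling mirrors the access-type version):

```ada
-- Sketch: push for the cursor-based stack.
procedure push(s: in out stack; x: in item) is
   p: index renames s.top;
   r: index;
begin
   r:= get_block;      -- may raise space_overflow
   ms(r).x:= x;
   ms(r).next:= p;     -- link the new block to the former top
   p:= r;              -- the new block becomes the top
end push;
```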
procedure pop(s: in out stack) is
p: index renames s.top;
r: index;
begin
r:= p;
p:= ms(p).next;
release_block(r);
exception
when constraint_error => raise bad_use;
end pop;
This operation works in three steps. First, an auxiliary index keeps the
position of the element to be deleted, because later we shall bring it back to
the pool of free blocks. Then this element is actually removed from the stack,
by making p point to the next element of the stack. Finally, the released block
may actually be brought back to the pool of free blocks because it is no longer
accessible.
After r:= p; both r and p point to the block to be removed; after
p:= ms(p).next; the block is removed from the stack; then the block pointed
to by r is inserted back into the pool of free blocks, and r vanishes because it
is a local variable.
If the stack is initially empty, the exception constraint_error will be raised
because then p = 0 and it is out of the range of ms.
The operation empty revisited
The implementation of the procedure empty presented above is fine if it is applied for the first time to the actual argument. However, if the actual parameter
is a variable of type stack that has been used previously, and its value is not
the empty stack, the current version of the procedure empty sets the top index
to 0, which leaves the blocks currently in the stack inaccessible, and they cannot
be used any longer.
Indeed, it would be more appropriate to bring the blocks currently in the
stack back to the pool of available blocks. However, this cannot be done if the
actual argument is a variable that is used for the first time, because in this case
its contents are undefined.
The way to solve this problem is to force all the variables of type stack to
have an initial value equivalent to an empty stack. This is achieved by declaring
the type stack as:
type stack is
record
top: index:= 0;
end record;
and then, being sure that any unused variable of the type stack has its top field
set to 0, empty can be safely implemented as
procedure empty(s: out stack) is
p: index renames s.top;
begin
release_stack(p);
end empty;
where release_stack is a memory management operation that releases all the
blocks currently in the stack by linking the last element of the stack to the first
one of the pool of free blocks.
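A sketch of such an operation (our reconstruction: it walks to the last cell of the stack, links it to the head of the free pool, and leaves the stack empty):

```ada
-- Sketch: return all the blocks of the stack to the pool of free blocks.
procedure release_stack(p: in out index) is
   r: index;
begin
   if p /= 0 then
      r:= p;
      while ms(r).next /= 0 loop   -- find the last cell of the stack
         r:= ms(r).next;
      end loop;
      ms(r).next:= free;           -- link it to the former free pool
      free:= p;                    -- the whole chain now heads the pool
      p:= 0;                       -- the stack becomes empty
   end if;
end release_stack;
```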
2.3 General considerations
The study of the implementations of the type stack may be useful to make a
few considerations that are of general interest in the implementation of any data
type.
2.3.1 Implementation independence
The reader should be aware that for all the implementations of the data
type stack studied in section 2.2, the visible part of the specification of
the package dstack has remained unchanged. Therefore, any software using
the type stack is not affected at all by the replacement of one implementation
of the type by another. Structuring the software in modules that can change
their implementation while keeping their external behavior unchanged is essential
in making software maintenance feasible.
2.3.2 Efficiency considerations
In some implementations, the cost of a subprogram call may be several
times greater than the involved operation itself. In these cases, the programmer
should use the pragma inline, as explained in section 7.8 of vol. I, to force the
compiler to implement the call by expanding the code inline instead of making
an actual call, so that the resulting code is as efficient as if the implementation
was directly visible by the calling program. In the case of stacks implemented
by arrays, it would be reasonable to add, in the visible part of the specification
part,
pragma inline(empty, is_empty);
2.3.3 Pure packages
If pragma pure appears in the specification part of a package, the user of that
package may be sure that the package has no internal state. As a consequence,
that package has several nice properties. On one hand, the result of the operations depends only on their arguments, and therefore their behavior can be
algebraically specified according to the techniques presented in section 6.2 of
vol. III, and serve as a basis for the formal reasoning on algorithms that use the
type. On the other hand, they can be safely used by concurrent tasks, as will
be discussed in more detail in vol. V. Also, testing becomes simplified.
To grant this property, in the presence of pragma pure the compiler will
refuse any declaration of variables both in the specification part of the package
and in its body. Moreover, only packages declared at library level can be pure
and they shall depend only on other units that are also declared as pure. The
use of access types is rejected too, as the blocks created in dynamic memory act
as global variables.
In the case of our stack, the implementation based on an array plus an
index would be accepted by the compiler in a pure package. Instead, the
linked implementations would not, no matter whether the approach is access-type
based or cursor based. Certainly, in a sequential environment, the results of
the operations depend only on their arguments in the three cases, because the
internal state of the linked implementations is used just for memory management
purposes, but the compiler has no means to guarantee it.
As much as possible, software should be built on the basis of pure units,
because they are usable in any context and it is easier to reason about them.
2.3.4 On raising exceptions
2.4 Queues

2.4.1 Specification
A queue is a data pool where the elements are retrieved in the same order they
are inserted. The abstract data type queue has been algebraically specified in
section 6.2.6 of vol. III.
As for stacks, efficiency considerations force us to implement some of the
operations as procedures with parameters of mode in out instead of as functions,
and also some exceptions must be declared so that they can be raised either
in the case of attempts to execute operations upon illegal values of the type,
or in the case of running out of memory space. Consequently, the package
specification part should be as follows, where rem_first is an abbreviation for
remove first.
generic
type item is private;
package dqueue is
pragma pure(dqueue);
type queue is limited private;
bad_use:        exception;
space_overflow: exception;
procedure empty     (qu: out queue);
procedure put       (qu: in out queue; x: in item);
function get_first  (qu: in queue) return item;
procedure rem_first (qu: in out queue);
function is_empty   (qu: in queue) return boolean;
private
-- Implementation dependent
end dqueue;
2.4.2 Compact implementation
The most straightforward implementation of a queue is an array and two integers, the first one pointing to the place the next incoming item must be stored
in and the second one pointing to the place of the next item that must be retrieved.
As the items are deleted, the components of the array become released
and can be used again. Therefore, the indexes must traverse the array in a
circular way. The easiest way to detect that the queue is empty or
full is to add a counter that keeps the number of items stored in the queue.
The reader should remark that, although the counter will take the same values
as the indexes (plus 0), the counter is declared of an incompatible type to
prevent it from being used as an index by mistake.
The operations for this implementation are rather obvious and the resulting
package body becomes:
package body dqueue is
procedure empty(qu: out queue) is
p: index renames qu.p;
q: index renames qu.q;
n: natural renames qu.n;
begin
p:= 1; q:= 1; n:= 0;
end empty;
2.4.3 Implementation by access types
In many circumstances, it is not easy to foresee a safe upper bound for the size
of the queues. Consequently, we might be interested in providing an implementation that is only bound by the memory space effectively available for the
program. This can be achieved by a linked structure in which p points to the
item most recently stored in the queue and q points to the oldest item in the
queue. According to this implementation, the type queue is represented by these
two pointers and the resulting private part becomes:
generic
type item is private;
package dqueue is
-- Exactly as in section 2.4.1
private
type cell;
type pcell is access cell;
type cell is
record
x:    item;
next: pcell;
end record;
type queue is
record
p, q: pcell;
end record;
end dqueue;
procedure empty(qu: out queue) is
p: pcell renames qu.p;
q: pcell renames qu.q;
begin
p:= null; q:= null;
end empty;

function is_empty(qu: in queue) return boolean is
q: pcell renames qu.q;
begin
return q=null;
end is_empty;
Getting the first item of the queue, i.e. the oldest one, is as simple as getting the item pointed to by q. However, if q=null, the attempt to dereference it will raise the exception constraint_error, which has to be converted into the more explicit bad_use because it results from applying an operation to an illegal argument:
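The listing itself can be sketched as follows (a reconstruction consistent with the description, not the book's exact text):

```ada
function get_first(qu: in queue) return item is
   q: pcell renames qu.q;
begin
   return q.x;  -- dereferencing q=null raises constraint_error
exception
   when constraint_error => raise bad_use;
end get_first;
```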
[Figure: put on an initially empty queue — initial and final states]
For the second case, after r:= new cell; r.all:= (x, null); we have the following situation:

[Figure: the new cell pointed to by r, about to be linked after the cell pointed to by p]
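A sketch of put covering both cases — the empty queue and the non-empty one — consistent with the figures (a reconstruction, not the book's listing):

```ada
procedure put(qu: in out queue; x: in item) is
   p: pcell renames qu.p;
   q: pcell renames qu.q;
   r: pcell;
begin
   r:= new cell;
   r.all:= (x, null);
   if q = null then q:= r;   -- first case: the queue was empty
   else p.next:= r;          -- second case: link after the newest cell
   end if;
   p:= r;
exception
   when storage_error => raise space_overflow;
end put;
```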
The cell containing a, being now inaccessible, becomes garbage and the space of memory it occupies can be reused. The exception constraint_error will be raised if the queue is empty, as a result of attempting to dereference a null pointer.
If the queue contains only one element, then p must also be updated. The following figure illustrates the situation:

[Figure: rem_first on a single-element queue — initial and final states]
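A sketch of rem_first consistent with this discussion (a reconstruction; the conversion of constraint_error into bad_use follows the convention stated above):

```ada
procedure rem_first(qu: in out queue) is
   p: pcell renames qu.p;
   q: pcell renames qu.q;
begin
   q:= q.next;                      -- raises constraint_error if q=null
   if q = null then p:= null; end if; -- the queue became empty
exception
   when constraint_error => raise bad_use;
end rem_first;
```

The removed cell becomes garbage once q has advanced, since nothing points to it any longer.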
generic
   type item is private;
   max: natural:= 100;
package dqueue is
   -- Exactly as in section 2.4.1 but with pragma pure removed.
private
   type index is new integer range 0..max;
   type queue is
      record
         p, q: index:= 0;
      end record;
end dqueue;
Memory space management
The management of the memory space for queues is the same as for stacks. Therefore, the declarations of the types block and mem_space, as well as the variables ms and free, are exactly the same in the package body dqueue as in the package body dstack of section 2.2.4. The procedures prep_mem_space, get_block and release_block are also the same, as well as the invocation of prep_mem_space.
The subprograms empty and is_empty are so simple that they are self-explanatory:

procedure empty(qu: out queue) is
   p: index renames qu.p;
   q: index renames qu.q;
begin
   if q/=0 then release_queue(p, q); end if;
end empty;

function is_empty(qu: in queue) return boolean is
   q: index renames qu.q;
begin
   return q=0;
end is_empty;
The operation get_first is straightforward as well. The only significant consideration is that in this implementation the attempt to get the first element of an empty queue has the effect of consulting the ms array at position 0, which is out of its allowed range. Consequently, a constraint_error will be raised, which must be converted into a more explicit exception.
function get_first(qu: in queue) return item is
   q: index renames qu.q;
begin
   return ms(q).x;
exception
   when constraint_error => raise bad_use;
end get_first;
The operation put is achieved in three steps, as in the case of the implementation using access types. First, a block of available memory space must be obtained. Second, the new item must be stored in the obtained block. And third, the new block must be linked at the end of the queue. As in the implementation using access types, the link operation has two different cases, depending on whether the queue is initially empty or not.
For the case that the queue is initially empty, it is clear that the previous algorithm leads to:
For the case of a non-empty queue, after r:= get_block; ms(r):= (x, 0); the situation is:
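The three steps can be sketched as follows (a reconstruction consistent with the description; get_block is the procedure inherited from section 2.2.4):

```ada
procedure put(qu: in out queue; x: in item) is
   p: index renames qu.p;
   q: index renames qu.q;
   r: index;
begin
   r:= get_block;           -- step 1: obtain a free block
   ms(r):= (x, 0);          -- step 2: store the item; 0 plays the role of null
   if q = 0 then q:= r;     -- step 3a: the queue was empty
   else ms(p).next:= r;     -- step 3b: link after the newest block
   end if;
   p:= r;
end put;
```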
The operation rem_first also works in three steps. First, an auxiliary index keeps the position of the element to be deleted, because later we shall bring it back to the pool of free blocks. Then this element is actually removed from the queue, by making q point to the next element of the queue. As in the implementation using access types, if as a result q becomes 0, then p must become 0 as well. Finally, the released block can actually be brought back to the pool of free blocks because it is no longer accessible. The complete algorithm is:
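A sketch following the three steps just described (a reconstruction; release_block as in section 2.2.4):

```ada
procedure rem_first(qu: in out queue) is
   p: index renames qu.p;
   q: index renames qu.q;
   r: index;
begin
   r:= q;                         -- step 1: keep the deleted position
   q:= ms(q).next;                -- step 2: unlink the oldest element
   if q = 0 then p:= 0; end if;   --         the queue became empty
   release_block(r);              -- step 3: back to the pool of free blocks
exception
   when constraint_error => raise bad_use;  -- ms(0) is out of range
end rem_first;
```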
The following figures describe the case for a queue that initially contains
more than one component. Initially the situation is:
[Figure: a queue of several blocks, q pointing to the oldest and p to the newest]
Once the block pointed to by r has been brought back to the pool of free blocks, the final state is:
If the queue initially contains only one element, the initial situation is:
After r:= q; q:= ms(q).next; if q=0 then p:= 0; end if; the situation is:
[Figure: final state — p and q set to 0, with r still pointing to the released block]
Then the block pointed to by r goes back to the pool of free blocks and r vanishes because it is a local variable.

If the queue is initially empty, q = 0, and a constraint_error will be raised because this value is out of the range of ms.
2.5 Lists
2.5.1 Principles
A linked list is a linked structure where each component has at most one successor. Linked lists are a powerful mechanism for the implementation of many different abstract data types. They are so frequently used that they are often simply called lists. The linked implementations of stacks and queues are particular applications of lists. In chapters 4, 5 and 6 lists will be applied to the implementation of further data types.
However, a list is not, properly speaking, an abstract data type, because the essential aspect of lists is precisely the links, which are, intrinsically, an implementation issue. Any attempt to characterize the behavior of a list regardless of the links leads to the specification of either multisets, i.e. sets that can keep multiple instances of the same element, or (unbounded) strings of elements, i.e. multisets that keep the order of insertion and can be concatenated. However, multisets can be implemented either as linked lists or by other means, and it is rather unclear whether the other implementations of multisets have the same advantages from the point of view of memory management, flexibility and efficiency.
The application of lists to the implementation of stacks and queues has shown their advantages from the point of view of memory management. However, stacks and queues require that insertions and deletions are done only at the ends of the linked structure and, from the point of view of the operations, they are insufficiently illustrative of the advantages of linked lists. Therefore, it is convenient to devote some attention to other frequent operations on lists.
2.5.2 Usual operations

Insertion
Lists are very flexible for making insertions at arbitrary positions. Usually, this operation arises from the requirement to keep a set of elements in some order. Let us assume that the list is provided, as usual, by a pointer to the first element of the list as in the next figure, where the list may be empty, in which case first will be equal to null. Let us assume too that the elements in the list are sorted in some ascending order.
The insertion of a new element must proceed in three steps. First, the right position for the insertion is found. Second, a new cell is created to store the new element. And third, the created cell is linked into the list.

The first step is a search for the first element in the list that is greater than the element to be inserted. As the insertion has to be done before that position, say p, it is required to keep a pointer to the cell that is in the position previous to the current one, say pp (for previous to p).

The linkage step has to consider the possibility that the new element has to be inserted at the beginning of the list, either because the list was initially empty or because the new element is lower than any other one currently in the list.
The following figures illustrate the linkage in the case that pp/=null, assuming that the new cell must be inserted immediately before the cell containing c. Initially we have:
Then p, pp and r vanish because they are local variables. The reader should
check by himself that the algorithm also works when the insertion has to be
done either at the beginning or at the end of the list. In the first case pp=null
at the end of the search step. In the second case p=null at the end of the search
step.
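The three steps above can be sketched as follows, using the cell and pcell declarations of the previous sections (a reconstruction; the comparison "<=" on items is assumed to be available):

```ada
procedure insert(first: in out pcell; x: in item) is
   p, pp, r: pcell;
begin
   p:= first; pp:= null;
   while p /= null and then p.x <= x loop  -- search step: stop at the
      pp:= p; p:= p.next;                  -- first element greater than x
   end loop;
   r:= new cell'(x, p);                    -- creation step
   if pp = null then first:= r;            -- linkage step: at the beginning
   else pp.next:= r;                       -- or after the cell at pp
   end if;
end insert;
```

Note that the two special cases mentioned above need no extra code: pp=null covers the insertion at the beginning, and p=null makes the new cell the last one.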
Deletion
Lists are also well suited to deleting components from arbitrary positions. The usual requirement is to delete an element that satisfies some property PR. The removal must proceed in two steps. First, the cell containing the element to be deleted is found. Second, the links are rearranged so that that cell becomes excluded from the list.
The first step is a search for the first element in the list that satisfies PR. As for the insertion, it is required to keep a pointer to the cell that is in the position previous to the current one. In principle, we should consider the possibility that the search is unsuccessful, in which case the loop will end with p=null.
The linkage step has to consider the possibility that the element to be deleted
is the first one in the list.
The resulting algorithm is:
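A sketch of such an algorithm, assuming the property PR is available as a boolean function pr on items (a reconstruction, not the book's listing):

```ada
procedure delete(first: in out pcell) is
   p, pp: pcell;
begin
   p:= first; pp:= null;
   while p /= null and then not pr(p.x) loop  -- search step
      pp:= p; p:= p.next;
   end loop;
   if p /= null then                          -- linkage step
      if pp = null then first:= p.next;       -- delete the first cell
      else pp.next:= p.next;                  -- or bypass the found cell
      end if;
   end if;
end delete;
```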
Then p and pp vanish because they are local variables and the cell pointed
by p becomes garbage to be reused because it is no longer accessible.
Inversion
The problem of reversing the order of the elements stored in a list is illustrative
of the flexibility of this data structure. The algorithm is as simple as:
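A possible version, using an auxiliary pointer n to save the successor before each link is reversed (a reconstruction consistent with the figures):

```ada
procedure invert(first: in out pcell) is
   p, pp, n: pcell;
begin
   pp:= null; p:= first;
   while p /= null loop
      n:= p.next;    -- save the successor
      p.next:= pp;   -- reverse the link
      pp:= p;        -- advance both pointers
      p:= n;
   end loop;
   first:= pp;       -- the old last cell is the new first one
end invert;
```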
The following figures illustrate the step of the loop. At the beginning of the loop's body the state is:

[Figure: a partially reversed list — pp pointing to the already reversed part and p to the rest]
The reader should be aware that these steps also work at the beginning of the list (in which case pp=null) and at the end (in which case p.next=null). The principle of this algorithm is also used by the algorithms for garbage collection, which have to make recursive traversals of linked structures precisely when there is no memory space to allocate any stack, as described in section 8.3.3.
Concatenation
The concatenation of lists is very simple if the lists are implemented by pointers
to both ends of the list, as in
type list is
record
first, last: pcell;
end record;

Then the algorithm for concatenation (conceptually equivalent to a:= a·b; b:= λ) is:
procedure concatenate( a, b: in out list) is
c: list;
begin
if a.first=null then
c:= b;
elsif b.first=null then
c:= a;
else
a.last.next:= b.first;
c.first:= a.first;
c.last:= b.last;
end if;
a:= c; b:= (null, null);
end concatenate;
Indeed, the elements of a list can also be traversed and searched for, and the
algorithms are so obvious that at this stage the reader should be able to write
them down by himself.
Merge
Sorted lists can also be merged using algorithms very similar to those studied
for sequential files in section 8.5 of vol. II. However, the lists have the advantage
that the main data do not have to be copied. Only the pointers need to be
updated.
2.5.3 Variants
The principles of linear linking can be applied in a very flexible way. This section describes the most important patterns.
Multilists
Sometimes it is convenient that a cell containing some data belongs to more
than one list. For instance, a cell containing data about a book can belong to
the list of books of the same author, to the list of books of the same title and
to the list of books in ascending order of ISBN. In these cases, each cell has as
many pointers as lists it may belong to:
[Figure: a multilist cell for a book, with one pointer per list it belongs to]
The algorithms for constructing such lists are as simple as for conventional
lists: the cell is allocated once and then it is linked as many times as lists it
belongs to.
The algorithms for deleting a cell from a multilist require the traversal of
every list that cell belongs to because of the need to update the pointer of the
predecessor cells in every one of the lists. This drawback can be solved by the
use of doubly linked lists, which are explained below.
Doubly linked lists
Doubly linked lists have the advantage that we do not need to save the pointer to the previously visited cell in the algorithms for insertion and deletion. This is particularly important when they are used in multilist structures, because once a cell has been found it can be deleted from any list it belongs to with no traversal of these lists.
Doubly linked lists have some drawbacks too. They occupy more space and
the operations for insertion and deletion take more time because more pointers
have to be updated. However, we shall see in chapter 4 some applications in
which the advantages are more important than the drawbacks.
Circular lists
Sometimes it is convenient that the last element in a list points back to the first
one, as the figure below describes. In this case, the list is called circular.
Circular lists can be used in combination with multilists: when a cell is reached through the list according to some criterion (for instance, when the cell that contains data from a given book has been found by following the list in ascending order of ISBN), we may then need to obtain other cells related to the one found (for instance, books by the same author). Indeed, the book found through the ISBN list is not necessarily the first one in the list of books by the same author and, consequently, it is not possible to retrieve all of them unless the list of books by the same author is circular.
Sentinels
Lists can be extended with dummy cells at either end, which make some algorithms simpler. These dummy cells are called sentinels.

For instance, if a list which is sorted in ascending order is extended with a dummy cell at the end that contains a value that is higher than any of the possible values of the significant items, then the search and merge algorithms become as simple as those corresponding to sequential files described in chapter 8 of vol. II. The following figure illustrates the structure, where the sentinel appears shaded:
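The simplification can be seen in a search sketch over such a list (an illustration, assuming the cell declarations of the previous sections and a comparison "<" on items):

```ada
-- search in a sorted list ended by a sentinel whose value is higher
-- than any significant item: no test for null is needed in the loop.
function search(first: in pcell; x: in item) return pcell is
   p: pcell:= first;
begin
   while p.x < x loop   -- the sentinel guarantees termination
      p:= p.next;
   end loop;
   return p;            -- the cell containing x, or its insertion place
end search;
```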
A dummy cell at the beginning of a list makes the insertion and deletion algorithms simpler because the first (significant) cell is no longer a special case: it also has a predecessor cell (the sentinel). The figure below illustrates the structure, where the sentinel appears shaded:

[Figure: a list with a sentinel cell at the beginning, pointed to by first]
Indeed, sentinels consume memory space, and their use is more appropriate for lists that are expected to be so long that this amount is small in relation to the total amount consumed by the list.
Variants combined
All the previous variants can be combined in any way: lists can have sentinels either at one end or at both ends, can be circular and also doubly linked, can be circular and include a sentinel at the beginning too, etc. In all cases the algorithms that manage them are quite obvious and the reader can develop them by himself if he has a good understanding of the algorithms on linked structures described in the previous sections of this chapter.
2.5.4 Final remarks
The organization of data in linked lists is very useful for implementing very different kinds of types. In this chapter we have applied them to implement stacks and queues. In the next chapters they are applied again to implement sets, cartesian products and graphs.

From the point of view of reorganizing data, the use of links has the advantage that the actual data, which are usually large, do not have to be copied back and forth once they have been stored in a cell. Instead, any reorganization requires updating only the pointers, which is much more cost effective.
Lists can be used in very flexible ways and with plenty of variants. For this reason, it is often easier for a programmer to adapt the technique to each special circumstance than to use general purpose libraries, which will seldom be completely adapted to each particular case, just to save the rather simple operations on links.
This volume addresses, with a few exceptions, pointers in primary storage. Nevertheless, the concept of pointer can also be extended to secondary storage. In this case, a pointer can indicate either a record in a direct file, a physical page of a disk, or a record that is stored in some position on a physical page of a disk. Most algorithms studied in this book that manage linked data in primary storage can be easily extended to data in secondary storage, and the reader is expected to be able to make that adaptation by himself if he needs to.
Multilists, circular lists and doubly linked lists have been used extensively in the implementation of database systems based on either the hierarchical model or the network model. These kinds of lists are less extensively used nowadays as a consequence of the substitution of the hierarchical and network database models by new systems based on the relational model.
Chapter 3
TREES
3.1 Introduction
The concept of tree was introduced in section 3.7 of vol. III. The reader will find there the general terminology, the definitions of ordinary and binary trees, the way ordinary trees can be represented by binary trees, the recursive algorithms for the different kinds of traversals and the derivation of the equivalent iterative versions.

However, in vol. III, trees are considered as a conceptual tool to help the understanding and the analysis of algorithms, as well as to help the transformation of recursive algorithms into iterative ones in the case of compound recursion. No materialization of trees as a data structure was considered there.
In this volume trees are considered as true data structures that represent hierarchical data such as algebraic expressions, the results of syntax analysis of programming languages, directories of file systems or menus having submenus in graphical user interfaces. Indeed, the implementation of trees is addressed too.
3.2 Binary trees

3.3 Specification
Section 6.2.7 of vol. III provides the algebraic specification of binary trees, which in turn can be used to represent ordinary trees.

As for stacks and queues, efficiency considerations force us to implement most of the operations of that specification as procedures with parameters of mode in out instead of actual functions. Also, some exceptions must be declared, to be raised either in the case of attempts to execute operations upon illegal values of the type, or in the case of running out of memory space.

After these considerations, the package specification part should be as follows:
generic
   type item is private;
package dbinarytree is
   type tree is limited private;
   bad_use:        exception;
   space_overflow: exception;
3.4 Implementation
Indeed, the pointers can be either access types or cursors. The implementation by access types is described here, whereas its conversion to cursors is left as an exercise to the reader. The declaration of the type is:
generic
   type item is private;
package dbinarytree is
   -- Exactly as in section 3.3
private
   type node;
   type pnode is access node;
   type node is
      record
         x:    item;
         l, r: pnode;
      end record;
   type tree is
      record
         root: pnode;
      end record;
end dbinarytree;
The operations empty and is_empty are so simple that they are self-explanatory:
procedure empty(t: out tree) is
p: pnode renames t.root;
begin
p:= null;
end empty;
function is_empty(t: in tree) return boolean is
p: pnode renames t.root;
begin
return p=null;
end is_empty;
The operation for building a new tree having the provided item at the root
and the provided trees as left and right children is simple too:
procedure graft(t: out tree; lt, rt: in tree; x: in item) is
   p:  pnode renames t.root;
   pl: pnode renames lt.root;
   pr: pnode renames rt.root;
begin
   p:= new node;
   p.all:= (x, pl, pr);
exception
   when storage_error => raise space_overflow;
end graft;
The operation to get the data stored at the root node, and the operations to get the children, are straightforward as well. In all these operations, the exception constraint_error will be raised if p=null, i.e. if the actual parameter is an empty tree, and it must be converted into a more explicit one:
procedure root(t: in tree; x: out item) is
p: pnode renames t.root;
begin
x:=p.x;
exception
when constraint_error => raise bad_use;
end root;
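The operations for the children follow the same pattern. A sketch for the left child (the names left and lt are assumed here; the right child is symmetric):

```ada
procedure left(t: in tree; lt: out tree) is
   p:  pnode renames t.root;
   pl: pnode renames lt.root;
begin
   pl:= p.l;  -- dereferencing p=null raises constraint_error
exception
   when constraint_error => raise bad_use;
end left;
```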
3.5

3.5.1
As described in section 3.7 of vol. III, binary trees can be used to represent ordinary trees and, consequently, they are a universal tool for the implementation of hierarchical structures. Nevertheless, their implementation is so simple that the use of a general purpose package does not save much effort. Instead, in many cases it is more practical to adapt the concept of tree and its basic linking mechanisms to each particular circumstance. This section is devoted to developing a case study that illustrates the approach.
3.5.2

The exercise consists in reading an algebraic expression and a variable, and constructing an algorithm that writes the derivative of the expression with respect to that variable. The expression may contain variables, constants, operators and function calls. It may also contain parentheses up to any depth and it is expected to end with a semicolon.
The operators allowed are addition, subtraction, product and division, with their usual representation, plus the power operator, which is indicated by the symbol '**'. The functions allowed are exp, ln, sin and cos.
In order to make the algorithms simpler, we may assume that the variables all have single-letter names. Moreover, the input expression contains neither spaces nor any other separating characters. Besides these constraints, the input and the output must conform to the usual syntax of Ada expressions.
3.5.3 Expressions: specification

[Figure: the tree representation of an algebraic expression, with inner nodes for the operators + and * and the function cos, and leaves for the variables x, y, z and the constant 2]
package d_expr is
   type expression_type is (e_null, e_const, e_var, e_un, e_bin);
   type un_op  is (neg, sin, cos, exp, ln);
   type bin_op is (add, sub, prod, quot, power);
   type expression is private;
   function b_null return expression;
   function b_constant(n: in integer) return expression;
   function b_var(x: in character) return expression;
   function b_un_op (op: in un_op; e: in expression) return expression;
   function b_bin_op(op: in bin_op; e1, e2: in expression) return expression;
3.5.4
private
   type node;
   type expression is access node;
   subtype pnode is expression;
   type node(et: expression_type:= e_null) is
      record
         case et is
            when e_null =>
               null;
            when e_const =>
               val: integer;
            when e_var =>
               var: character;
            when e_un =>
               opun: un_op;
               esb:  pnode;
            when e_bin =>
               opbin: bin_op;
               esb1, esb2: pnode;
         end case;
      end record;
end d_expr;
The implementation of the operations is straightforward. In order to focus
on the essential aspects of the problem, the details of raising exceptions in the
case of lack of memory space or bad use have been omitted. It is clear that an
inconsistency between the type of an expression and the operands we expect to
get from it is a case of bad use, which in this implementation will have as a
consequence the raising of a constraint_error exception. The reader is referred
to chapter 2 for a detailed discussion of these aspects.
package body d_expr is
   function b_null return expression is
      p: pnode;
   begin
      p:= new node'(et=> e_null);
      return p;
   end b_null;

   function b_constant(n: in integer) return expression is
      p: pnode;
   begin
      p:= new node'(e_const, n);
      return p;
   end b_constant;

   function b_var(x: in character) return expression is
      p: pnode;
   begin
      p:= new node'(e_var, x);
      return p;
   end b_var;
function g_const(e: in expression) return integer is
begin
   return e.val;
end g_const;
function g_var(e: in expression) return character is
begin
   return e.var;
end g_var;

procedure g_un
  (e: in expression; op: out un_op; esb: out expression) is
begin
   op:= e.opun;
   esb:= e.esb;
end g_un;

procedure g_bin
  (e: in expression; op: out bin_op; esb1, esb2: out expression) is
begin
   op:= e.opbin;
   esb1:= e.esb1;
   esb2:= e.esb2;
end g_bin;
end d_expr;
3.5.5
Once we have discussed the type that is essential to solve the problem and its relation with trees, we can address the overall structure of the solution of the proposed exercise, to show the usefulness of the proposed type. Indeed, the main program must read the input expression from a file and build the appropriate internal representation. Then it has to obtain the derivative, simplify it and write the result down. Consequently, we need three variables of the type expression: one for the source, one for the derivative and one for the simplified result. The main program becomes:
with Ada.text_io, Ada.command_line, algebraic, d_expr;
use  Ada.text_io, Ada.command_line, algebraic, d_expr;
procedure manip_algeb is
   f:          file_type;
   e, de, sde: expression;
   x:          character; -- the variable to derive with respect to
begin
   open(f, in_file, argument(1)&".txt");
   read(f, e);
   close(f);
   x:= argument(2)(1);
3.5.6 Building expressions
The procedure read builds the expression from the text file following an algorithm that is very similar to the one studied in section 3.6.5 of vol. III to evaluate arithmetic expressions. That algorithm was made up of three mutually recursive procedures that dealt with expressions, terms and factors recursively. Each one provided as a result an output parameter that was the value of that expression, term or factor respectively.
As the source data for the derivation problem is very much the same, the
structure of the algorithm that reads the expression from the source file must be
similar too. In this case, four mutually recursive procedures are required because
the addition of the power operator involves one more level in the hierarchy. The
result that each one provides is an output parameter of type expression that
corresponds to the part read of the expression.
Only the relevant procedures are reproduced below because the auxiliary
operations add nothing to the understanding of trees we are addressing in this
chapter.
procedure read(f: in file_type; e: out expression) is
   c: character;
   procedure read_expr   (e: out expression);
   procedure read_term   (e: out expression);
   procedure read_factor (e: out expression);
   procedure read_factor1(e: out expression);
   function scan_id  return un_op;
   function scan_num return integer;

   procedure read_expr(e: out expression) is
      neg_term: boolean;
      op:       character;
      e1:       expression;
   begin
      neg_term:= c='-'; if neg_term then get(f, c); end if;
      read_term(e);
      if neg_term then e:= b_un_op(neg, e); end if;
      while c='+' or c='-' loop
         op:= c; get(f, c); read_term(e1);
         if op='+' then e:= b_bin_op(add, e, e1);
         else           e:= b_bin_op(sub, e, e1);
         end if;
      end loop;
   end read_expr;
procedure read_term(e: out expression) is
   op: character;
   e1: expression;
begin
   read_factor(e);
   while c='*' or c='/' loop
      op:= c; get(f, c); read_factor(e1);
      if op='*' then e:= b_bin_op(prod, e, e1);
      else           e:= b_bin_op(quot, e, e1);
      end if;
   end loop;
end read_term;
For all the procedures that read an entity (i.e. an expression, a term or a factor), the precondition is that c is the first character of that entity and the postcondition is that c is the first character that does not belong to the recognized entity; moreover, the output parameter is the internal representation of the recognized entity.
3.5.7 Deriving expressions
The rules for deriving the expressions proposed in the statement of the problem are:

Dx c             = 0                            if c is a constant
Dx y             = 0                            if y /= x
Dx x             = 1
Dx (-f(x))       = -Dx f(x)
Dx sin(f(x))     = cos(f(x)) * Dx f(x)
Dx cos(f(x))     = -sin(f(x)) * Dx f(x)
Dx e**f(x)       = e**f(x) * Dx f(x)
Dx ln f(x)       = (1/f(x)) * Dx f(x)
Dx (f(x) + g(x)) = Dx f(x) + Dx g(x)
Dx (f(x) - g(x)) = Dx f(x) - Dx g(x)
Dx (f(x) * g(x)) = (Dx f(x)) * g(x) + (Dx g(x)) * f(x)
Dx (f(x) / g(x)) = ((Dx f(x)) * g(x) - (Dx g(x)) * f(x)) / (g(x))**2
Dx (f(x))**c     = c * (f(x))**(c-1) * Dx f(x)  if c does not contain x
Dx (f(x))**g(x)  = Dx (e**(g(x) * ln(f(x))))
The features of the internal representation for expressions allow these rules to be directly applied and make the algorithm for derivation straightforward. Indeed, the rules to apply depend on the kind of expression (i.e. constant, variable, rooted by a unary operator or rooted by a binary operator):

function derive(e: in expression; x: in character) return expression is
   de: expression;
begin
   case e_type(e) is
      when e_null  => de:= b_null;
      when e_const => de:= b_constant(0);
      when e_var   => de:= derive_var(e, x);
      when e_un    => de:= derive_un (e, x);
      when e_bin   => de:= derive_bin(e, x);
   end case;
   return de;
end derive;
where
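A sketch of derive_var, consistent with the rules Dx x = 1 and Dx y = 0 for y /= x (a reconstruction using the observers and constructors of d_expr):

```ada
function derive_var(e: in expression; x: in character)
   return expression is
begin
   if g_var(e) = x then return b_constant(1);  -- Dx x = 1
   else return b_constant(0);                  -- Dx y = 0, y /= x
   end if;
end derive_var;
```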
   when prod =>
      a:= b_bin_op(prod, desb1, esb2);
      b:= b_bin_op(prod, desb2, esb1);
      de:= b_bin_op(add, a, b);
   when quot =>
      a:= b_bin_op(prod, desb1, esb2);
      b:= b_bin_op(prod, desb2, esb1);
      c:= b_bin_op(sub, a, b);
      d:= b_constant(2);
      d:= b_bin_op(power, esb2, d);
      de:= b_bin_op(quot, c, d);
   when power =>
      if not contains(esb2, x) then
         a:= b_constant(1);
         a:= b_bin_op(sub, esb2, a);
         a:= b_bin_op(power, esb1, a);
         a:= b_bin_op(prod, esb2, a);
         de:= b_bin_op(prod, a, desb1);
      else -- f(x)**g(x) = exp(ln(f(x)**g(x))) = exp(g(x)*ln(f(x)))
         a:= b_un_op (ln, esb1);
         a:= b_bin_op(prod, esb2, a);
         a:= b_un_op (exp, a);
         de:= derive(a, x);
      end if;
   end case;
   return de;
end derive_bin;
3.5.8 Simplifying expressions
The expression resulting from the derivation process should be simplified, but simplifying an expression is, in principle, a complex matter because there is no absolute concept of a normal form, so the equivalent non-reducible form is not unique. Moreover, some kinds of reduction imply the application of the distributive and commutative properties, which may lead to non-terminating algorithms. As this exercise only aims to illustrate the usefulness of trees and how easy it is to use them, only the following elementary rules are applied:
-(-f(x))   ->  f(x)
-0         ->  0
sin(0)     ->  0
cos(0)     ->  1
e**0       ->  1
ln(1)      ->  0
0 + f(x)   ->  f(x)
f(x) + 0   ->  f(x)
c1 + c2    ->  c3
0 - f(x)   ->  -f(x)
f(x) - 0   ->  f(x)
c1 - c2    ->  c3             (if c1 >= c2)
c1 - c2    ->  -c3            (if c1 < c2)
0 * f(x)   ->  0
f(x) * 0   ->  0
1 * f(x)   ->  f(x)
f(x) * 1   ->  f(x)
c1 * c2    ->  c3
0 / f(x)   ->  0
f(x) / 1   ->  f(x)
0**f(x)    ->  0
(f(x))**0  ->  1
1**f(x)    ->  1
(f(x))**1  ->  f(x)

where c1, c2 and c3 stand for constants.
The features of the internal representation for expressions allow the application of these rules to be as direct as the rules for derivation, and the following algorithm results, which recursively simplifies the subordinated expressions first and then applies the rules above to the result:

function simplify(e: in expression) return expression is
   s: expression;
begin
   case e_type(e) is
      when e_null | e_const | e_var => s:= e;
      when e_un  => s:= simplify_un (e);
      when e_bin => s:= simplify_bin(e);
   end case;
   return s;
end simplify;
   when sin =>
      if e_type(esb)=e_const and then g_const(esb)=0 then
         s:= esb;            -- sin(0) -> 0
      else
         s:= b_un_op(uop, esb);
      end if;
   when cos =>
      if e_type(esb)=e_const and then g_const(esb)=0 then
         s:= b_constant(1);  -- cos(0) -> 1
      else
         s:= b_un_op(uop, esb);
      end if;
   when exp =>
      if e_type(esb)=e_const and then g_const(esb)=0 then
         s:= b_constant(1);  -- exp(0) -> 1
      else
         s:= b_un_op(uop, esb);
      end if;
   when ln =>
      if e_type(esb)=e_const and then g_const(esb)=1 then
         s:= b_constant(0);  -- ln(1) -> 0
      else
         s:= b_un_op(uop, esb);
      end if;
   end case;
   return s;
end simplify_un;
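The binary case follows the same pattern. A fragment of what simplify_bin might look like, for the rules concerning addition (the names bop, esb1 and esb2 are assumed by analogy with simplify_un):

```ada
when add =>
   if e_type(esb1) = e_const and then g_const(esb1) = 0 then
      s:= esb2;                                     -- 0 + f(x) -> f(x)
   elsif e_type(esb2) = e_const and then g_const(esb2) = 0 then
      s:= esb1;                                     -- f(x) + 0 -> f(x)
   elsif e_type(esb1) = e_const and e_type(esb2) = e_const then
      s:= b_constant(g_const(esb1) + g_const(esb2)); -- c1 + c2 -> c3
   else
      s:= b_bin_op(bop, esb1, esb2);                -- no rule applies
   end if;
```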
Indeed, the simplification algorithm can be more sophisticated than this one. For instance, this algorithm is unable to convert 7*(-3) into -21, nor to reduce according to e**(ln x) = x or ln e**x = x, but the reader should be able to extend the algorithm to address such reductions by himself.
In contrast, there is a relevant matter regarding the simplification rules based on the equality of expressions, such as:

f(x) - f(x)  ->  0
f(x) / f(x)  ->  1
It is clear that the equality test for the type expression provided by the compiler will compare the pointers, which is not the appropriate comparison, because the result will be true only if the two pointers refer to the same tree. Instead, what is required for simplification purposes is whether two different trees are algebraically equivalent or not.

There are two ways to prevent a user of the package d_expr from a wrong use of the equality test provided by the compiler. One is to declare the type expression as limited private. However, this choice implies that the assignment is forbidden too, something that would make the derivation and simplification algorithms quite uncomfortable to read. The second way is to keep the type expression as (non limited) private, and override the equality test by adding:

function "="(e1, e2: in expression) return boolean;
3.5.9 Writing expressions
The output of the expression is essentially an inorder traversal of the tree. For
binary operators, the left subexpression must be written first, then the operator
and finally the right subexpression. For unary operators, the operator must be
written first and then the subordinated expression.
So the algorithm is quite simple and the only difficulty is to avoid unnecessary parentheses. A look at the figure below shows that this decision requires a detailed case analysis: it has to do with the precedence of the operators, the rules for the left operand are different from the rules for the right operand, and some unary operators always require parentheses, whereas others do not.
1The concept of undecidability is explained in chapter 11 of vol. II.
[Figure: expression trees and the corresponding output — a+b*c, a*(b+c), a+b-c, a-(b+c), a+b+c, -a and sin(a) — showing when parentheses are required]
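The skeleton of the traversal can be sketched as follows. This simplified version parenthesizes every binary subexpression instead of performing the case analysis above, so its output is correct but not minimal; symbol is a hypothetical helper returning "+", "-", "*", "/" or "**", and put is from Ada.Text_IO:

```ada
procedure write(e: in expression) is
begin
   case e_type(e) is
      when e_null  => null;
      when e_const => put(integer'image(g_const(e)));
      when e_var   => put(g_var(e));
      when e_un    =>
         declare
            op: un_op; esb: expression;
         begin
            g_un(e, op, esb);
            put(un_op'image(op));  -- the image attribute is upper case;
            put('(');              -- a lookup table would give "sin" etc.
            write(esb); put(')');
         end;
      when e_bin   =>
         declare
            op: bin_op; esb1, esb2: expression;
         begin
            g_bin(e, op, esb1, esb2);
            put('('); write(esb1);
            put(symbol(op));
            write(esb2); put(')');
         end;
   end case;
end write;
```

Refining this sketch to suppress the redundant parentheses, following the precedence rules illustrated by the figure, is the interesting part of the exercise.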
3.5.10
An important amount of memory could be saved if the data structure to represent expressions stored common subexpressions only once. For instance, the expression (a + b) * (a + b) can be stored by a structure in which the root node has its left and right pointers referring to the same node that represents the common subexpression, i.e.:
[Figure: the expression (a+b)*(a+b) stored with the subexpression a+b shared
between the two children of the root, instead of duplicated.]
In order to avoid the allocation of space for nodes that already exist, it is
required that the allocation algorithm have easy access to all the currently created
nodes. This can be more easily achieved if the memory space is organized by
means of cursors instead of access types. If so, the request for the allocation
of a new node is preceded by a search for a preexisting node with the same
contents. If found, the pointer to the preexisting node is used, instead of actually
allocating a new one.
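A sketch of such an allocation algorithm, written here in C with a plain linear search (the block layout and names are illustrative; the Ada version raises space_overflow where this sketch asserts, and refines the search with the sentinel technique described below):

```c
#include <assert.h>

#define MAX_NODES 200

/* A simplified block: an operator (or a variable letter) and two cursors. */
struct block {
    char op;           /* operator, or a variable letter for leaves */
    int left, right;   /* cursors (indices) of the subexpressions; 0 = none */
};

static struct block pool[MAX_NODES + 1]; /* position 0 is unused */
static int ne = 1;                       /* next free position */

/* Returns the cursor of a block equal to b, allocating a new one only if
   no equal block already exists, so equal subexpressions are shared. */
int allocate(struct block b)
{
    for (int i = 1; i < ne; i++)
        if (pool[i].op == b.op && pool[i].left == b.left
                               && pool[i].right == b.right)
            return i;            /* share the preexisting node */
    assert(ne <= MAX_NODES);     /* space_overflow in the Ada version */
    pool[ne] = b;
    return ne++;
}
```

Allocating (a+b)*(a+b) this way creates the node for a+b only once, and the product node refers to it twice.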
According to this principle, the new version of the specification part of package d_expr is:
package d_expr is
   ... -- Exactly as in section 3.5.4.
private
   max: constant integer:= 200;
   type expression is new integer range 1..max;
   subtype index is expression;
   type block(et: expression_type:= e_null) is
      record
         case et is
            when e_null =>
               null;
            when e_const =>
               val: integer;
            when e_var =>
               var: character;
The management of the memory space is simple because the blocks are never
released, so the structure only grows. Consequently, a counter with the number
of used blocks is sufficient and the memory management part of the body of
d_expr looks as:

begin
   ne:= 1;
end d_expr;
The allocation algorithm is a simple linear search that uses the sentinel
technique. The algorithms for building expressions are straightforward too:
function b_null return expression is
   b: block; e: index;
begin
   b:= block'(et => e_null);
   allocate(b, e);
   return e;
end b_null;

function b_constant(n: in integer) return expression is
   b: block; e: index;
begin
   b:= (e_const, n);
   allocate(b, e);
   return e;
end b_constant;
procedure g_bin(e: in expression; op: out bin_op;
                esb1, esb2: out expression) is
begin
   op:= s(e).opbin;
   esb1:= s(e).esb1;
   esb2:= s(e).esb2;
end g_bin;
3.5.11
Final remarks
Several lessons can be drawn from the development of this long exercise. First,
we have seen the usefulness of trees to represent and manage hierarchical data
such as an arithmetic expression. Trees can be used to manage any other type of
hierarchical data in a similar way, such as syntax trees of programming languages,
which are widely used by compilers, decision trees used in artificial intelligence,
file directories used by operating systems and component hierarchies managed
by graphic user interfaces.
Second, we have seen that any hierarchical structure can be represented
either by a binary tree or by a specifically tailored tree, the latter choice having
several advantages from the point of view of readability and strong type checking
by the compiler.
Third, that neither the use of a hierarchy nor the implementation of a
tree-structured data type has special difficulties.
And fourth, that a clear separation between specification and implementation
allows us to safely improve the implementation of a type. For instance,
changing the implementation of the type expression from that provided in
section 3.5.4 to that provided in section 3.5.10 does not require the update of
package algebraic even by a single comma. The same will happen if the
implementation is further improved by using the techniques described in section 4.11
to make the building operations O(n). This principle is essential to make high
quality software that is easy to maintain, adapt and improve.
The fact that some of the subtrees are shared makes the implementation of
expressions look like directed acyclic graphs, a structure that will be developed
in chapter 6. In practice, however, they are actually trees, to the extent that an
abstract data type is characterized by its external operations. In the case of
expressions, the traversals for deriving or for writing imply a complete traversal
of all the branches, as for a tree, no matter whether they have already been traversed
as a consequence of being shared. Instead, the algorithms for the traversal of a
graph do not visit any node twice.
3.6
There are other ways of structuring data that share some aspects with trees.
For instance, a binary tree can be stored in an array on a positional basis: the
root is at position 1 and the left and right children of a node at position i are
placed at position 2i and 2i + 1 respectively:
[Figure: a binary tree stored positionally in an array.]
In this case, given a node it is easy and efficient to retrieve both its children
and its parent. Instead, the approach is ill suited to build trees according to the
operations specified in section 6.2.7 of vol. III (see also section 3.3 of this vol.).
Moreover, it leads to an unaffordable waste of space unless the tree is completely
balanced. However, this representation is best suited to implement priority queues,
as will be shown in section 5.2.
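With this positional convention, navigation reduces to index arithmetic, e.g. in C:

```c
/* Positional storage: the root is at index 1, and the children of the
   node at index i are at 2*i and 2*i + 1. */
static inline int left_child (int i) { return 2 * i;     }
static inline int right_child(int i) { return 2 * i + 1; }
static inline int parent     (int i) { return i / 2;     } /* integer division */
```

Both directions of navigation are O(1), which is precisely what makes this representation attractive for the priority queues of section 5.2.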
Another possibility, that can be used for the recording of ordinary trees, is
to keep the nodes in an array, with a pointer (i.e. an index) to the parent.

[Figure: the nodes of a tree kept in an array, each with the index of its parent.]
In this case, it is easy and efficient to follow the paths from one node towards
the root, but the structure is ill suited to retrieve the children of a node.
In either case, strictly speaking, these structures can not be considered as
implementations of trees, to the extent that the operations they provide are
different from those that specify the type tree (binary or ordinary). Instead,
these structures appear as solutions for the implementation of data types such
as priority queues or equivalence relations that, in principle, have nothing to
do with a hierarchy of dependencies. They will be studied with respect to the
problems they can be applied to in chapter 5. Indeed, some general properties
of trees, in particular, the relationship between the height and the total number
of nodes, will be useful to discuss the cost of the involved operations.
Chapter 4
4.2
Specification
The specification part of a package that implements the type set (actually a
dictionary, as we do not address union, intersection and difference for the present)
is:
generic
   type elem is private;
package d_set is
   type set is limited private;
   space_overflow: exception;
   procedure empty   (s: out set);
   procedure put     (s: in out set; x: in elem);
   function  is_in   (s: in set; x: in elem) return boolean;
   procedure remove  (s: in out set; x: in elem);
   function  is_empty(s: in set) return boolean;
private
   -- Implementation dependent.
end d_set;
Some applications require an additional operation to select an arbitrary element
from a non empty set. This kind of operation is discussed in section 4.12.
For a mapping, the specification part of the package is:
generic
   type key is private;
   type item is private;
package d_set is
   type set is limited private;
   already_exists: exception;
   does_not_exist: exception;
   space_overflow: exception;
   procedure empty (s: out set);
   procedure put   (s: in out set; k: in key; x: in item);
   procedure get   (s: in set; k: in key; x: out item);
   procedure remove(s: in out set; k: in key);
   procedure update(s: in out set; k: in key; x: in item);
private
   -- Implementation dependent.
end d_set;

4.3
4.3.1
When the elements of a set belong to a discrete type, and the sets are expected
to contain almost as many elements as possible values of its range, the most
practical representation of a set is an array of boolean components in which
a(x) = true means that x ∈ A.
These conditions are usual for algorithms that manage sets of states from
finite automata, such as for the automatic generation of lexical analyzers, and
also for algorithms that manage sets of (numbered) variables in the code
optimization part of a compiler.
Unless otherwise stated, a compiler will use more than one bit to store
a boolean value, usually a byte or two, depending on the addressing mode
of the underlying computer. However, if memory saving is more important
than execution time, we can force the compiler to store the array of boolean
components on the basis of a single bit per component using the pragma pack.
This pragma forces the compiler to use the minimum possible memory space
to allocate a type. Indeed, if this pragma is used, the code to access a single
component of the array will be slower, but still O(1). According to this principle,
the specification part of the package becomes:
generic
   type elem is (<>); -- Restricts elem to be a discrete type.
package d_set is
   ... -- Exactly as in section 4.2 (sets).
private
   type set is array(elem) of boolean;
   pragma pack(set);
end d_set;
The implementation of the operations is simple too:
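For readers who prefer to see the bit-level effect of the packing, here is a C sketch that stores one element per bit, as pragma pack would (the range and names are illustrative):

```c
#include <stdbool.h>
#include <string.h>
#include <limits.h>

/* A set over the discrete range 0 .. NELEM-1, one bit per element. */
#define NELEM 64
typedef unsigned char set[(NELEM + CHAR_BIT - 1) / CHAR_BIT];

void empty(set s)              { memset(s, 0, sizeof(set)); }
void put(set s, int x)         { s[x / CHAR_BIT] |=  (unsigned)(1u << (x % CHAR_BIT)); }
void remove_elem(set s, int x) { s[x / CHAR_BIT] &= (unsigned char)~(1u << (x % CHAR_BIT)); }
bool is_in(const set s, int x) { return (s[x / CHAR_BIT] >> (x % CHAR_BIT)) & 1u; }

bool is_empty(const set s)
{
    for (size_t i = 0; i < sizeof(set); i++)
        if (s[i] != 0) return false;
    return true;
}
```

Each membership operation is a shift and a mask, so it is still O(1), only slightly slower than indexing a byte-per-component array.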
4.3.2
Mappings
If additional data has to be associated to the elements stored in the set, the
representation of section 4.3.1 has to be extended with an array of items:
generic
   type key is (<>); -- Restricts key to be a discrete type.
   type item is private;
package d_set is
   ... -- Exactly as in section 4.2 (mappings).
private
   type existence is array(key) of boolean;
   type contents  is array(key) of item;
   type set is
      record
         e: existence;
         c: contents;
      end record;
end d_set;
4.4
Arrays of elements
4.4.1
Overview
It may happen that the range of the keys is much larger than the actual number
of elements to be stored in the sets. For instance, if the keys are passport
numbers and the sets have to store basketball teams or to keep data about their

[Figure: the n elements currently in the set stored in the first positions of an
array of size max.]

where n is the number of elements currently in the set and max is the maximum
number of elements that a set may contain.
It may happen that the type of the keys has an ordering relationship.
If so, the elements can be stored in ascending order, providing faster means for
retrieval.
4.4.2
Unsorted arrays
4.4.3
Sorted arrays
The bodies of the operations for sorted arrays are also simple and they are left
as an exercise to the reader too. It is clear that empty is O(1). The operation put
is O(n) and more time consuming than in the unsorted version: after checking
that there is no element in the set with the same key, and locating the position
to make the insertion, which takes O(log n) time if binary search is used (see
section 5.3.2 of vol. II), all the elements having a key greater than that of
the inserted element have to be shifted right. The operations get and update
are O(log n) because the position of the element can be found by binary search,
and remove is O(n) too, because the position of the element to be removed can
be located by binary search, but then the elements with keys greater than that
of the removed element have to be shifted left.
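For instance, the operation get can be sketched in C as follows (the types are illustrative; the Ada version raises does_not_exist where this sketch returns a flag):

```c
#include <stdbool.h>

#define MAX 100

/* A mapping stored as parallel arrays, sorted by ascending key. */
struct mapping {
    int n;           /* number of stored pairs */
    int key[MAX];    /* sorted in ascending order */
    int item[MAX];
};

/* Binary search: fills *x and returns true when k is present, O(log n). */
bool get(const struct mapping *s, int k, int *x)
{
    int lo = 0, hi = s->n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;
        if      (k < s->key[mid]) hi = mid - 1;
        else if (k > s->key[mid]) lo = mid + 1;
        else { *x = s->item[mid]; return true; }
    }
    return false; /* does_not_exist in the Ada version */
}
```

The same search locates the insertion or removal position for put and remove; what makes those operations O(n) is only the subsequent shifting.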
4.5
4.5.1
Linked lists
Overview
[Figure: a linked list of cells pointed to by first.]
This kind of implementation does not require foreseeing any upper bound for
the size of a set, and each set occupies no more memory space than is required.
As in the case of arrays, it may happen that the type of the keys has an
ordering relationship. If so, the elements can be stored in sorted order too
(usually ascending) but, as binary search can not be applied, no matter whether
the list is sorted or not, empty is O(1) and the other operations are O(n). Nevertheless,
a sorted implementation may be convenient to achieve that the operations of
union, intersection and difference become efficient (see section 4.13), or simply
to retrieve an ordered listing of the elements (see section 4.12).
The strong similarity between sorted and unsorted lists makes it pointless to
describe both in detail. So, for unsorted lists only the basic hints are provided.
4.5.2
Unsorted lists
The insertion of a new element into an unsorted list is most easily done at
the beginning of the list, like pushing an element on top of a stack (see
section 2.2.3). Naturally, the list must be scanned first to avoid duplicated
keys. A linear scan is also required for the operations get, remove and update.
4.5.3
Sorted lists
type cell is
   record
      k:    key;
      x:    item;
      next: pcell;
   end record;
type set is
   record
      first: pcell;
   end record;
end d_set;
The operation to make an empty set is trivial:
with k' < k < k''. After the allocation and linking phases, the state becomes

[Figure: the new cell linked between the cells with keys k' and k''.]

The variables p, pp and q, being local, will vanish on completion of the
procedure.
The algorithms to consult and update items are straightforward and
self-explanatory:
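In C, the consult operation on such a sorted list can be sketched as follows (the cell type is an illustrative counterpart of the Ada declarations above):

```c
#include <stdbool.h>
#include <stddef.h>

struct cell {
    int key;
    int item;
    struct cell *next;
};

/* The list is sorted in ascending key order, so the scan can stop as
   soon as a key greater than or equal to k is reached. */
bool get(const struct cell *first, int k, int *x)
{
    const struct cell *p = first;
    while (p != NULL && p->key < k)
        p = p->next;
    if (p != NULL && p->key == k) { *x = p->item; return true; }
    return false;
}
```

Note that the sorted order only lets an unsuccessful search stop early; the worst case remains O(n), since binary search is impossible on a linked list.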
On return from the procedure, the local variables p and pp vanish. Consequently,
the cell containing key k is no longer accessible and becomes garbage,
i.e. the space it occupies may be reused.
4.6
Binary search trees
Overview
Linked lists have the advantage over arrays that they do not require foreseeing a
size that may become either too constraining or too wasteful. Moreover, sorted
linked lists have the advantage over sorted arrays that the stored elements do
not have to shift right or left on insertions and removals, something that can
be very time consuming when the size of the type item is large. Nevertheless,
sorted linked lists have the drawback, in contrast with sorted arrays, that only
linear search is possible and, in any case (linked lists or arrays, sorted or not),
the operations of insertion and removal take O(n) time, which is unacceptable
for large sets.
Binary search trees are an organization technique that can be applied to
keys that have a defined order relation. They provide the same flexibility as
linked lists, from the point of view of the sizing constraints, but insertion and
removal can be much more efficient, and searching to retrieve or to update the
stored data can be as efficient as for sorted arrays.
The principle is simple: the set is represented by a binary tree in which
each node contains one stored element (in the case of mappings, a key plus
its associated item). Any node p of the tree satisfies the property that all the
elements stored in its left child have a key lower than p.k and all the elements
in its right child have a key greater than p.k.
If the data are organized this way, the search can be very efficient because
either the key we are looking for is equal to that of the root or we can discard
one of the two children, and so on recursively.
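This search principle can be sketched in C as follows (illustrative types; the keys are compared with < and > exactly as described):

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
    int key;
    int item;
    struct node *lc, *rc; /* left and right children */
};

/* Each comparison either finds the key or discards one of the two
   subtrees, so the cost is proportional to the height of the tree. */
bool get(const struct node *p, int k, int *x)
{
    while (p != NULL) {
        if      (k < p->key) p = p->lc;
        else if (k > p->key) p = p->rc;
        else { *x = p->item; return true; }
    }
    return false;
}
```

On a reasonably balanced tree the height is O(log n), which is where the efficiency comparable to sorted arrays comes from.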
The type declarations
The generic and private parts of package d_set to implement mappings by means
of binary search trees are:
generic
   type key is private;
   type item is private;
   with function "<"(k1, k2: in key) return boolean;
   with function ">"(k1, k2: in key) return boolean;
package d_set is
   ... -- Exactly as in section 4.2 (mappings).
private
   type node;
   type pnode is access node;
   type node is
      record
         k:      key;
         x:      item;
         lc, rc: pnode;
      end record;
   type set is
      record
         root: pnode;
      end record;
end d_set;
In principle, only the function "<" should be required because, the order
relation being assumed to be total, ">" is equivalent to not "less
than" and not equal. However, if the function ">" is used too, the inherent
symmetry of the algorithms becomes textually reflected, which improves their
readability.
Empty
The empty set is trivially represented by an empty tree:

procedure empty(s: out set) is
   root: pnode renames s.root;
begin
   root:= null;
end empty;
Put
The addition of a new element to the set has two cases. The first case is that the
tree (or subtree) is empty, in which case a single-node tree (subtree) is allocated.
The second case is that the tree (or subtree) is non empty. Then three subcases
appear depending on whether the key to be inserted is equal to, greater than or
lower than the key of the tree's (or subtree's) root. If equal, the attempt of insertion
creates a conflict. If it is lower, it is clear that the new element has to be inserted
in the left subtree and, if it is greater, in the right subtree.
Indeed, inside the package body d_set the algorithms must see the pointers
explicitly. For this reason the procedure put that is externally visible and receives
a set invokes a local procedure that receives a pointer to do the job. One way
to solve the potential inefficiency that may derive from this duality is to apply
the pragma inline to the externally visible procedure (after the renaming of
the internal one). A better approach is mentioned below.
procedure put(s: in out set; k: in key; x: in item) is
   root: pnode renames s.root;
begin
   put(root, k, x);
end put;
where
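The internal procedure follows exactly the case analysis above. A C sketch, using a pointer to pointer where the Ada version passes the pnode in out, and error codes where Ada raises already_exists or space_overflow:

```c
#include <stdlib.h>

struct node {
    int key;
    int item;
    struct node *lc, *rc;
};

/* Returns 0 on success, -1 on a duplicate key or allocation failure. */
int put(struct node **p, int k, int x)
{
    if (*p == NULL) {                       /* empty subtree: allocate a node */
        struct node *n = malloc(sizeof *n);
        if (n == NULL) return -1;           /* space_overflow in Ada */
        n->key = k; n->item = x; n->lc = n->rc = NULL;
        *p = n;
        return 0;
    }
    if (k < (*p)->key) return put(&(*p)->lc, k, x); /* insert on the left  */
    if (k > (*p)->key) return put(&(*p)->rc, k, x); /* insert on the right */
    return -1;                              /* already_exists in Ada */
}
```

The recursion always descends along a single root-to-leaf path, so the cost of an insertion is proportional to the height of the tree.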
Remove
The elimination of a node from a binary tree requires a more careful analysis.
First, the node containing the key to be eliminated has to be found, which is
as easy as for the operations get and update. Once the node has been found, if
it is a leaf, the elimination is quite simple: it is sufficient to replace the pointer
currently addressing that node by null.
If the node to be deleted (i.e. having key k) has only one child, the deletion
is simple too. If it is a left child of a parent with key kp, as in the figure below,
then any node of k's unique child (either left or right) has a key lower than kp.
Therefore, this only child can become the new left child of the node having key
kp. If the node to be deleted is a right child, the discussion is symmetrical. In
either case, the pointer currently addressing the node having key k has to be
updated to address the only child of the node to be deleted.
If the node to be deleted has two children, the situation is more complicated
because some node must still be in that position to keep the tree's structure.
Therefore, it has to be replaced by some other node. There are two candidates:
the leftmost node of its right child and the rightmost node of its left child.
These candidates have two properties. First, they have at most one child, and
so they can be easily dropped off. And second, they are greater than any other
element in the left subtree of the node to be removed and smaller than any other
element in the right subtree of the node to be removed. Being symmetric, there
is no preference for either of the two choices and we shall choose, arbitrarily, the
leftmost element of the right subtree.
The following figure illustrates the two phases of the deletion in this case.
First, the leftmost element of the right subtree is dropped off and reserved.
Second, this node replaces the node to be deleted. This can be done either by
copying its contents and keeping the links unchanged, or by changing only the links.
In principle, the second option is preferable because the size of a pointer is
usually much smaller than the size of the stored data.
The actual remove considers the three cases described above: the node is a leaf, it
has only one child (which can be either left or right), or it has two children. In
the latter case, the leftmost node of the right child is removed by rem_lowest.
On return from this call, the variable plowest points to the node just removed
from the right child. Then the pointers are adjusted so that this node replaces
the node to be deleted in the tree structure.
procedure actual_remove(p: in out pnode) is
-- Prec.: k = p.k
   plowest: pnode;
begin
   if    p.lc =null and p.rc =null then
      p:= null;
   elsif p.lc =null and p.rc/=null then
      p:= p.rc;
   elsif p.lc/=null and p.rc =null then
      p:= p.lc;
   else -- p.lc/=null and p.rc/=null
      rem_lowest(p.rc, plowest);
      plowest.lc:= p.lc; plowest.rc:= p.rc; p:= plowest;
   end if;
end actual_remove;
Indeed, if p.lc=null then p:= p.rc; works whether p.rc is null or not.
So, the previous algorithm can be slightly optimized by evaluating two boolean
expressions instead of three. We have maintained the four cases explicitly to
keep an exact correspondence between the case analysis and the algorithm.
[Figure: degenerate binary search trees produced by the insertion orders
1,2,3,4,5; 5,4,3,2,1; 1,5,2,4,3 and 5,1,2,4,3.]
For these degraded cases, the binary search trees work as badly as unsorted
linked lists. All the operations, but empty, are O(n).
It is easy to prove that the number of input orderings leading to these
degraded cases is 2^(n-1), whereas the total number of possible orderings is n!.
As lim_{n→∞} 2^(n-1)/n! = 0, we might expect that the problem is not serious for large
values of n. Nevertheless, the fact that some input orderings lead to a degraded
tree is particularly worrying because, in principle, we may not make any
assumption about them. The next two sections are devoted to improvements of
binary search trees that avoid such degraded cases.
4.7
Balanced trees I: AVL
Overview
A binary search tree is said to be completely balanced if, for each node, the
number of nodes of its left and right subtrees differ at most by 1. Unfortunately,
type node is
   record
      k:      key;
      x:      item;
      bl:     balance_factor;
      lc, rc: pnode;
   end record;
type set is
   record
      root: pnode;
   end record;
end d_set;
Put
As for simple binary search trees, the operation put is implemented by an internal one. However, in this case, the internal put operation has an additional out
boolean parameter h. It indicates that, as a result of the insertion, the height
of the tree has increased (by one, indeed).
¹The prefix AVL comes from the inventors of this technique: Adelson-Velskii and Landis.
The actual insertion algorithm is essentially the same as for binary search
trees. The differences are that h is set to true after converting an empty tree
into a single node tree (because the previous height was 0 and the new one is 1),
and that on return from the insertion on a subtree, if the subtree's height has
increased, balancing must be considered.
procedure put(p: in out pnode; k: in key; x: in item; h: out boolean) is
begin
   if p=null then
      p:= new node; p.all:= (k, x, 0, null, null);
      h:= true;
   else
      if k<p.k then
         put(p.lc, k, x, h);
         if h then balance_left(p, h, insert_mode); end if;
      elsif k>p.k then
         put(p.rc, k, x, h);
         if h then balance_right(p, h, insert_mode); end if;
      else -- k=p.k
         raise already_exists;
      end if;
   end if;
exception
   when storage_error => raise space_overflow;
end put;
As we shall see below, rebalancing operations after insertions and after deletions
have very few differences, so only a single balancing procedure is provided,
which has a parameter that tells whether it works after an insertion or after a
deletion. The type for this parameter, which is declared in the package's body,
is

type mode is (insert_mode, remove_mode);
We suggest that the reader attempt to understand the balancing algorithm
focusing only on the insertion mode for the present. Later, after the
discussion about removal, he should read this algorithm again, then focusing
only on the removal mode.
In some circumstances, the increase in height of a subtree does not require
the tree to be reorganized. It may have become even more balanced than before.
Only in the case that the subtree that has increased its height was already higher
than its sibling before the insertion does the tree need some reorganization to
become balanced again:
[Figure: the three possible situations after the left subtree of p has increased
in height.]
In the first case, the increase in height of the left subtree has left the subtree
rooted at p even more balanced. Moreover, the new height of the subtree rooted
at p is the same as before the insertion. In the second case, even though the left
subtree has increased in height, the subtree rooted at p is still AVL-balanced
because the new difference in height of its two children is not greater than one.
However, the new height of the subtree rooted at p is greater than before the
insertion. So, although no reorganization is required at this level, h must keep
being true and we may not discard that some reorganization is required at higher
levels, on return from this recursive call to put. In the third case, the growth
of the left subtree has left the subtree rooted at p definitively unbalanced, as
the difference in height between its two subtrees is now 2. Consequently,
some reorganization is mandatory at this level. The procedure rebalance_left is
in charge of doing it.
The reorganization of nodes to leave a subtree balanced is one of the rare
and unfortunate circumstances in which it is required to consider more than two
levels of a tree, which makes the algorithms cumbersome. Two cases appear,
depending on whether the balancing factor of the subtree that has increased in
height is greater than 0 or not. The following figure illustrates the first case
before and after the rebalancing.
[Figure: the single rotation that promotes b: node a with left child b before,
and node b with right child a after.]
In this figure, the dashed lines mean that the subtree can have that additional
level of height or not. Being a binary search tree, it is clear that initially
k1 < k2, all the elements in subtree A have keys lower than k1, all the elements
in subtree B have keys greater than k1 and lower than k2, and all the elements
in subtree C have keys greater than k2. Therefore, the tree that results from
the rearrangement of nodes is a binary search tree too. Moreover, the resulting
tree is also balanced, as the difference in height of the subtrees of nodes having
keys k1 and k2 is, in both cases, at most 1. Depending on the height of B,
the resulting tree can be higher than before the insertion or not. If it is still
higher, the parameter h must keep being true, and we can not discard that
further rearrangements are required at higher levels of the tree on return from the
recursive call. The names a, b and b2 correspond to the meaning of the local
variables used in the procedure that makes the transformation.
Indeed, if B is initially higher than A, the tree that results from this transformation would be unbalanced. Therefore, a second case has to be considered,
which is illustrated by the figure below:
[Figure: the double rotation that promotes c, before and after.]
Being a binary search tree, it is clear that initially k1 < k2 < k3, all the
elements in subtree A have keys lower than k1, all the elements in subtree B1
have keys greater than k1 and lower than k2, all the elements in subtree B2 have
keys greater than k2 and lower than k3, and all the elements in subtree C have
keys greater than k3. Therefore, the tree that results from the rearrangement of
nodes is a binary search tree too. Moreover, the resulting tree is also balanced,
as the difference in height of the subtrees of nodes having keys k1, k2 and k3
is, in all the cases, at most 1. In this case, no matter what the heights of B1
and B2 are, the height of the subtree rooted at p after the insertion is the same
as before, so the parameter h must be set to false. The names a, b, c, c1 and
c2 correspond to the meaning of the local variables used in the procedure that
makes the transformation.
In either case, the algorithms work in three steps. First, pointers are set to
address the relevant nodes. Second, the links are updated. And third, the
balancing factors and the parameter h are updated, if required. The resulting
algorithm is:
procedure rebalance_left(p: in out pnode; h: out boolean; m: in mode) is
-- Either p.lc has increased in height one level (because of an insertion)
-- or p.rc has decreased in height one level (because of a deletion).
   a:      pnode; -- the node initially at the root
   b:      pnode; -- left child of a
   c, b2:  pnode; -- right child of b
   c1, c2: pnode; -- left and right children of c, respectively
begin
   a:= p; b:= a.lc;
   if b.bl<=0 then -- promote b
      b2:= b.rc;                                        -- assign names
      a.lc:= b2; b.rc:= a; p:= b;                       -- restructure
      if b.bl=0 then                                    -- update bl and h
         a.bl:= -1; b.bl:= 1;
         if m=remove_mode then h:= false; end if;       -- else h keeps true
      else -- b.bl=-1
         a.bl:= 0; b.bl:= 0;
         if m=insert_mode then h:= false; end if;       -- else h keeps true
      end if;
   else -- promote c
      c:= b.rc; c1:= c.lc; c2:= c.rc;                   -- assign names
      b.rc:= c1; a.lc:= c2; c.lc:= b; c.rc:= a; p:= c;  -- restructure
      if c.bl<=0 then b.bl:= 0; else b.bl:= -1; end if; -- update bl and h
      if c.bl>=0 then a.bl:= 0; else a.bl:= 1; end if;
      c.bl:= 0;
      if m=insert_mode then h:= false; end if;          -- else h keeps true
   end if;
end rebalance_left;
Then, the element having a key equal to that provided has to be looked for.
If found, the element has to be actually removed:
4. 7. BALANCED TREES I: A VL
87
The reader should be aware that the operations for balancing the tree are
the same as for insertion, but with inverted roles. Therefore, after insertion on
a left child, the procedure balance_left is invoked. Instead, after removing an
element from a left child, the procedure invoked is balance_right.
The procedure for actually removing an element has exactly the same four
cases as for simple binary search trees. Indeed, the removal of a node with less
than two children makes the tree decrease one level. The removal of the
leftmost element of the right child may have the effect of decreasing the right
child's height or not, which is indicated by the output parameter h:
procedure actual_remove(p: in out pnode; h: out boolean) is
-- Prec.: p.k = k
   plowest: pnode;
begin
   if    p.lc =null and p.rc =null then
      p:= null; h:= true;
   elsif p.lc =null and p.rc/=null then
      p:= p.rc; h:= true;
   elsif p.lc/=null and p.rc =null then
      p:= p.lc; h:= true;
   else -- p.lc/=null and p.rc/=null
      rem_lowest(p.rc, plowest, h);
      plowest.bl:= p.bl; -- plowest takes p's place, so it inherits its balance factor
      plowest.lc:= p.lc; plowest.rc:= p.rc; p:= plowest;
   end if;
end actual_remove;
where

procedure rem_lowest(p: in out pnode; plowest: out pnode; h: out boolean) is
-- Prec.: p/=null
begin
   if p.lc/=null then
      rem_lowest(p.lc, plowest, h);
      if h then balance_right(p, h, remove_mode); end if;
   else
      plowest:= p; p:= p.rc; h:= true;
   end if;
end rem_lowest;
According to this new figure, the reader should check that the algorithm
balance_left described on page 84 is also correct when the mode is for removing.
The procedure rebalance_left described on page 85, which reorganizes the nodes
to keep the tree balanced after an insertion, can also be applied to reorganize the
nodes after a deletion. Only the update of the parameter h must be reconsidered,
because its meaning is different in the two cases: in insertion mode it meant
that the tree's height had increased, whereas in remove mode it means that the
tree's height has decreased. The reader should review that algorithm from the
point of view of the adjustment of the value of h.
4.8
Overview
To achieve the rebalancing property, each node of a RB-tree has a single-bit
mark, which is commonly called the color of the node. By convention, the
two possible values of this mark are called red and black.
To grant that the tree grows and shrinks in a balanced way, each operation
on the tree must leave it satisfying two fundamental properties:
P1. No red node has a red child.
P2. Every path from the root to a leaf contains the same number of black nodes.
As a result of these two properties, the longest possible path from the root
to a leaf has alternately colored nodes, whereas the shortest possible one has
black nodes only. Therefore P1 and P2 imply that the longest path is at most
twice as long as the shortest one, as desired.
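This argument can be condensed into one line. Writing b for the common number of black nodes on a root-to-leaf path:

```latex
\ell_{\min} \ge b
\qquad\text{and}\qquad
\ell_{\max} \le 2b
\quad\Longrightarrow\quad
\ell_{\max} \le 2\,\ell_{\min},
```

since the shortest path may consist of black nodes only, whereas on the longest one no two red nodes can be adjacent.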
To make both the algorithms and the discussion easier, two complementary
properties are added to the fundamental ones:
P3.
P4.
²A more thorough analysis (see [5]) may grant that the maximum height of a RB-tree of n
nodes is h_max ≤ 2·log2(n + 1).
Indeed, there is a fifth possible case, not illustrated by the figure: the one in
which no conflicting red nodes exist, which does not require any reorganization.
[Figure: the conflicting configurations of red nodes: cases LL, LR, RL and RR.]
From any of these situations, the subtree rooted at p must be reorganized as:
[Figure: the subtree after reorganization: b becomes the root of the subtree,
colored red, with a and c as black children.]
The rebalancing algorithm uses an enumeration type for recording the cases:

type bal_case is (none, ll, lr, rl, rr);

which is declared local to the body of the package d_set. It invokes the function
id_bal_case to identify the reorganization case, and then changes the links and
colors accordingly:
procedure balance(p: in out pnode) is
   a, b, c: pnode;
begin
   case id_bal_case(p) is
      when none =>
         null;
      when ll =>
         c:= p; b:= c.lc; a:= b.lc;            -- assign names
         c.lc:= b.rc;                          -- link low descendants
         b.lc:= a; b.rc:= c; p:= b;            -- arrange top
         a.c:= black; c.c:= black; b.c:= red;  -- recolor the involved nodes
      when lr =>
         c:= p; a:= c.lc; b:= a.rc;
         c.lc:= b.rc; a.rc:= b.lc;
         b.lc:= a; b.rc:= c; p:= b;
         a.c:= black; c.c:= black; b.c:= red;
      when rl =>
         a:= p; c:= a.rc; b:= c.lc;
         c.lc:= b.rc; a.rc:= b.lc;
         b.lc:= a; b.rc:= c; p:= b;
         a.c:= black; c.c:= black; b.c:= red;
      when rr =>
         a:= p; b:= a.rc; c:= b.rc;
         a.rc:= b.lc;
         b.lc:= a; b.rc:= c; p:= b;
         a.c:= black; c.c:= black; b.c:= red;
   end case;
end balance;
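To trace one branch concretely, here is a hypothetical Python transcription of the ll case (the field names lc, rc and c mirror the Ada record; the helper is not part of the book's code):

```python
# Hypothetical mirror of the Ada "when ll" branch: c is the grandparent
# of the red-red conflict, b its left child and a the offending node.
class Node:
    def __init__(self, c, lc=None, rc=None):
        self.c, self.lc, self.rc = c, lc, rc

def balance_ll(p):
    c = p; b = c.lc; a = b.lc          # assign names
    c.lc = b.rc                        # link low descendants
    b.lc = a; b.rc = c                 # arrange top: b becomes the root
    a.c = "black"; c.c = "black"; b.c = "red"
    return b                           # the caller stores b back into p

# a red-red conflict on the left-left path
a = Node("red"); b = Node("red", lc=a); c = Node("black", lc=b)
root = balance_ll(c)
print(root is b, root.lc is a, root.rc is c)  # True True True
```

After the call, b is the new subtree root, colored red, with a and c as black children, exactly as the recoloring line of the Ada branch prescribes.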
The reader should realize that the last line of each branch (the recoloring) is
common to all the cases but the none case. So it could have been moved down to
appear only once. Nevertheless, we have preferred to leave it in this way to help
the reader to understand each case in isolation from the others.
The function id_bal_case is a little bit tedious, but it presents no difficulty. The
drawback of the number of checks per node (six in the worst case)
is reduced to three in the iterative version of the algorithm, because the stack
provides the data of the parent and grandparent of the conflicting node.
As the procedure put is simply recursive, obtaining its iterative equivalent
can easily be achieved through the techniques presented in chapter 3 of vol. III.
As binary search trees are linked downwards, the parent function can not
be computed and a stack is required. If so, each time a node becomes colored
red, identifying whether it has entered into conflict can be achieved by a single check,
as the pointer to its parent is just one position below the top of the stack, and
the identification of the reorganizing case (for the grandparent of the node just
colored red) requires only two additional checks, provided that the information of
whether a node is a left child or not is stored in the stack too, as the information
relative to the grandparent is just two positions below the top of the stack.
Such an optimization is left as an exercise to the reader; the recursive
version, which is far more readable, is presented here to help him understand
the main principles.
function id_bal_case(p: in pnode) return bal_case is
   a: pnode; bc: bal_case;
begin
   bc:= none;
   if p.lc/=null and then p.lc.c=red then
      a:= p.lc;
      if a.lc/=null and then a.lc.c=red then
         bc:= ll;
      elsif a.rc/=null and then a.rc.c=red then
         bc:= lr;
      end if;
   end if;
   if bc=none and then p.rc/=null and then p.rc.c=red then
      a:= p.rc;
      if a.lc/=null and then a.lc.c=red then
         bc:= rl;
      elsif a.rc/=null and then a.rc.c=red then
         bc:= rr;
      end if;
   end if;
   return bc;
end id_bal_case;
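As a cross-check of the case analysis, a hypothetical Python rendering of id_bal_case (dictionary-based nodes, not the book's representation) could be:

```python
# Hypothetical mirror of id_bal_case: a node is a dict
# {"c": color, "lc": left, "rc": right}; classify the red-red
# conflict below p, or return "none".
def red(n):
    return n is not None and n["c"] == "red"

def id_bal_case(p):
    if red(p["lc"]):
        a = p["lc"]
        if red(a["lc"]): return "ll"
        if red(a["rc"]): return "lr"
    if red(p["rc"]):
        a = p["rc"]
        if red(a["lc"]): return "rl"
        if red(a["rc"]): return "rr"
    return "none"

leaf = {"c": "red", "lc": None, "rc": None}
p = {"c": "black", "lc": {"c": "red", "lc": None, "rc": leaf}, "rc": None}
print(id_bal_case(p))  # lr
```

Note that initializing the result to "none" and falling through makes the uninitialized-variable pitfall of a naive if/elsif cascade impossible by construction.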
Remove
Removing an element from an RB-tree is more complicated than inserting it
due to the greater variety of cases to be considered. As for the case of put, the
algorithm that removes an element from the tree starts by a conventional binary
search tree deletion and afterwards, the tree is rebalanced if required.
Indeed, the rebalancing affects only the shape of the tree. So, in the case
that the element to be removed has two children and is replaced by the leftmost
element from its right subtree, the rebalancing must start at the position of
this latter node, that is where the tree has changed its shape, and then proceed
backwards to the root. As a consequence, from the point of view of rebalancing,
we shall consider that the vanishing node has at most one non-empty subtree.
Let us now address the different possibilities.
Case 1. The node deleted is red. In this case, the removal keeps the properties P1 and P2 unchanged, so no reorganization is required.
Case 2. The node deleted is black with a red child. In this case, the child
fills the hole created by the parent's removal. Then it is sufficient to color
it black to achieve that both P1 and P2 keep being satisfied.
Case 3. The node deleted is black with a black child. In this case, the
child fills the hole created by the parent's removal. However, in order to
keep P2 satisfied, the root of the new subtree must be colored as double
black (ie. its black weight accounts for two). This is an unstable situation that
implies that some rebalancing has to be done at higher levels. In the figure
below, which illustrates the situation, the double black node is indicated by an
asterisk. In the algorithm this fact is not recorded in the node, but indicated
by an output parameter of the subprogram in charge of the removal.
[Figure: case 3 — the double black node, marked with an asterisk, fills the hole.]
The analysis that follows assumes that D is a left child. When D is a right
child the discussion is symmetrical and is left as an exercise to the reader.
The cases that appear correspond to the possible colors of the nodes involved.
These possibilities are summarized in the table below, and the reader should
realize that it covers all the possibilities.
Colors of
S   SR  SL  P     subcase
R   -   -   -     R
B   R   -   -     BR
B   B   R   -     BBR
B   B   B   R     BBBR
B   B   B   B     BBBB

where a dash means that the color of that node is irrelevant for the case.
Subcase 3.2.R S is red. In this case, P is necessarily black, and the subtree
is rotated so that S takes the place of P; then S is colored black and P red.
After this arrangement, the number of black nodes in the path from D to
the root is the same as before the rebalancing. Therefore, it still has to be
considered as double black. The advantage of the change is that now D has a
black sibling, and the resulting subtree can now be further processed according
to the cases that are studied below.
Subcase 3.2.BR S is black and SR is red; P can be either red or black.
[Figure: the rotation that moves S to the root of the subtree.]
In this transformation SL keeps its color unchanged, whatever it is, and S
gets the color that P had before the changes. In the resulting structure, the
paths from the root to the leafs of D have one more black node than before and
therefore, for P2 to be satisfied, D need not be double black any longer.
The number of black nodes in the paths from the root to the leafs of both
SL and SR is the same as before the changes. Therefore, P2 is satisfied for all
the involved branches. Moreover, whatever the colors of P and SL are initially,
no red node in the resulting structure has a red parent and hence P1 is
satisfied too.
Subcase 3.2.BBR Both S and SR are black and SL is red. As empty trees
are considered black, the fact that SL is red implies that it is not empty. The
subtree is then restructured as shown in the figure below:
It is clear that in the new hierarchy P1 is satisfied, as no red node has a red
parent, and also that all paths from the root to the leafs of a, b and SR have
the same number of black nodes as before the reorganization, so P2 is preserved as
well.
Indeed, this transformation does nothing about the double black nature of D.
However, the new arrangement of nodes now satisfies the properties required by
subcase 3.2.BR, and that transformation shall be applied afterwards.
Subcase 3.2.BBBR S, SR and SL are all black and P is red. As P1 is
satisfied, the fact that P is red implies that its (possibly empty) children are
black. In these circumstances, the tree can be transformed as shown by the
figure below:
[Figure: the colors of P and S are exchanged.]
The exchange of the colors of P and S adds one black node to the paths
towards the leafs of D, so D need not be considered double black any longer,
whereas the number of black nodes in the paths towards the leafs of SL and
SR is unchanged.
Subcase 3.2.BBBB S, SR, SL and P are all black. In this case, S is colored
red, which decreases by one the number of black
nodes from the root to the leafs of S. As a result, this number is the same as
for the paths to the leafs of D. Therefore, the double black nature of D has to
be moved up to P and further rearrangements at higher levels shall be expected
to solve the double black nature of this latter node.
As a result of the previous discussion, the algorithm for remove starts with
the conventional remove operation for binary search trees and the rebalancing
is achieved on return from the recursive invocations. As the root plays a special
role, because it is the place where the black height increases (by coloring black
a root that the rebalancing algorithm has left red) or decreases (by ignoring the
double black nature the root may get as a result of a deletion), the recursive
algorithm is invoked from a special procedure on top. The recursive algorithm
returns an output parameter indicating whether the node provided as actual parameter
has turned double black as a result of the removal.
procedure remove(s: in out set; k: in key) is
   root: pnode renames s.root;
   dblack: boolean;
begin
   remove(root, k, dblack); -- ignore dblack (subcase 3.1)
end remove;
The removal for the general case does the conventional removal on binary
search trees. If the node provided as the actual parameter has become double
black, the appropriate rebalancing is applied, depending on whether the node
is a left or a right child:
For the simple cases, in addition to the actual removal, it has to be decided,
depending on the colors of the removed node and its replacing child, whether
the replacing node has to become double black to preserve P2. For the complex
case, in addition to the replacement by the leftmost node of its right child,
some rebalancing has to be considered. Although the crossing cases may make
the algorithm look somewhat complex, after a careful look the reader will
discover that it is quite simple.
procedure actual_remove(p: in out pnode; dblack: out boolean) is
-- Prec.: k = p.k
   plowest: pnode;
begin
   if p.lc=null and p.rc=null then
      if p.c=red then
         -- p is red (case 1): freely remove
         dblack:= false;
      else
         -- p is black with no red child (case 3)
         dblack:= true;
      end if;
      p:= null;
   elsif p.lc=null and p.rc/=null then
      if p.c=red then
         -- p is red: freely remove (case 1)
         p:= p.rc; dblack:= false;
      elsif p.rc.c=red then
         -- p is black with a red child (case 2)
         p:= p.rc; p.c:= black; dblack:= false;
      else
         -- p is black with no red child (case 3)
         p:= p.rc; dblack:= true;
      end if;
   elsif p.lc/=null and p.rc=null then
      if p.c=red then
         -- p is red: freely remove (case 1)
         p:= p.lc; dblack:= false;
      elsif p.lc.c=red then
         -- p is black with a red child (case 2)
         p:= p.lc; p.c:= black; dblack:= false;
      else
         -- p is black with no red child (case 3)
         p:= p.lc; dblack:= true;
      end if;
   else
      -- p.lc/=null and p.rc/=null
      rem_lowest(p.rc, plowest, dblack);
      plowest.lc:= p.lc; plowest.rc:= p.rc; plowest.c:= p.c;
      p:= plowest;
      if dblack then right_dblack_bal(p, dblack); end if;
   end if;
end actual_remove;
procedure rem_lowest
   (p: in out pnode; plowest: out pnode; dblack: out boolean) is
-- Prec.: p/=null
begin
   if p.lc/=null then
      rem_lowest(p.lc, plowest, dblack);
      if dblack then left_dblack_bal(p, dblack); end if;
   else
      plowest:= p;
      if p.c=red then
         -- p is red: freely remove (case 1)
         p:= p.rc; dblack:= false;
      elsif p.rc/=null and then p.rc.c=red then
         -- p is black with a red child (case 2)
         p:= p.rc; p.c:= black; dblack:= false;
      else
         -- p is black with no red child (case 3)
         p:= p.rc; dblack:= true;
      end if;
   end if;
end rem_lowest;
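The three-way case analysis that both actual_remove and rem_lowest apply to a node with at most one non-empty child can be condensed in a hedged Python sketch (names and representation hypothetical):

```python
# Hypothetical sketch of cases 1-3 for a node with at most one child.
# color is the color of the removed node; child is None or a
# (color, subtree) pair. Returns (replacement, dblack): the subtree
# that fills the hole and whether it counts as double black.
def remove_cases(color, child):
    if color == "red":
        return (child, False)                 # case 1: remove freely
    if child is not None and child[0] == "red":
        return (("black", child[1]), False)   # case 2: recolor the red child
    return (child, True)                      # case 3: the hole is double black

print(remove_cases("red", None))            # (None, False)
print(remove_cases("black", ("red", "T")))  # (('black', 'T'), False)
print(remove_cases("black", None))          # (None, True)
```

Only case 3 propagates the double black flag upwards, which is exactly what triggers the calls to left_dblack_bal and right_dblack_bal in the Ada code.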
The two procedures to rebalance the tree when a node has become double
black are symmetrical, and here only left_dblack_bal is described. The other one
is left as an exercise to the reader.
The procedure left_dblack_bal invokes the function id_case to identify the
particular case it has to deal with, which is described by the enumeration type
left_db_case, declared in the body of the package d_set as follows:

type left_db_case is (R, BR, BBR, BBBR, BBBB);

to describe the subcases of case 3.2 using the same convention as in the text
above, ie. the colors of S, SR, SL and P respectively. Then, it applies the
transformations that correspond to each case. The resulting algorithm is:
   when BR =>
      a:= s.lc; s.lc:= p; p.rc:= a; sr:= s.rc;
      s.c:= p.c; p.c:= black; sr.c:= black;
      p:= s;
   when BBR =>
      -- restructure s so that it can be reduced to case BR
      sl:= s.lc; b:= sl.rc; s.lc:= b; sl.rc:= s;
      s.c:= red; sl.c:= black;
      s:= sl;
      -- then apply BR
      a:= s.lc; s.lc:= p; p.rc:= a; sr:= s.rc;
      s.c:= p.c; p.c:= black; sr.c:= black;
      p:= s;
end left_dblack_bal;
The algorithm for id_case is trivial and it is left as an exercise to the reader.
4.9 B+ trees
Overview
Retrieving data from balanced binary trees is very efficient, as it requires about
log2 n key comparisons. So, finding a key on a set having 10,000 keys requires
about 14 key comparisons. A detailed analysis can prove that in the very worst
case it will never require more than 19 in AVL-trees and 28 in RB-trees.
However, if the set is to be stored on a disk, the technique for implementing
sets must be different. A physical access to a disk requires that the disk arm is
moved to the proper track, then waiting for the proper sector to pass under
the arm's head and finally transferring all the data contained in that sector.
The size of a sector is a characteristic of the disk and can not be changed. Usual
sizes range from 1 Kbyte to 8 Kbyte. In this section we will use the word page
as a synonym of disk sector.
Indeed, the time of comparing a key is insignificant compared with the sum
of the seek time plus the rotational delay plus the transfer time (an access to a
fast disk may take about 10^-2 s, whereas a key comparison on a fast processor
may take about 10^-6 s). Consequently, the aim of a technique for implementing
mappings upon secondary storage must be to minimize the number of physical
accesses to the disk, not the actual key comparisons.
The family of data structures known as B-trees is intended to achieve this
goal. This section describes in detail the most widespread variant of B-trees,
ie. the B+ trees. The other variants of the family are summarized at the end of
the section.
A B+ tree has two kinds of nodes: the leaves and the interior nodes. The
leaves contain the items and their corresponding keys, whereas the interior nodes
contain only keys and pointers. Interior nodes follow a similar principle as the interior
nodes of binary search trees, ie. to redirect the search towards the appropriate
subtree, and look like:
[Figure: an interior node, containing the pointers tp(0) .. tp(n) and the separator keys tk(1) .. tk(n).]
As the reader can see, there is one more pointer than there are keys. It is also important
to realize that the keys equal to a separator key are to be searched in the subtree
"at the right" of that separator.
The leaf nodes look as follows:
[Figure: a leaf node, containing its number of elements n and the couples (key, item) te(1) .. te(n).]
Each node of the tree is located in a different disk page. So, in principle,
each access to a node requires a physical access to the disk³.
The search principle of B+ trees is similar to that of binary search trees,
but as each node has many more subtrees, B+ trees
are much flatter and the operations to insert, remove and retrieve require a
much lower number of steps, ie. of physical accesses to the disk. Indeed, the more
items that fit into a leaf node, the lower is the number of leafs that are required.
Consequently, the sizes of both leafs and interior nodes should be adjusted to
best fit the size of a disk page. Moreover, as we shall see below, the algorithms
that manage the structure guarantee that each node, but the root, is filled to at
least one half of its capacity.
Even though the search principle of B+ trees is similar to that of binary
search trees, the way they grow and shrink is completely different. Binary
search trees grow at the leaves. Instead, B+ trees grow in height at the root:
when no more elements (either items or pointers) can be allocated in the_ root,
a new sibling is created for the old root and a new root is allocated on top of
them. Likewise, B+ trees decrease in height at the root: when as a result of
a deletion, the root has only one child, its only child becomes the new root.
This way of growing and shrinking in height has as a result that all the leaves
are always at the same depth. In other words, B+ trees are always completely
balanced.
For the sake of keeping a unified notation with the rest of the book, the
algorithms below describe the management of nodes as if they were allocated
in dynamic memory. In any practical implementation, the reader should replace
any operation of the type p:= new node(leaf); or p:= new node(interior); in
the algorithms below by a request to the operating system for a new disk page
and its formatting to keep either an interior node or a leaf.
³Operating systems and database management systems use buffering techniques that keep
the most recently accessed pages in primary memory, but for the present we may ignore such
kind of improvement.
type node;
type pnode is access node;
type nodetype is (leaf, interior);
max_leaf: constant natural:= 52;
max_int: constant natural:= 340;
type elem is
   record
      k: key;
      x: item;
   end record;
type t_pointers is array(natural range 0..max_int) of pnode;
type t_keys     is array(natural range 1..max_int) of key;
type t_elems    is array(natural range 1..max_leaf) of elem;
type node(t: nodetype) is
   record
      n: natural;
      case t is
         when leaf =>
            te: t_elems;
         when interior =>
            tk: t_keys;
            tp: t_pointers;
      end case;
   end record;
type set is
   record
      root: pnode;
   end record;
end d_set;
Indeed, the constants max_leaf and max_int must be adjusted to the sizes
of disk pages, keys and items. A completely general purpose package should
receive the disk page size as an additional generic parameter, and compute the
appropriate values for these constants by formulae that make use of the page
size plus the sizes of the types pnode, key and item. The latter can be obtained
by the attribute size, which provides the size in bits of any type. A general
purpose package should also receive the procedures to request a new disk page
from the operating system and to release it back when it becomes no longer used.
However, at this stage these technicalities would serve only to distract the reader
from the essential aspects of the data structure under study. Anyway, the values
proposed are typical, assuming a page of 4 Kbytes, a key of 8 bytes, a pointer
of 4 bytes and an item of 70 bytes.
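These constants can be re-derived mechanically. The sketch below (Python, for illustration only; the 8-byte per-node overhead assumed for n and the discriminant is a guess, not a figure from the text) recomputes them:

```python
# Hypothetical re-derivation of max_leaf and max_int from the sizes
# stated in the text. OVERHEAD (room for n and the discriminant) is
# an assumption.
PAGE, KEY, PTR, ITEM = 4096, 8, 4, 70
OVERHEAD = 8

# a leaf holds max_leaf couples (key, item)
max_leaf = (PAGE - OVERHEAD) // (KEY + ITEM)
# an interior node holds max_int keys and max_int + 1 pointers
max_int = (PAGE - OVERHEAD - PTR) // (KEY + PTR)
print(max_leaf, max_int)  # 52 340
```

Both results match the constants proposed in the package above.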
Empty
The representation of the empty set in B+ trees is as simple as:

procedure empty(s: out set) is
begin
   s.root:= null;
end empty;
Put
The procedure put delegates the insertion to the internal procedure putO, which may split the
node provided, in which case, on return, the value of ps (an abbreviation of p's
sibling) is different than null and kps is the lowest key of the elements currently
allocated in the subtree rooted at ps. In this situation, the tree must grow in
height, by the creation of a new root on top of the ancient root. The following
figures illustrate this growth. The initial situation is:
After the insertion of the new element in this tree, the tree is reorganized.
Consequently, the contents of the node pointed to by p may be updated and
occasionally split, with part of them moved to a new sibling. If so, the situation becomes:
The procedure putO inserts the element (k,x) into the subtree rooted at p.
As a result, it may happen that there is not enough space in p to allocate either
the element itself or the pointers to newly allocated pages at lower levels. If so,
the elements of p plus any new contents are redistributed between p and a new
right sibling of p, returned as parameter ps. In this case, on return ps/=null and
kps is the key of the lowest element in the subtree rooted at ps. If no sibling
has been created, on return ps=null and the value of kps is meaningless.
The code for putO is:
procedure putO
   (p: in pnode; k: in key; x: in item; ps: out pnode; kps: out key) is
   q: pnode;     -- child selected
   iq: natural;  -- index of the child selected (ie. q = p.tp(iq))
   qs: pnode;    -- new immediate right sibling of q, if required
   kqs: key;     -- key of the lowest element pending from qs
begin
   ps:= null;
   case p.t is
      when leaf =>
         if p.n < max_leaf then
            insert_in_leaf(p, k, x);  -- may raise already_exists
         else  -- p.n = max_leaf => tree increases in width at leaf level
            if is_in(p, k) then raise already_exists; end if;
            ps:= new node(leaf);
            distribute_in_leafs(p, ps, k, x, kps);
         end if;
When p is an interior node, the insertion requires two steps. First, the child
towards which the search for k must be directed is selected, according to the
following rule:

tp(0)   if k < k1
tp(i)   if ki <= k < ki+1   (1 <= i < n)
tp(n)   if kn <= k
The function select_child is in charge of doing it. Second, the element is inserted
recursively into the selected child. As a result, a new right sibling for the selected
child may (or may not) be generated. If so, the situation is illustrated by the
figure below:
If there is room enough to allocate the new couple of key and pointer in the
node pointed to by p, the insertion is achieved by shifting one position right all
the couples (key, pointer) starting from position iq + 1, leading to:
As in this case there is no more room in p to allocate the couple (qs, kqs),
a new right sibling has to be created and the existing couples plus the new one
must be evenly distributed between the two siblings.
The key of the lowest element stored in the subtree rooted at ps.tp(0) (in the
illustrating figures, kc) has to be promoted up, ie. passed as the out parameter
kps, because it will be needed at the upper level of recursion to complete the
appropriate structure:
Indeed, depending on the value of iq, the couple (qs, kqs) shall go either to
p, or to ps, or even become split: qs at position zero of ps and kqs promoted
up, in which case kps=kqs. The procedure distribute_in_interior is in charge of
these details.
The pending auxiliary subprograms are described later. At this stage it is
more important to focus on the essential aspects of the main operations.
Get and Update
The algorithms for retrieving and updating are quite simple. They just have
to follow the path towards the appropriate leaf, which is directed by the keys
stored in the interior nodes, and then look for the searched element in that leaf.
The procedure get invokes an internal procedure that receives a pointer type as
a parameter and actually does the job:
Remove
The operation to remove an element from a B+ tree is divided into two procedures: remove, which deals with the exceptional nature of the root and may
make the tree shrink in height, and removeO, which deals with arbitrary nodes
and may make the tree shrink only in width. The algorithm for remove is:
procedure remove(s: in out set; k: in key) is
   p: pnode renames s.root;
   kp: key;              -- irrelevant at root level
   kpu, unbal: boolean;  -- irrelevant at root level
begin
   if p=null then raise does_not_exist; end if;
   removeO(p, k, kp, kpu, unbal);
   if p.n=0 then  -- the tree decreases in height
      case p.t is
         when leaf     => p:= null;
         when interior => p:= p.tp(0);
      end case;
   end if;
end remove;
If the tree is empty, it is obvious that the element can not be removed.
Otherwise, the elimination is delegated to the procedure removeO. If, as a
result, p.n=0 and the root node is also a leaf, that leaf is the only node of the
tree and moreover it contains no element. Therefore the node can be released.
This case is so simple that it does not require any illustrating figure.
If the root node is an interior node and p.n=0, this means that only p.tp(0)
contains a significant pointer. If so, the node addressed by this pointer can
become the new root, and the ancient root can be released. In either case, the
tree's height decreases one level. The figure below illustrates the transformation:
The procedure removeO receives, besides the key k, the node p of the subtree
the element is to be removed from. It provides as output parameters a boolean
unbalp (abbreviation of unbalanced p), which is true if and only if, as a result of
the removal, the occupation of the node p is below one half of its capacity. For
interior nodes, the capacity is measured in the number of pointers, which is one
more than the number of keys. Additionally, it returns a second boolean, kpu
(abbreviation of kp updated), which tells whether, as a result of the removal,
the key of the lowest element stored in the subtree rooted at p has been updated.
If true, the output parameter kp is the key of the new lowest element stored in
the subtree rooted at p.
The case that p is a leaf is rather simple. It removes the couple (k,x) from
the sorted array in the node. Indeed, the procedure remove_from_leaf, which
is in charge of the details, may raise the exception does_not_exist if no such
couple is found. If, after the removal, the node has become filled under one half
of its capacity, the appropriate output parameter must be set.
The case that p is an interior node is more complex. First, the child that may
contain the element to be removed has to be selected. The function select_child
is in charge of the details. Then, the removal proceeds recursively upon the
selected child q. On return, two adjustments are required. On the one hand, if the
lowest key of the elements now stored at the subtree rooted at q is different
than before the removal, the new key must be placed at the appropriate switch
position. If the position of the selected child q in p is greater than 0, this position
is in the node p itself. The figure below illustrates the situation:
On the other hand, if the removal has left the node q filled below one half
of its capacity, its elements have to be redistributed with those of one of its
immediate siblings. In principle, the algorithm arbitrarily selects the right one,
unless there is no such right sibling. In either case, pls and prs indicate the left
and right siblings whose elements will be distributed. The procedure distribute
is in charge of redistributing evenly the elements currently in pls and prs and
of returning the lowest key in prs after the redistribution. However, if the sibling
selected was also at one half of its capacity, all the elements from both siblings
can be placed in a single node. If so, the procedure distribute puts all of them in
the left sibling and, on return, prs.n=0.
If, on return from distribute, two siblings still exist, the switching key must
be placed at the proper position in p:
where the subtree marked as empty (ie. X) has a single node that contains no
pointers. After filling the gap, the empty child has disappeared. This is the way
the tree shrinks in width. The procedure remove_child is
in charge of the details. Indeed, after such a shift, p may become filled under
one half of its capacity, and this fact must be propagated upwards through the
parameter unbalp. The reader should realize that leafs are considered to be
below one half of their capacity if n <= max_leaf/2, whereas for interior nodes
the condition is n < max_int/2. This is because the capacity of interior nodes,
measured in pointers, is max_int + 1.
The auxiliary operation select_child
This operation is used by putO, getO, updateO and removeO to switch towards
the appropriate child. It receives a pointer p to an interior node plus a key k,
and returns the index of the child of p that has any chance to contain an item
with key k, or else the child where a new element with key k must be inserted.
The algorithm, which uses binary search, is:
function select_child(p: in pnode; k: in key) return natural is
   n: natural renames p.n;
   tk: t_keys renames p.tk;
   i, j, m: natural;
begin
   i:= 1; j:= n;
   while i<=j loop
      m:= (i+j)/2;
      if tk(m)>k then j:= m-1; else i:= m+1; end if;
   end loop;
   return j;
end select_child;
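The loop invariant (on exit, the result is the number of separator keys not greater than k) can be checked with a small Python mirror; a 0-based list replaces the 1-based Ada array, so j+1 plays the role of the Ada j:

```python
# Hypothetical 0-based mirror of select_child: keys is the sorted list
# of separator keys; the result is the index of the child pointer to
# follow, ie. the number of separators <= k (0 .. len(keys)).
def select_child(keys, k):
    i, j = 0, len(keys) - 1
    while i <= j:
        m = (i + j) // 2
        if keys[m] > k:
            j = m - 1
        else:
            i = m + 1
    return j + 1

keys = [10, 20, 30]
print([select_child(keys, k) for k in (5, 10, 25, 99)])  # [0, 1, 2, 3]
```

Note that a key equal to a separator (k = 10 above) is directed to the subtree at the right of that separator, as stated in the overview.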
The auxiliary operation distribute_in_leafs
This operation is used by putO when the leaf where a new element has to be
inserted is completely full and moreover no element in that leaf has a key equal
to that of the element to be inserted. It receives a pointer p to a leaf which is
full, a pointer ps to an empty leaf that is to become the new right sibling of
p, plus the element to be added, ie. (k,x). It distributes evenly the elements
currently in p plus the couple (k,x) into p and ps. Moreover, it returns the
output parameter kps, which is the key of the lowest element allocated in ps.
procedure distribute_in_leafs
end distribute_in_interior;
The auxiliary operation remove_child
This operation is used by removeO when a child of an interior node has
become empty, because all its elements have been moved to its left sibling by
a distribute operation. If so, all the keys and pointers at the right of the empty
child have to be shifted left to fill the gap. It receives as parameters a pointer p
to an interior node and iq, the position in p of the child that has become empty.
This position is never zero.
It is rather obvious that the execution time of all the operations but empty is
O(log n), and that of the operation empty is O(1). However, it is important to realize
that B+ trees are actually very flat. Let us consider, for illustrating purposes,
a data set of n = 6,000,000 items having data sizes for keys, data and pointers
leading to the realistic values of max_leaf=52 and max_int=340.
In the best case, there would be n1 = ⌈n/52⌉ = 115,385 leafs. The root
would keep 341 pointers and the nodes at the intermediate levels would keep
341 pointers too. Therefore, only one intermediate level is required because
341 × 341 = 116,281 > n1. Consequently, an item is found in 3 disk
accesses, one for the root, one for the intermediate level and one for the leaf.
In the worst case, the leafs would be filled up to one half of their capacity.
So, the number of leafs in the worst case is n' = 2·n1 = 230,770. Moreover,
the root may contain only two pointers and the nodes at the intermediate levels
must contain at least 171 pointers. Therefore, one more intermediate level is
required, because

2 × 171 × 171 = 58,482 < n'   but   2 × 171 × 171 × 171 = 10,000,422 > n'.

So, if a third intermediate level existed, some of the nodes in the intermediate
levels should contain less than 171 pointers, which may not happen. Consequently, an item is found in 4 disk accesses, one for the root, two for the
intermediate levels and one for the leaf. In a fast device, these four accesses
may cost about 40 milliseconds.
Indeed, the operations for insertion and deletion can cost a bit more, because
some siblings may become involved too along the path. In the very worst case
this implies at most twice as many nodes. Moreover, the updated nodes have
to be written back to the disk. Therefore, in the very worst case, in which all the
nodes in the path from the root have to be merged or distributed with a sibling,
the total cost of an insertion or a removal can be up to four times the cost of a
retrieval.
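The arithmetic of this analysis is easy to re-check mechanically; a small Python sketch using the constants stated in the text might be:

```python
import math

# Re-check of the figures above; n, max_leaf and max_int are the
# values stated in the text.
n, max_leaf, max_int = 6_000_000, 52, 340

leafs = math.ceil(n / max_leaf)     # best case: full leafs
print(leafs, (max_int + 1) ** 2)    # 115385 116281: one intermediate level

leafs_w = 2 * leafs                 # worst case: half-full leafs
half = max_int // 2 + 1             # minimum pointers per interior node
print(2 * half ** 2, leafs_w, 2 * half ** 3)  # 58482 230770 10000422
```

Since 116,281 ≥ 115,385, one intermediate level suffices in the best case; since 58,482 < 230,770 ≤ 10,000,422, exactly two are needed in the worst case.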
4.10 Tries
Overview
For alphanumeric keys or, in general, for keys that have the form of arrays
having their components into a small range (ie. the same kind of keys that
allow the Binsort algorithm studied in section 5.7 of vol. III to be applied),
there is the possibility to organize the set according to the key components. For
instance, the set {blind, blow, blue, tooth, train, training, traverse, tree} can
be represented by a tree where all the keys having the same initial letter pend
from the same subtree of the root; then each of those subtrees is organized
likewise according to the second component of the key, and so on. Indeed, each
leaf corresponds to a key in the set, but some interior nodes may represent keys
as well, if a key is a proper prefix of another one. To make the appropriate
distinction, an additional child must be appended to each node that actually
represents a key, associated to a special symbol that acts as an end mark, as in
the figure below:
[Figure: the trie for the set {blind, blow, blue, tooth, train, training, traverse, tree}; the symbol $ marks each node that represents a complete key.]
Such a structure is called a trie, which is a neologism that results from the
contraction of tree and retrieval. To avoid confusion with other trees, it is
usually pronounced as though it rhymed with pie. It is clear that finding whether a key
is stored in a trie requires at most as many steps as there are symbols in the key. As
the key lengths are usually bounded and, in any case, they are independent of
the number of keys stored, we can expect that the execution time for searches,
as well as for most of the other operations, is O(1).
The description of tries below assumes that they implement pure sets (ie. not
mappings) and also that the keys are strings of symbols. This is reasonable
because it simplifies the description of the algorithms and also because the
most frequent application of tries is spell checkers. However, at the end of the
section, the ways to extend the technique to implement mappings and to apply
it to more complex types of keys are discussed.
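Although the book's algorithms are written in Ada, the principle can be sketched in a few lines of Python (a hypothetical dictionary-based node, using the $ end-mark convention described above):

```python
# Hypothetical dictionary-based trie: each node maps a symbol to a
# child node, and the end mark "$" maps a node to itself when that
# node represents a complete key.
def empty():
    return {}

def put(s, word):
    p = s
    for c in word:
        p = p.setdefault(c, {})
    p["$"] = p  # this node represents a complete key

def is_in(s, word):
    p = s
    for c in word:
        if c not in p:
            return False
        p = p[c]
    return p.get("$") is p

s = empty()
for w in ("blind", "blow", "blue", "tooth", "train", "training", "traverse", "tree"):
    put(s, w)
print(is_in(s, "train"), is_in(s, "tra"))  # True False
```

The prefix "tra" is reachable in the structure but carries no $ child, so it is correctly rejected, while "train", a proper prefix of "training", is accepted.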
The keys
When variable length keys are stored in a fixed length array, some kind of end
mark is required. As for the radix sort, we shall assume that the special symbol
that works as end mark is key_component'first. We shall assume also that the
keys are well constructed, ie. that such a mark is always present in the provided
keys.
To improve the readability of the algorithms, a few constants are declared
in the body of d_set, that will stand as shortcuts for their wordy equivalents:

mk: constant key_component:= key_component'first; -- end mark
lastc: constant key_component:= key_component'last;
i0: constant key_index:= key_index'first;
Empty
The empty set is represented by a tree that has only a root node. Such a tree
represents the acceptance of the empty string as a prefix. However, as the empty
string is not a key, the root has no child labelled $. Consequently,
procedure empty(s: out set) is
root: pnode renames s.root;
begin
root:= new node;
root.all:= (others=> null);
end empty;
Is in
The search for a key in a trie is a traversal of a tree's branch, going down the
tree from the root along the corresponding path, switching according to the
corresponding symbols in the key. Indeed, if for a given symbol in the key no
corresponding branch is found, the traversal is interrupted. If the end mark of
the key is reached, it must be checked that the corresponding node actually has
a child labelled $, i.e. that the pointer that corresponds to $ points to the node
itself:
function is_in(s: in set; k: in key) return boolean is
root: pnode renames s.root;
p: pnode;
i: key_index;
c: key_component; -- a copy of k(i)
begin
p:= root; i:= i0; c:= k(i);
while c/=mk and p.all(c)/=null loop
p:= p.all(c); i:= i+1; c:= k(i);
end loop;
return c=mk and p.all(mk)=p;
end is_in;
Put
The insertion algorithm is very much like the search algorithm. However, in
this case when no child exists for the current symbol in the key, a new child is
created for that symbol and the traversal goes on up to the end of the key. The
node that corresponds to the last significant symbol in the key is pointed to
itself to indicate that that node actually represents a key, and therefore it has,
conceptually, a child labelled $:
procedure put(s: in out set; k: in key) is
root: pnode renames s.root;
p: pnode;
i: key_index;
c: key_component;
begin
p:= root; i:= i0; c:= k(i);
while c/=mk loop
if p.all(c)=null then
p.all(c):= new node;
p.all(c).all:= (others=> null);
end if;
p:= p.all(c); i:= i+1; c:= k(i);
end loop;
p.all(mk):= p; -- conceptual child labelled $
end put;

Remove
A first version of the deletion algorithm traverses the branch as in is_in and,
if the key is actually found, unlinks the conceptual child labelled $:

procedure remove(s: in out set; k: in key) is
root: pnode renames s.root;
p: pnode;
i: key_index;
c: key_component;
begin
p:= root; i:= i0; c:= k(i);
while c/=mk and p.all(c)/=null loop
p:= p.all(c); i:= i+1; c:= k(i);
end loop;
if c=mk and p.all(mk)=p then
p.all(mk):= null;
end if;
end remove;
This version is correct, to the extent that the deleted key will not be found
any longer. Nevertheless, it does not release the nodes that may become
superfluous. For instance, if this algorithm is applied to delete the key reverse
in the example shown at the beginning of the section, the child labelled $ of
the node associated to the prefix reverse would actually disappear, but the
branch associated to the suffix verse would remain unnecessarily attached to the
structure.
Consequently, the previous version must be improved by keeping a pointer
to the last node in the traversal that does not have a unique descendant, and
also the symbol that was used to switch from that node down. If the key to be
deleted is actually found in the set, and it is not a prefix of another key, then
the full branch can be disconnected from the structure and its nodes released.
The root is somewhat exceptional because if the spurious branch hangs from
the root it can be removed even if it is its only branch. The improved algorithm
follows:
procedure remove(s: in out set; k: in key) is
root: pnode renames s.root;
p, r: pnode;
i: key_index;
c, cr: key_component;
begin
p:= root; i:= i0; c:= k(i); r:= p; cr:= c;
while c/=mk and p.all(c)/=null loop
if not unique_desc(p) then r:= p; cr:= c; end if;
p:= p.all(c); i:= i+1; c:= k(i);
end loop;
if c=mk and p.all(mk)=p then -- the element exists
if unique_desc(p) -- is not a prefix of a longer key
then r.all(cr):= null; -- unlink spurious branch, if any
else p.all(mk):= null; -- remove only the $ branch
end if;
end if;
end remove;
where

function unique_desc(p: in pnode) return boolean is
c: key_component;
nd: natural; -- number of descendants found
begin
c:= mk; nd:= 0;
while c/=lastc and nd<2 loop
if p.all(c)/=null then nd:= nd+1; end if;
c:= key_component'succ(c);
end loop;
if p.all(c)/=null then nd:= nd+1; end if; -- check the case c=lastc
return nd<2;
end unique_desc;
Is empty
As the empty set is represented by a single node, i.e. the root with no
descendants, checking whether the set is empty is very similar to the algorithm
for unique_desc: it suffices to check that the root has no children.
4.11 Hashing

4.11.1 Introduction
Any implementation of a mapping converts a key value into a position where the
associated data is stored. A possible approach is to compute that position
from the value of the key itself. For instance, if the keys are strings and the
actual data are stored in an array indexed from 0 to b-1, a possible approach
might be to perform some operation on the values of the characters in the
string that finally leads to a number in the range [0, b-1].
There are many ways to achieve this goal, from the simplest to the most
sophisticated. A simple approach, which is far from the best, may serve as a starting point.
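As an illustration of the idea (a sketch, not the text's exact function), a simple string hash can combine the character codes and reduce the result modulo b:

```python
# A deliberately simple string hash, only to illustrate the idea of
# reducing a key to an index in [0, b-1]; real hash functions mix the
# bits far more thoroughly. The multiplier 31 is an arbitrary choice.
def hash_string(key: str, b: int) -> int:
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) % b
    return h
```

Any function of this shape maps every possible key to a valid index; the quality of the function lies in how evenly the indexes are distributed.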
4.11.2 Open hashing

Overview
In open hashing, the synonyms, i.e. the elements with keys having the same
value of hash(k), are organized in a linked list. The figure below illustrates the
structure:

[Figure: a dispersion table dt, indexed from 0 to b-1, where each component points to a linked list of synonym elements.]

The hash function hash takes the key k and converts it into a natural number
in the range [0, b-1], that is used as an index into a dispersion table dt. Each
component of dt is a pointer to a list of elements having keys k' such that
hash(k') = hash(k).
A good hash function should distribute the possible key values evenly
among the components of dt. If so, we may expect that after storing n different
elements, the lists of synonyms have lengths that are approximately n/b. The value
α = n/b is called the load factor of the hash table. If we
choose a value for b that makes α = 1, the average length of these lists will be 1,
so we may expect that the search and modification algorithms will be very fast.
A detailed probabilistic analysis is developed below.
type node;
type pnode is access node;
type node is
record
k: key;
x: item;
next: pnode;
end record;
type dispersion_table is array(natural range 0..b-1) of pnode;
type set is
record
dt: dispersion_table;
end record;
end d_set;
Empty
The empty set is represented by a dispersion table where all the lists are empty.
Setting them empty is as simple as:
procedure empty(s: out set) is
dt: dispersion_table renames s.dt;
begin
for i in 0..b-1 loop
dt(i):= null;
end loop;
end empty;
Put
The insertion of an element in the set is achieved by applying the hash function
to the key to obtain a position in the dispersion table to get a pointer. Then,
the list that starts from that pointer has to be traversed to discard the storage
of duplicated keys. If no element with the same key exists in the list, a new
block has to be requested to allocate the new element. Then the new block has
to be inserted in the list. Indeed, the insertion place that makes the algorithm
easiest is at the beginning of the list. The detailed algorithm is:
procedure put(s: in out set; k: in key; x: in item) is
dt: dispersion_table renames s.dt;
i: natural;
p: pnode;
begin
i:= hash(k, b); p:= dt(i);
while p/=null and then p.k/=k loop
p:= p.next;
end loop;
if p/=null then raise already_exists; end if;
p:= new node;
p.all:= (k, x, dt(i));
dt(i):= p;
end put;

Get
The retrieval applies the same traversal and, in the case of success, copies the
associated item:

procedure get(s: in set; k: in key; x: out item) is
dt: dispersion_table renames s.dt;
p: pnode;
begin
p:= dt(hash(k, b));
while p/=null and then p.k/=k loop
p:= p.next;
end loop;
if p=null then raise does_not_exist; end if;
x:= p.x;
end get;

The algorithm for update is exactly the same but replacing x:= p.x; with
p.x:= x;
Remove
The same principle applies to the removal. The hash function has to be applied
to the key to obtain a position in the dispersion table. From this position a
pointer is obtained to the beginning of the only list that may contain an element with
that key. Then, an element with a key equal to that provided has to be searched for
in the list that starts from that pointer and, in the case of success, the element
is actually removed from the list. Indeed, the removal requires that a pointer
to the previous element in the list is kept along the traversal.
procedure remove(s: in out set; k: in key) is
dt: dispersion_table renames s.dt;
i: natural;
p, pp: pnode;
begin
i:= hash(k, b); p:= dt(i); pp:= null;
while p/=null and then p.k/=k loop
pp:= p; p:= p.next;
end loop;
if p=null then raise does_not_exist; end if;
if pp=null then dt(i):= p.next; else pp.next:= p.next; end if;
end remove;
Cost analysis
After the insertion of n elements, they are distributed in b lists, so the average
length of the lists is n/b. As we have assumed that the size of the dispersion
table is chosen to match the number of elements that are expected to be stored,
the average length of the lists is 1. Indeed, if the hash function does not perform
well, this technique has a worst case in which all the elements go to the same list.
If so, this technique would perform as badly as an unsorted linked list. Instead,
if the hash function is good, we may expect that the lists will in general be
short. Therefore, the algorithms for insertion, deletion and retrieval, which make
as many steps as there are elements in these lists, are efficient too. However, it is
required to make the concept of "short" more precise. In other words, we have
to establish the probabilistic bounds for these lengths and, in particular, the
average length of the non empty lists.
In the study of these bounds, we shall assume that the hash function is
so good that each key is equally likely to hash to any of the positions of the
dispersion table. Later we shall study the algorithms to obtain hash functions
that achieve this goal reasonably well.
Let us consider one of the lists. The probability that this list is empty is the
probability that along the n insertions, the selected list has always been another
one. Therefore:

p0 = ((b-1)/b)^n

As the total number of lists is b, the total number of non empty lists is
b(1-p0) and the average length of the non empty lists is

L = n/(b(1-p0))
This length is a growing function of n because (b-1)/b < 1 implies that
((b-1)/b)^n is decreasing. However, if the size of the dispersion table is adjusted
to make its load factor α = 1 (i.e. b = n), the growth of L is bounded and for
large sets we have that

lim_{n→∞} n/(n(1 - ((n-1)/n)^n)) = 1/(1 - 1/e) = e/(e-1) ≈ 1.582
As this is the average length of the non empty lists, and all these lists have
at least one element, we may expect that very few of them will have more than
two.
Another way to realize how short these lists are is to consider the probability
that a given list has length i. For i = 0 we have established it above. The
probability that after n insertions a given list has length exactly one is the
probability that this list has been selected once and has not been selected for
the remaining n-1 insertions. However, the actual selection can occur in any
one of the n insertions. Consequently,

p1 = n (1/b) ((b-1)/b)^(n-1)
In general, the probability that after n insertions a given list has length i is
the probability that it has been selected i times and has not been selected the
remaining n-i. However, the actual selections can be any i among the total n,
i.e. C(n,i):

p_i = C(n,i) (1/b)^i ((b-1)/b)^(n-i)

Indeed, the probability that a list has length l ≤ i is P_i = Σ_{j=0..i} p_j and the
probability that a list has length l > i is P̄_i = 1 - P_i.
The following figures show how small the probability of having long lists is,
no matter how many items are stored in the set:

b = n      P̄_3      P̄_4      P̄_5
10         1.279%   0.164%   0.015%
100        1.837%   0.343%   0.053%
1000       1.891%   0.362%   0.057%
10,000     1.914%   0.382%   0.076%
The conclusion of this analysis is that the operations for insertion, retrieval
and deletion do not require more than a limited number of steps, which is
independent of the number of elements stored. Therefore, these operations can
be considered to be O(1).
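The limit derived above can be checked numerically (a quick Python sketch, not part of the text):

```python
import math

# Numerical check of the analysis above: with load factor 1 (b = n),
# the average length L = n / (b * (1 - p0)) of the non-empty lists
# approaches e/(e-1) as the set grows.
def avg_nonempty_length(n: int, b: int) -> float:
    p0 = ((b - 1) / b) ** n        # probability that a given list is empty
    return n / (b * (1 - p0))

print(avg_nonempty_length(10_000, 10_000))   # close to e/(e-1), about 1.582
```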
4.11.3 Closed hashing

Overview
In closed hashing all the elements are stored in the dispersion table itself. In
case of a collision, an alternative position is selected in the same table, by a
mechanism called rehashing.
to each other. In this case, the two lists overlap and the number of steps of the
operations for storing or retrieving an element can become twice as many as
would be required.
Quadratic hashing avoids this drawback by making jumps that seldom collide
when rehashing starts from different positions.
procedure put(s: in out set; k: in key; x: in item) is
d: dispersion_table renames s.dt;
ne: natural renames s.ne;
i0, i: natural; -- initial and current position
na: natural; -- number of attempts
begin
i0:= hash(k, b); i:= i0; na:= 0;
while d(i).st=used and then d(i).k/=k loop
na:= na+1; i:= (i0+na*na) mod b;
end loop;
if d(i).st=used then raise already_exists; end if;
if ne=max_ne then raise space_overflow; end if;
d(i).st:= used; -- mark cell as used
d(i).k:= k; d(i).x:= x; -- fill cell
ne:= ne+1; -- count cell
end put;
(i0 + na^2) mod b

or, equivalently, since na^2 = (na-1)^2 + 2·na - 1, each new position can be
obtained from the previous one by adding 2·na - 1 (mod b).
Removal in closed hashing is cumbersome. We can not simply mark the component
containing the deleted key as free because that component may not be
the last one in its list. If so, such a mark would make the components of the list
that come after the deleted one become inaccessible. Therefore, if removal
is required, the mark should have a third possible state, namely deleted.
Then, the algorithms for insertion and retrieval should skip the components
marked as deleted, because only the components marked as free actually identify
the end of the list. Indeed, the algorithm for insertion, after checking that no
key equal to that to be inserted is in the set, must reuse one of the components
marked as deleted to allocate the new element, if any has been found.
Although this mechanism is possible in principle, it has the drawback of
making the lists grow longer than would be required in the absence of removals
(for the same number of elements finally stored). Alternatively, the elements of the
list after the deleted element might be shifted down to fill the gap. However, the
algorithm is not simple because only the elements with a key that is mapped
to the same initial position have to be shifted. If not, they belong to other lists
and must remain in place.
To sum up, for applications that require frequent insertions and deletions,
the right choice should be open hashing. Instead, when only insertions are
foreseen, closed hashing can be simpler.
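The deleted-state mechanism just described can be sketched as follows; this is an illustrative Python model with quadratic rehashing, in which all names are assumptions and the table is assumed never to become completely full:

```python
# Sketch of closed hashing with quadratic rehashing and a third
# "deleted" state (a tombstone), as described above.
FREE, USED, DELETED = 0, 1, 2

class ClosedHashSet:
    def __init__(self, b):
        self.b = b
        self.state = [FREE] * b
        self.keys = [None] * b

    def _probe(self, k):
        # quadratic rehashing: i0, i0+1, i0+4, i0+9, ... (mod b)
        i0 = hash(k) % self.b
        for na in range(self.b):
            yield (i0 + na * na) % self.b

    def put(self, k):
        slot = None
        for i in self._probe(k):
            if self.state[i] == USED and self.keys[i] == k:
                raise KeyError("already exists")
            if self.state[i] == DELETED and slot is None:
                slot = i                  # first reusable tombstone
            if self.state[i] == FREE:
                if slot is None:
                    slot = i
                break                     # a free cell ends the list
        self.state[slot], self.keys[slot] = USED, k

    def contains(self, k):
        for i in self._probe(k):
            if self.state[i] == FREE:     # only FREE stops the search;
                return False              # DELETED cells are skipped
            if self.state[i] == USED and self.keys[i] == k:
                return True
        return False

    def remove(self, k):
        for i in self._probe(k):
            if self.state[i] == FREE:
                raise KeyError("does not exist")
            if self.state[i] == USED and self.keys[i] == k:
                self.state[i] = DELETED   # mark, do not free
                return
```

Note how put may reuse a tombstone, but only after the scan has reached a free cell and confirmed that the key is absent.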
Cost analysis
As for open hashing, we shall assume that any key is equally likely to hash to
any of the b elements of the dispersion table. Moreover, we shall assume that
the rehashing function selects any one of the components not yet examined with
uniform probability.
Let p; and q; be
Pi: probability of examining exactly i components in a successful
The average number of steps of an insertion is then

S = 1 + Σ_{i=1..∞} i·p_i

and, as p_i = q_i - q_{i+1},

Σ_{i=1..∞} i·p_i = Σ_{i=1..∞} i·(q_i - q_{i+1}) = Σ_{i=1..∞} i·q_i - Σ_{i=1..∞} i·q_i + Σ_{i=1..∞} q_i = Σ_{i=1..∞} q_i

so that S = 1 + Σ_{i=1..∞} q_i.
On the other hand, q_1 is the probability that the first component examined
is occupied. Therefore q_1 = n/b, as there are n occupied components in the
table. Likewise, q_2 is the probability that the first component examined is
occupied and the second one is occupied too. Under the assumption that the
rehashing function selects (equiprobably) one component among the b-1 not
yet examined, as n-1 of these components are occupied, q_2 = (n/b)·((n-1)/(b-1)) and, in
general:
q_i = (n/b)·((n-1)/(b-1))·((n-2)/(b-2)) ··· ((n-i+1)/(b-i+1)) < (n/b)^i = α^i   (because n < b)
Therefore

S = 1 + Σ_{i=1..∞} q_i < Σ_{i=0..∞} α^i = 1/(1-α)   (because α < 1)
As α gets close to 1, this upper bound for the number of steps of
the operations of insertion and retrieval grows extremely fast. Nevertheless,
for α < 0.8 this upper bound is rather small: S < 5. Consequently, if the
algorithms guarantee that the load factor does not exceed 80%, we can guarantee
that the number of steps is bounded by a small factor that is independent of
the number of elements stored and, therefore, that the operations for insertion
and retrieval are O(1).
An application remark
4.11.4
The technique of hashing can also be applied to sets that have to be kept in
secondary storage. As for B+ trees, in this case the goal is to minimize the
physical accesses to disk that are required by the operations that consult or
modify the set. Consequently, the dispersion table must be sized to fill a disk
page and the lists of synonyms must link disk pages too, where each page must
contain as many elements as possible, as the figure below shows:
[Figure: a dispersion table dt, indexed from 0 to b-1, kept in one disk page, whose components point to linked lists of disk pages of elements.]
To grant that the disk pages are reasonably filled, b must be set to 2n/ML,
where ML is the maximum number of elements that fit into a cell of a data list. If
so, we may expect that on average the data cells are filled to about one half of
their capacity, which implies that the data elements are spread through about b
cells and, consequently, that the average length of the non empty lists is still
e/(e-1) ≈ 1.58.
Indeed, if the required size of the dispersion table is so large that it can
not fit into a single disk page, additional disk pages can be used. Nevertheless,
even for single disk page dispersion tables, the set sizes can be rather large. For
instance, if we take the same sizes of disk pages, key type and item type of the
example that illustrated how flat B+ trees are (see section 4.9), we had that
ML = 50. The disk page size considered (2KB) can keep a dispersion table of
up to 512 pointers of 32 bits, Mv = 509 (the first prime number immediately
below 512), which gives an upper bound for the size of the represented sets (to
keep reasonable execution times) of U = ML·Mv = 25,450.
4.11.5 Extendible hashing

Overview
Any of the variants of hashing considered hitherto assumed that the expected
size of the sets to be represented is known beforehand and that, therefore, the
size of the dispersion table can be established accordingly at compile time.
However, in practice, this size can seldom be predicted and the increase (or
decrease) in size of the dispersion table has to be done at run time. According to
the algorithms considered hitherto, a change in the size of the dispersion table
implies a complete reorganization of the structure, in which new hash positions
have to be computed for all the keys currently stored and new links have
to be established accordingly. The cost of those changes is clearly unaffordable
if it can be the result of any insertion or deletion, especially in the case of large
sets.
Extendible hashing is a technique that allows the dispersion table to grow
and shrink in a cost effective way. The technique may be adapted to primary
storage too, although it has been developed for the implementation of indexes
in database systems, and therefore having secondary storage in mind.
In this approach, the hash image of a key, i.e. the result provided by the hash
function, is a natural number with a reasonably large number of bits (e.g. 32
or 64), in which each bit has a uniform probability of being either 0 or 1. From
this image, only the least significant d bits will be taken into account and the
others will be discarded. The value of d will be adjusted at run time,
depending on the size of the dispersion table, which will grow and shrink at run
time too.
hash image
0100101101010010100001011000110101110110101101011010010111011101
                                                         |--d--|
These d least significant bits are used to index the dispersion table. Each
component of the dispersion table points to a disk page that contains the
elements having keys with the same d least significant bits. It is important to
remark that they point to a single page, i.e. not to a page list. For d = 2 the
structure looks as follows:
Indeed, the elements stored in the first data page have keys with a hash
image that ends with 00, those in the second page end with 01, and so on. Such
an organization allows a cost effective duplication in size of the dispersion table,
which can be achieved by simply copying the contents of the current table into
the new half of the extended table:
[Figure: a dispersion table with entries 000 to 111 after doubling; each pair of entries that differ only in the new leading bit points to the same data page of (k,x) elements.]
So extended, the dispersion table still addresses the right pages, because
whether the new bit considered from the hash image of our key is 0 or 1, we get
the pointer to the page that contains the element with that key. The advantage
of having extended the dispersion table is that, in the case that a new element
has to be added to a page that has been found completely full, we can split that
page into two and redistribute its elements according to the newly considered
bit. As the number of elements that fit into a page is expected to be somewhat
large (for the realistic sizes considered in the example of section 4.9, each page
could contain about 50 data elements) and the hash function is expected to
generate bits with equal probabilities, we can expect that there will be room
enough to allocate the new element after splitting the page into two.
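The doubling and splitting mechanism can be modelled compactly; the following is an illustrative Python sketch for primary memory (PAGE_CAP and all names are assumptions, not the Ada implementation given later in this section):

```python
# Illustrative model of extendible hashing: a directory of 2**dg
# entries, each pointing to a page that holds the keys whose hash
# image ends with that page's dl-bit suffix.
PAGE_CAP = 4

class Page:
    def __init__(self, dl):
        self.dl = dl              # local depth
        self.keys = []

class ExtendibleSet:
    def __init__(self):
        self.dg = 0               # global depth
        self.dir = [Page(0)]      # directory of 2**dg pointers

    def _page(self, h):
        return self.dir[h % (1 << self.dg)]

    def put(self, k):
        h = hash(k)
        p = self._page(h)
        if k in p.keys:
            return
        while len(p.keys) >= PAGE_CAP:
            if p.dl == self.dg:           # directory must double first
                self.dir = self.dir + self.dir
                self.dg += 1
            p.dl += 1                     # split p on one more bit
            ps = Page(p.dl)
            bit = 1 << (p.dl - 1)
            p.keys, ps.keys = ([x for x in p.keys if not hash(x) & bit],
                               [x for x in p.keys if hash(x) & bit])
            for i in range(len(self.dir)):  # redirect half the entries
                if self.dir[i] is p and i & bit:
                    self.dir[i] = ps
            p = self._page(h)             # retry; may need another split
        p.keys.append(k)

    def contains(self, k):
        return k in self._page(hash(k)).keys
```

The loop in put captures the (unlikely) case discussed below in which one split is not enough to make room for the new element.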
In the example of the figure above, if a new element has to be added into
the third page, which is full, we can allocate a new page, move there all the
elements with keys having their hash image ending with 110, keep in the current
page those that end with 010, and insert the new element into the page that
corresponds to its key:
[Figure: the same dispersion table, entries 000 to 111, after splitting the third page: entries 010 and 110 now point to two different data pages of (k,x) elements.]
Indeed, any attempt to add a new element to a completely full page requires
splitting the page. However, not every page split requires extending the dispersion
table. For instance, the addition of a new element into the first page requires
splitting the page, because it is full too. However, the dispersion table does not
need to be extended again because the page still groups the keys on the basis of
the last two bits, whereas the dispersion table keeps pointers on the basis of the
last three bits.
To distinguish between the two situations, a new field, called the depth, is
associated to both the data pages and the dispersion table. The depth
of the dispersion table is called the global depth (dg) and is the number of bits
taken into account to index the table. Indeed, the size of the dispersion table
is b = 2^dg. The depth of a data page is called the local depth (dl) and is the
number of bits taken into account to group the elements in that page. In other
words, the keys of all the elements stored in a page with local depth dl have
hash images whose last dl bits are identical.
Given a data page with local depth dl, the insertion of a new element in
that page does not require a split of the page unless the page is found completely
full. If so, the page is split and the new local depth of both the newly allocated page
and the original page becomes dl' = dl + 1. If dl' > dg then the dispersion table
has to be extended and its new depth becomes dg' = dg + 1.
If the number of elements that fit into a page is reasonably large and the
hash function is reasonably good, for all pages dl will differ at most by one with
respect to dg. However, in general dl ≤ dg and the number of components of
the dispersion table that point to a page is 2^(dg-dl).
So, even though it may be unlikely, it is possible that very few keys have
had a hash image ending with 00 compared with the other possible values. If
so, we might reach the situation illustrated in the figure below, where B00
represents a data page that contains elements having keys with their hash image
ending with 00 (indeed, no other page may contain keys with their hash image
ending with 00). The numbers enclosed by square brackets represent the attached
depths.
In this situation, if a new insertion forces B00 to be split, its local depth will
be increased by one. Then, the old page and the new one shall be pointed from
the dispersion table as:
To remove an element, first it must be looked for. So, the hash function has
to be applied to the key to provide the hash image, and then the lowest
dg bits of that image are selected to obtain an index to the dispersion table, from which
we will get the pointer to the page containing the searched element, provided
that the element is in the set. If so, we shall remove it from that page and then
attempt to merge that page with another one, to release disk pages whenever
possible.
Let us assume that dg = 6 and that the lower dg bits of the hash image of k,
the key to be deleted, are 010101. Using this value as an index to the dispersion
table we will get a pointer to the page containing the element with key k, if this
element is in the set. Let us assume that it does, and that the local depth of
this page is dl = 4. Let us denote that page as B0101[4]. As dl < dg, several
components of the dispersion table will point to that page. To be precise, these
components are:
00 0101,  01 0101,  10 0101,  11 0101
If possible, the page B0101[4] should be merged with its split image, i.e. the
page B1101[4], to make the page B101[3]. The merging will be possible only if
B1101[4] does exist. It does not if it has been further split into B01101[5] and
B11101[5]. If so, the merging with B0101[4] is not possible until the latter two
pages are merged in turn. In addition to the fact that the split image exists,
the second requirement for merging is that the elements in both images fit in a
single page.
If B0101[4] has a split image B1101[4], that split image will be pointed from the
following positions in the dispersion table:
00 1101,  01 1101,  10 1101,  11 1101
So, we may select 00 1101 and get the page pointed from that position in the
dispersion table. If the local depth of that page is also 4, the split image does
exist. If the elements in B0101[4] and B1101[4] fit in a single page,
then the elements of B1101[4] must be copied into B0101[4] to make B101[3], and
the disk page that contained B1101[4] can be released.
Then, all the positions in the dispersion table that pointed to B1101[4] must
be updated to point to B101[3]. Indeed, after such a merging, it must be checked
whether B101[3] may still be merged with its split image B001[3], and so on.
On completion of merging, it must be checked whether the second half of the
dispersion table is equal to the first half. If so, dg may be reduced by one
(i.e. the dispersion table is reduced by one half). Then it must be checked whether the
new second half of the table is equal to the first one, and so on.
Indeed, it may happen that the dispersion table grows so much that a single
disk page is insufficient to keep it. For instance, if the page size is 2KB and
the pointer size is 32 bits, a disk page can contain at most 512 pointers (which
are addressed by 9 bit indexes). If dg becomes greater than this amount, the
dispersion table must be organized as a tree. Then the highest dg-9 bits of the
hash images are used to select a page and the lowest 9 bits are used to select a
component in that page. For instance, if dg = 11 and the lowest 11 bits of
the hash image of key k are 01101111010, those 11 bits are subdivided into the
highest two, to select a page, and the lowest 9, to select a component in that
page:
[Figure: a two-level dispersion table of global depth 11; the top two bits (01) select a directory page and the remaining nine bits (101111010) select, within that page, the pointer to the data page B1101111010[10].]
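The page and component selection above amounts to a couple of bit operations; an illustrative Python check with the values of the example:

```python
# Selecting a directory page and a component within it, for the example
# above: dg = 11, pages of 512 pointers addressed by 9-bit indexes.
dg = 11
page_bits = 9

hi = 0b01101111010                        # lowest dg bits of the hash image
page_index = hi >> page_bits              # highest dg-9 bits: 0b01
component = hi & ((1 << page_bits) - 1)   # lowest 9 bits: 0b101111010

print(page_index, format(component, "09b"))
```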
To sum up, extendible hashing is a very efficient way to store data in
secondary storage. When the dispersion table fits in a single disk page, any data
access requires exactly two disk accesses: one for the dispersion table and one
for the data page. For very large sets, the retrieval of the dispersion table takes
only a few additional disk accesses.
shown in the package below. The generic parameters have two slight differences
with respect to the previous approaches: namely, no size is provided and the
hash function has no bounding parameter.
generic
type key is private;
type item is private;
with function "="(k1, k2: in key) return boolean;
with function hash(k: in key) return natural;
package d_set is
-- Exactly as in section 4.2 (mappings).
private
max_data: constant:= 52; -- elements per data page (values derived below)
max_disp: constant:= 1024; -- pointers per dispersion page
type nodetype is (data_node, disp_node);
type node(t: nodetype);
type pnode is access node;
type elem is
record
k: key;
x: item;
end record;
type telem is array(1..max_data) of elem;
type tdisp is array(0..max_disp-1) of pnode;
type node(t: nodetype) is
record
case t is
when data_node =>
ne: natural; -- number of elements
dl: natural; -- local depth
te: telem; -- table of elements
when disp_node =>
td: tdisp; -- table of pointers
end case;
end record;
type set is
record
root: pnode;
dg: natural; -- global depth
end record;
end d_set;
As in the case of B+ trees, the lengths of the arrays should be derived from
the sizes of the involved types (pointers, keys, items) and the size of the
physical disk pages. So that these technicalities do not distract the reader from
the essential points, we have directly set some realistic values, assuming a disk
page size of 4KB, a pointer size of 4B, a key size of 8B and data items of 70B.
As we have done with B+ trees too, to keep a unified notation with the rest
of the book, the algorithms below describe the management of the nodes as if
they were allocated in dynamic memory. In any practical implementation, the
reader should replace any operation of the type p:= new node in the algorithms
below by a request to the operating system for a new disk page.
Empty
The empty set is represented by a null pointer:
procedure empty (s: out set) is
begin
s:= (null, 0);
end empty;
Put (single-page dispersion table)
We shall assume initially that the dispersion table is allocated in a single disk
page. Once this simple version is understood, we will describe how to deal with
a dispersion table expanding through several pages.
The operation put is divided in two levels. The top level deals with the
exceptional cases that the set is either completely empty or consists of a
single data page. The procedure normal_put deals with the usual case that the
root node is a dispersion table.
procedure put(s: in out set; k: in key; x: in item) is
root: pnode renames s.root;
dg: natural renames s.dg; -- global depth
p: pnode;
begin
if root=null then
root:= new node(data_node);
root.ne:= 0; root.dl:= 0;
insert_data(root, k, x);
elsif root.t=disp_node then
normal_put(s, k, x);
elsif is_in(root, k) then -- (root.t=data_node)
raise already_exists;
elsif root.ne<max_data then
insert_data(root, k, x);
else -- create the dispersion table
p:= new node(disp_node);
p.td(0):= root; root:= p; dg:= 0;
normal_put(s, k, x);
end if;
end put;
If the elements of the set can be placed in a single page, there is no need to
build a dispersion table on top. When a single page is insufficient, a new
page is requested to allocate the dispersion table. That table is initially created
with a single entry, the current data page, and the new element causing its
creation is then inserted according to the algorithm for the normal case, which
is as follows:
procedure normal_put(s: in out set; k: in key; x: in item) is
root: pnode renames s.root;
dg: natural renames s.dg; -- global depth
p, ps: pnode; -- data page and its split image
hi: natural; -- hash image of k
hs: natural; -- hash bits of higher sibling's image after split
completed: boolean;
begin
hi:= hash(k); completed:= false;
while not completed loop
p:= get_data_page(root, dg, hi);
if is_in(p, k) then raise already_exists; end if;
if p.ne<max_data then
insert_data(p, k, x);
completed:= true;
else
ps:= new node(data_node); split(p, ps, hs);
if p.dl>dg then extend(root, dg); end if;
insert_disp(root, ps, dg, hs);
end if;
end loop;
end normal_put;
Initially, the hash image of the key is computed, and then the pointer to
the data page where the new element is to be inserted is obtained from the
dispersion table in the root node, according to the lowest dg bits of the hash
image. If there is room enough in that page to allocate the new element, it is
placed there and that completes the operation. If the selected page is full, that
page must be split, i.e. the elements contained in that page have to be distributed
into two pages (the original p and its sibling ps), according to an additional
bit of the hash images of their keys (hitherto they were allocated according to
the dl lower bits of their hash images, and now they will be allocated according
to the dl+1 lower bits). Indeed, the local depth of the two new pages is one
more than the local depth of the original page. If the latter value is greater
than the global depth, then the global depth must be increased by one
too, and the dispersion table has to be extended to twice its size by copying the
pointers currently in the table into the expanded area. Whether or not the table has
been extended, the pointer to the created sibling page ps has to be
allocated into the dispersion table, at the position corresponding to the lower
dl bits of the keys currently allocated in it, which is provided by hs.
142
inserted correspon~s to the full page. Therefore, the insertion of the new element
must be deferred to the next step of the iteration, because it might be required
to split the full page again. However, in practice, this is extremely unlikely. If
the hash function is good, the probability that a page containing n elements is
split into an empty page and a full page is (1/2t-l and the probability that the
new element can not be allocated after the first split is (1/2(. For the realistic
sizes of our type declaration, this probability is 2.22. w-lH
The operation that selects from the dispersion table the pointer to the
appropriate data page, according to the hash image of a given key, is:
function get_data_page
(root: in pnode; dg: in natural; hi: in natural) return pnode is
hd: natural;
p: pnode;
begin
hd:= hi mod 2**dg; p:= root.td(hd);
return p;
end get_data_page;
The algorithm for splitting a page into two is as follows. The new local depth
of the two pages is one more than the old one, because now the elements will be
distributed according to one more bit. The number of elements of the sibling
page is initially zero. Then, the hash images of the current keys are computed
again and, if the most significant bit (among those that correspond to the new
local depth) is one, the element is moved from p to ps. Finally, the lower dl bits
that are common to the hash images of the keys that are currently in ps are set into
hs and returned as an out parameter.
procedure split
  (p: in pnode; ps: out pnode; hs: out natural) is
   hi: natural;     -- hash image of inspected keys
   hd: natural;     -- lower p.dl bits of hi
   s, s2: natural;  -- divisors to select lower bits of hi (s2 = s/2)
   i: natural;      -- index to elements of p
begin
   p.dl:= p.dl + 1; ps.dl:= p.dl; ps.ne:= 0;
   s:= 2**p.dl; s2:= s/2;
   i:= 0;
   while i<p.ne loop
      i:= i + 1;
      hi:= hash(p.te(i).k); hd:= hi mod s;
      if hd>=s2 then
         ps.ne:= ps.ne + 1; ps.te(ps.ne):= p.te(i);
         p.te(i):= p.te(p.ne); p.ne:= p.ne - 1;
         i:= i - 1;
      end if;
   end loop;
   hs:= hd; if hs<s2 then hs:= hs + s2; end if;
end split;
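The bit test that drives the redistribution can be mimicked in Python (a sketch of the selection only; the pnode bookkeeping of the Ada version is omitted, and hash_fn stands in for the package's hash):

```python
def redistribute(keys, dl, hash_fn):
    """Split the keys of a page whose new local depth is dl: an element
    moves to the sibling page when the dl-th lowest bit of its hash
    image is one, i.e. when hash mod 2**dl >= 2**(dl-1)."""
    s, s2 = 2 ** dl, 2 ** (dl - 1)
    stay, move = [], []
    for k in keys:
        (move if hash_fn(k) % s >= s2 else stay).append(k)
    return stay, move

# identity hash, new local depth 2: bit 1 of the key decides the page
print(redistribute([0, 1, 2, 3, 4, 5], 2, lambda k: k))
# -> ([0, 1, 4, 5], [2, 3])
```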
The procedure update is exactly the same as get, but replacing get_data by
update_data.
Remove (single page dispersion table)
The procedure remove deals with the special case in which no dispersion table
exists, whereas the procedure normal_remove deals with the case in which it does.
The former is rather simple:
procedure remove(s: in out set; k: in key) is
   root: pnode renames s.root;
   dg: natural renames s.dg;  -- global depth
   found: boolean;
   i: natural;  -- position of the element with key k in root, if found
begin
   if root=null then
      raise does_not_exist;
   elsif root.t=data_node then
      search(root, k, found, i);
      if not found then raise does_not_exist; end if;
      remove_data(root, i);
      if root.ne=0 then root:= null; end if;
   else  -- root.t=disp_node
      normal_remove(s, k);
      if dg=0 then root:= root.td(0); end if;
   end if;
end remove;
Instead, normal_remove requires a more detailed explanation. The first part
is straightforward: it looks for the element to remove, in a way identical to the
procedure normal_get, and deletes it from its data page, provided that it has
been found.
If the element has been found and actually removed, the second part of
normal_remove attempts to merge the updated page with its split image, if
possible. This possibility depends on two facts: first, such a split image does
exist (i.e. the original sibling has not been further split), and second, there is
room enough in a single page to place the elements currently stored in the two
pages. Such merging must proceed as long as possible. As a result of previous
remove operations, it may happen that some pages have become even empty,
but there has not been the chance to merge them because their split image
did not exist. Consequently, it may happen that as a result of merging two
pages, the new page has a split image that is almost empty, or even completely
empty, and now there is the chance to merge it. Whenever a merge operation
is completed, the pointer to the merged page must be placed at the required
positions in the dispersion table, which is achieved by the procedure insert_disp
described above.
If some pages have been merged, the third step of normal_remove checks if
the dispersion table can be reduced to one half. This is possible if the contents
of the second half are equal to the contents of the first half.
According to these considerations, the algorithm for normal_remove is:
procedure normal_remove(s: in out set; k: in key) is
   root: pnode renames s.root;
   dg: natural renames s.dg;  -- global depth
   p, ps: pnode;  -- data page and its split image
   hi: natural;   -- hash image of k
   adjusted: boolean;
   found: boolean;
   i: natural;
begin
   hi:= hash(k);
   p:= get_data_page(root, dg, hi);
   search(p, k, found, i);
   if not found then raise does_not_exist; end if;
   remove_data(p, i);
   adjusted:= false;
   ps:= split_image(root, p, hi);
   while p.dl>=1 and ps.dl=p.dl and ps.ne+p.ne<=max_data loop
      merge(p, ps);
      insert_disp(root, p, dg, hi);
      adjusted:= true;
      ps:= split_image(root, p, hi);
   end loop;
   if adjusted then
      while dg>0 and then equal_halves(root, dg) loop
         dg:= dg-1;
      end loop;
   end if;
end normal_remove;
The function split_image looks in the dispersion table for a pointer to the
split image of a given page p. However, the pointer provided is actually the split
image of p only if its local depth is equal to that of p; otherwise it means that
it has been further split.
The page whose split image we look for is either the page from which we
have deleted an element or the result of a previous merge operation. In both
cases, we know the lower bits of the hash images of the keys that it contains. If
hi is the hash image of the deleted element, it means that the page corresponds
to all keys having the lower dl bits of their hash images equal to those of hi.
Therefore, the split image is the page associated to keys having their hash images
with the lower dl-1 bits equal to those in hi but the dl-th bit opposite to that
of hi. If that page has the same local depth and there is room enough to allocate
the elements currently in both pages into a single one, the new page will have
local depth one less, as the common lower bits of the hash images of the keys
contained in the page that results from the merge operation are one less than
in the merged pages.
Consequently, the resulting algorithm is:
function split_image(root, p: in pnode; hi: in natural) return pnode is
   td: t_disp renames root.td;
   s: natural;
   hd: natural;
begin
   s:= 2**p.dl; hd:= hi mod s; s:= s/2;
   if hd>=s then hd:= hd-s; else hd:= hd+s; end if;
   return td(hd);
end split_image;
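The index arithmetic amounts to flipping the most significant of the dl selected bits; a Python sketch of just that computation:

```python
def sibling_index(hi: int, dl: int) -> int:
    """Index of the split image: the lower dl bits of hi with the
    most significant of those dl bits flipped."""
    s = 2 ** dl
    hd = hi % s
    s2 = s // 2
    return hd - s2 if hd >= s2 else hd + s2

print(sibling_index(0b101, 3))  # -> 1 (0b001)
print(sibling_index(0b001, 3))  # -> 5 (0b101)
```

Applying the function twice yields the original dl bits back, as expected of a sibling relation.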
The operation merge just adds to p the elements in ps. The counter in ps is
not adjusted because that page will be released immediately.
procedure merge(p, ps: in pnode) is
   i: natural;
begin
   i:= ps.ne;
   while i>0 loop
      p.ne:= p.ne+1; p.te(p.ne):= ps.te(i);
      i:= i-1;
   end loop;
   p.dl:= p.dl-1;
end merge;
The function that checks if the two halves of the dispersion table are equal
is straightforward too.
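The check amounts to comparing td(i) with td(i + s) for every i in the first half, where s is half the current table size. A Python sketch of the same test, over a plain list standing in for the dispersion table:

```python
def equal_halves(td, dg):
    """True if the second half of the 2**dg-entry table repeats the
    first half, so the table can be reduced to half its size."""
    s = 2 ** (dg - 1)
    return all(td[i] == td[i + s] for i in range(s))

print(equal_halves(['a', 'b', 'a', 'b'], 2))  # -> True
print(equal_halves(['a', 'b', 'b', 'a'], 2))  # -> False
```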
The subprograms is_in and update_data are left as exercises to the reader.
Unbounded dispersion table
For large sets, the size of a disk page may become insufficient to keep the
dispersion table. Therefore, as described in the overview, the dispersion table
is organized as a hierarchy of pages. The higher order bits of a hash
image then serve to index a table that contains pointers to another page, which
contains another table that is indexed by the lower bits of the hash image.
Indeed, if required, such a hierarchy can expand to more than two levels.
This improvement requires no change in the type declaration and the procedure empty is also the same.
Put (unbounded dispersion table)
The procedures put and normal_put are exactly the same as above. Only the
subprograms get_data_page, extend and insert_disp must be reconsidered.
Regarding get_data_page, now the lower dg bits of the hash image have to
be decomposed into bit groups, each group having as many bits as required to
index the table that fits into a disk page. In our type declaration, this amount
is indicated by the constant bdi_disp (for binary digits to index a table in a
dispersion page). Indeed, the highest order of these groups may have fewer than
bdi_disp bits and is used to index the root table to get a pointer to a dispersion
page that will be indexed by the second bit group, and so on. The next figure
illustrates such a decomposition of a hash image:
[figure: the lower dg bits of hi split into groups of bdi_disp bits, the highest
(possibly shorter) group indexing the root table]
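The decomposition can be sketched in Python: the dg low-order bits of hi are consumed in groups of bdi_disp bits, and the leftover high-order group (possibly shorter) indexes the root (a sketch of the bit arithmetic, not the book's Ada):

```python
def bit_groups(hi, dg, bdi_disp):
    """Split the dg lowest bits of hi into index groups of bdi_disp
    bits each, returned from the root level (highest bits) downwards.
    The first group may be shorter when dg is not a multiple of
    bdi_disp."""
    hd = hi % (2 ** dg)
    groups = []
    while dg > 0:
        low = min(bdi_disp, dg)        # peel off the lowest group
        groups.append(hd % (2 ** low))
        hd //= 2 ** low
        dg -= low
    return groups[::-1]                # root group first

# dg = 5, 2-bit groups: 0b1_10_11 -> root sees 1, then 0b10, then 0b11
print(bit_groups(0b11011, 5, 2))  # -> [1, 2, 3]
```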
In the simple version, the procedure extend just had to increase the size
of the dispersion table to twice its size, by copying the pointers currently in the
table to the new half. The same principle applies when the dispersion table is
organized in several pages. Indeed, if the root is completely filled, the increase
of the dispersion table forces to expand it to a sibling dispersion page and to
build a new root on top of both. Whether or not a new root has been created,
the second half of the new table has to be filled with a copy of the first half. If
these contents are pointers to data pages, those pointers have to be copied.
However, if they are pointers to other dispersion pages, new dispersion pages
have to be created, with a copy of their siblings. The procedure copy is in charge
of doing it recursively. To distinguish the two cases, the procedure copy receives a
parameter st (for steps) indicating its level in the tree of pages that implement
the dispersion table. The resulting algorithm is:
procedure extend(root: in out pnode; dg: in out natural) is
   p: pnode;
   s, sp: natural;  -- current sizes of td and the root page's table
   st: natural;     -- height of dispersion tree
begin
   st:= dg/bdi_disp;  -- number of steps
   if dg>0 and dg mod bdi_disp = 0 then  -- disp. tree gets one level higher
      p:= new node(disp_node); p.td(0):= root; root:= p;
   end if;
   s:= 2**dg; sp:= s/(max_disp**st);
   for i in 0..sp-1 loop
      copy(root.td(i), st, p);
      root.td(sp+i):= p;
   end loop;
   dg:= dg+1;
end extend;
where
procedure copy(p: in pnode; st: in natural; cp: out pnode) is
   q: pnode;
begin
   if st=0 then  -- the pointed node is a data node
      cp:= p;
   else
      cp:= new node(disp_node);
      for i in 0..max_disp-1 loop
         copy(p.td(i), st-1, q);
         cp.td(i):= q;
      end loop;
   end if;
end copy;
The logical positions to insert a pointer to a data page in the dispersion
table are the same as for the simple case. However, the conversion of a logical
position into a physical one is more complex, and it is achieved by the procedure
insert_disp_single.
The algorithm that converts the logical position indicated by hps into the proper
component of the proper dispersion page is the same one used by
get_data_page, discussed above, except that in this case the pointer is set into the
table instead of retrieved from it.

The only difference in the procedure remove with respect to the bounded case
is that the statement that eliminates the dispersion table after the call to
normal_remove (if on return dg = 0) is no longer required, because such a shrink in
the height of the dispersion tree is now part of the general case:
procedure remove(s: in out set; k: in key) is
   root: pnode renames s.root;
   dg: natural renames s.dg;  -- global depth
   found: boolean;
   i: natural;
begin
   if root=null then
      raise does_not_exist;
   elsif root.t=data_node then
      search(root, k, found, i);
      if not found then raise does_not_exist; end if;
      remove_data(root, i);
      if root.ne=0 then root:= null; end if;
   else  -- root.t=disp_node
      normal_remove(s, k);
   end if;
end remove;
Indeed, the operation that shrinks the height of the page tree that implements
the dispersion table is now included in the procedure normal_remove.
The decrease in height occurs when the decrease of the global depth leaves the
size of the table at the root node equal to one. In other words, whenever
dg mod bdi_disp = 1 when the table is reduced to one half of its size:
procedure normal_remove(s: in out set; k: in key) is
   root: pnode renames s.root;
   dg: natural renames s.dg;  -- global depth
   p, ps: pnode;  -- data page and its split image
   hi: natural;   -- hash image of k
   adjusted: boolean;
   found: boolean;
   i: natural;
begin
   hi:= hash(k);
   p:= get_data_page(root, dg, hi);
   search(p, k, found, i);
   if not found then raise does_not_exist; end if;
   remove_data(p, i);
   adjusted:= false;
   ps:= split_image(root, p, dg, hi);
   while p.dl>=1 and ps.dl=p.dl and ps.ne+p.ne<=max_data loop
      merge(p, ps);
      insert_disp(root, p, dg, hi);
      adjusted:= true;
      ps:= split_image(root, p, dg, hi);
   end loop;
   if adjusted then
      while dg>0 and then equal_halves(root, dg) loop
         if dg mod bdi_disp = 1 then root:= root.td(0); end if;
         dg:= dg-1;
      end loop;
   end if;
end normal_remove;
Besides these small differences, and the fact that the invocations of the
subprograms get_data_page and insert_disp correspond to the new versions
discussed for the unbounded version of put, the subprograms to get the split image
of a page and to check if the two halves of the dispersion table are equal have
some differences too.
The function to get the split image of a page is essentially the same as for
the bounded case. However, in the unbounded case, the bits that correspond
to the split image (hd) cannot be used directly as an index to the dispersion
table. Instead, these bits indicate a logical position, and its conversion into a
proper position of a proper dispersion page has to be made through the function
get_data_page. As this latter function requires the global depth as an input
parameter, the function to get the split image requires this input parameter as
well. Consequently, the algorithm that results for the function split_image is
function splitjmage
(root, p: in pnode; dg: in natural; hi: in natural) return pnode is
td: t_disp renames root.td;
s: natural;
hd: natural;
ps: pnode;
begin
s:= 2**p.dl; hd:= hi mod s; s:= s/2;
if hd>=s then hd:= hd-s; else hd:= hd+s; end if;
ps:= geLdata...page(root, dg, hd);
return ps;
end splitjmage;
To check if the two halves of the dispersion table are equal, the two halves
of the table in the root page must be compared. Two components that point to
a data page are equal if they point to the same page, but two components that
point to a dispersion page are equal if all the components of the pointed pages
are themselves equal. The function equal is in charge of checking it recursively.
It receives a parameter indicating the level in the dispersion tree to distinguish
between the two cases, like the procedure copy discussed above. Consequently:
function equal_halves(root: in pnode; dg: in natural) return boolean is
   td: t_disp renames root.td;
   dr: natural;  -- binary digits to index the root's dispersion table
   st: natural;  -- number of steps
   s: natural;   -- half the current size of the root's dispersion table
   i: natural;
begin
   st:= dg/bdi_disp; dr:= dg mod bdi_disp;
   if dr = 0 then st:= st-1; dr:= bdi_disp; end if;
   s:= 2**(dr-1); i:= 0;
   while i<s and then equal(td(i), td(i+s), st) loop i:= i+1; end loop;
   return i=s;
end equal_halves;
where
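the recursive comparison described above can be sketched in Python, with nested lists standing in for dispersion pages (a sketch, not the book's Ada):

```python
def equal(p, q, st):
    """Compare two dispersion-tree entries: at st = 0 they point to
    data pages (compare the pointers themselves); above that level,
    compare all components of the pointed dispersion pages."""
    if st == 0:
        return p is q
    return all(equal(a, b, st - 1) for a, b in zip(p, q))

d1, d2 = object(), object()          # two distinct "data pages"
print(equal([d1, d2], [d1, d2], 1))  # -> True
print(equal([d1, d2], [d2, d1], 1))  # -> False
```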
For the bounded case, it is obvious that the operation empty is O(1) and that
the operation to get an element requires only two disk accesses: one to get the
dispersion table and another to get the data. The operation to update requires
an additional access to write the page back to the disk. Finally, the operations
to put and remove require at most five disk accesses because the involved pages
are at most three (the dispersion table and the two split images), of which
two must be read first and later written back. So we may consider that any of
these operations is O(1).
For an unbounded dispersion table, the pages involved in a retrieval operation
are the data page plus the pages in the path from the root to that data
page. The height of the dispersion tree grows with the logarithm of the number
of data pages, the base of this logarithm being the number of pointers that fit
into a disk page. However, this number is so large that it makes the tree flat
enough to consider its height bounded by a constant in practice.

For instance, for the same sizes considered for the example of B+ trees
(100,000 elements to be stored, disk pages of 4KB, pointers of 4B and a maximum
of 52 data elements per disk page), the dispersion table would consist
of a root node containing 4 pointers and a single level of secondary dispersion
pages. Therefore, a retrieval operation would involve three disk pages. This
height would be increased by one only if the number of elements to be stored
were greater than 26·10^6.
4.11.6
Hash functions
The example of a hash function that was used to introduce the general
principles of hashing techniques in section 4.11.1 consisted in adding the bytes that
represent a key and then applying the modulus to get a result in the proper
range.
If the standard 8-bit character set is used, the value of the constant m
will be 256. However, if an extended character set is used instead, this base
could be larger.

This approach is, in general, better than simple addition. Nevertheless, if b
is a power of 2 it turns out much worse, because then the hash image depends only
on the lowest binary digits of the original key. In particular, if b = 256, all
the names ending with the same letter would hash to the same slot.
Therefore, if such an approach is used, it is most advisable that b be a
prime number, so that the modulus forces an effective division and the result
depends on all the bits in the key. In any case, values of b that are a power
of two should be completely discarded.
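The effect of the choice of b can be seen in a Python sketch of such a byte-by-byte hash (m = 256; the prime 251 is an arbitrary illustrative choice, not taken from the book):

```python
def poly_hash(key: str, b: int, m: int = 256) -> int:
    """Treat the key's bytes as digits of a base-m number, reduced mod b."""
    h = 0
    for ch in key.encode():
        h = h * m + ch
    return h % b

# with b = 256 (a power of two) only the last byte survives:
print(poly_hash("cat", 256), poly_hash("bat", 256))  # both 116, ord('t')

# with a prime b the whole key contributes:
print(poly_hash("cat", 251), poly_hash("bat", 251))
```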
Quadratic hashing
A better approach to achieve that the hash image of a key depends on all the
bits representing the key is to square it and take the middle digits. In other
words: h(k) = ⌊k²/c⌋ mod b, c being an integer such that b·c² ≈ k².
The implementation of the square is the same algorithm that schoolchildren use
to multiply, but using base 256 instead of 10. The figure below illustrates the
distribution of the digits. The underbraced digits indicate the digits selected to
apply the modulus b.
[figure: layout of the base-256 digits of k × k = k²; the underbraced middle
digits are those selected to apply the modulus b]
      i:= i-1;
   end loop;
   r(i):= c;
end loop;
c:= (r(n)*m + r(n+1)) mod b;
return c;
end hash;
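The same middle-digit selection can be written compactly in Python on ordinary integers (a sketch; the book's Ada works digit by digit in base 256, and the slot count b = 1009 and divisor c = 2¹⁶ here are arbitrary illustrative choices):

```python
def mid_square_hash(k: int, b: int, c: int) -> int:
    """h(k) = floor(k*k / c) mod b: the division drops the low digits
    of the square, the modulus drops the high ones."""
    return (k * k // c) % b

b = 1009          # number of slots (illustrative prime)
c = 2 ** 16       # discard the 16 low bits of the square
for k in (123456, 123457):      # close keys, spread-out images
    print(mid_square_hash(k, b, c))
```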
Multiplicative hashing
Instead of multiplying the key by itself, there is the possibility of multiplying it
by a fixed weighting factor. In [14] it is suggested to use as weighting factor
as many significant digits as the key has from (√5 − 1)/2 = 0.6180339887...,
because those digits have the same properties as randomly generated numbers.
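A Python sketch of the method (b = 1024 slots is an arbitrary illustrative size):

```python
A = (5 ** 0.5 - 1) / 2          # 0.6180339887..., the suggested factor

def mult_hash(k: int, b: int) -> int:
    """Multiply by A, keep the fractional part, scale to b slots."""
    frac = (k * A) % 1.0
    return int(b * frac)

b = 1024
print([mult_hash(k, b) for k in range(1, 6)])  # consecutive keys scatter
```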
Hash facilities in the Ada language
Hash techniques are so important that the predefined language environment
supplies a hash function. The type Hash_Type, to represent hash images, is
declared in the package Ada.Containers as a modular type (i.e. an unsigned
integer with modular arithmetic) with a range that is implementation defined,
usually depending on the facilities provided by the underlying hardware, as
for integer and floating point arithmetic.
The function Hash is declared as

with Ada.Containers;
function Ada.Strings.Hash(Key: String) return Containers.Hash_Type;

The result can be constrained to the proper range by applying the convenient
modulus to the provided result.
4.11.7
Constraints on keys
In the case that the keys are not strings of characters, it is usual to take
the set of bytes of their binary representation as strings. For fixed length
types, this goal can be achieved using the appropriate instance of the generic
function Ada.Unchecked_Conversion, which is declared as:

generic
   type Source(<>) is limited private;
   type Target(<>) is limited private;
function Ada.Unchecked_Conversion(S: Source) return Target;

where the symbols (<>) indicate that unconstrained types are allowed as actual
parameters for the instantiation.
For Source, we shall use the key type to which a hash function has to be
applied, and for Target we shall use a string type having the same size in bytes
as our key type. The instance of this generic function will then return a string
having the same bytes as the key provided as parameter. Such a string can
be passed as a parameter to any of the hash functions described above.
However, we can apply hashing only to types having a unique binary
representation. Let us assume, for example, that we have a set S whose elements
are sets in turn, i.e. x ∈ S means that x ⊆ A for some A. If the subsets of
A are represented by an array of boolean components, the hash technique can
be applied to implement S, because two arrays with different component values
represent different subsets of A. Instead, if the subsets of A are represented by
arrays of elements, the subset {a, b, c} can be represented in several ways, for
instance the two in the figure below:
[figure: two arrays with counter 3 representing the same subset: (a, b, c)
and (c, a, b)]
It is clear that a hash function will map these two representations of the
same set to different slots, making the technique unusable. Even if the sets
are ordered, there still exists the difficulty of the non-significant trailing
components. In these cases, specifically tailored algorithms have to be applied to
discard the non-relevant components.
4.12
Iterators
4.12.1
Introduction
for x ∈ A loop
   ...
end loop;
An iterator is a data type that allows such a retrieval, keeping the independence
of the algorithm that uses the sets with respect to the technique used to
implement them. Such a type works in a way similar to file types, i.e. it acts
as a kind of index to a larger structure and has operations to point to the first
element in the set, to point to the next element, to check if no more elements
exist and to get the pointed element.
If the traversal of the elements in the set is required, the type set must be
extended with the following declarations:
generic
If the parameter it refers to the last element of the set, on return from next it
has a non-valid value. Likewise, if the set is empty, the procedure first returns a
non-valid value of iterator too. The procedures next and get raise the exception
bad_use if the actual parameter for it has a non-valid value.
With these operations available, the traversal of the set becomes
procedure traversal(s: in set) is
k: key; x: item;
it: iterator;
begin
first(s, it);
while is_valid(it) loop
get(s, it, k, x);
next(s, it);
end loop;
end traversal;
and the search for an element in the set satisfying some property P is
procedure search(s: in set) is
k: key; x: item; found: boolean;
it: iterator;
begin
first(s, it); found:= false;
while is_valid(it) and not found loop
get(s, it, k, x);
if P(k, x) then found:= true; else next(s, it); end if;
end loop;
if found then success(k, x); else failure; end if;
end search;
For keys having a total order defined, the retrieval of the keys should be in
ascending order inasmuch as possible. As we shall see, most of the
implementations make such an ordered retrieval possible. Ordered retrieval has several
advantages. In particular, the implementation of the operations of
union, intersection and difference, discussed in section 4.13, can be more efficient.
If pure sets are required, instead of mappings, the declarations are exactly
the same, but for the fact that the procedure get does not have the out parameter
of type item; the details are left as an exercise for the reader.
The rest of the section addresses the way the iterators have to be implemented
according to the way the sets themselves are implemented.
4.12.2
For set implementations based on keys that belong to a discrete type that is used
to index an array (see sections 4.3.1 and 4.3.2), the iterators can be implemented
straightforwardly by the key itself. If that key is provided as a generic parameter,
it is not possible to extend it with an extra value to identify a non-valid iterator.
Therefore, under such a restriction, a boolean field has to be added to indicate
the iterator's validity. Consequently, the type declaration should be:
type iterator is
   record
      k: key;
      valid: boolean;
   end record;
The operations first and next have to provide the lowest and next keys
recorded respectively, if any. The function is_valid just has to check the
corresponding field in the parameter it. Retrieving the data pointed by the iterator
is immediate too. These simple algorithms are:
procedure first(s: in set; it: out iterator) is
   e: existence renames s.e;
   k: key renames it.k;
   valid: boolean renames it.valid;
begin
   k:= key'first;
   while not e(k) and k<key'last loop
      k:= key'succ(k);
   end loop;
   valid:= e(k);
end first;
procedure next(s: in set; it: in out iterator) is
   e: existence renames s.e;
   k: key renames it.k;
   valid: boolean renames it.valid;
begin
if not valid then raise bad_use; end if;
if k<key'last then
k:= key'succ(k);
while not e(k) and k<key'last loop
k:= key'succ(k);
end loop;
valid:= e(k);
else
valid:= false;
end if;
end next;
function is_valid(it: in iterator) return boolean is
begin
   return it.valid;
end is_valid;
procedure get(s: in set; it: in iterator; k: out key; x: out item) is
   c: contents renames s.c;
   valid: boolean renames it.valid;
begin
if not valid then raise bad_use; end if;
k:= it.k; x:= c(k);
end get;
4.12.3
Arrays of elements
For sets implemented by arrays of elements (see section 4.4), the iterator can
be implemented as an index to the array. The special value 0 can be used to
represent a non-valid iterator. The algorithms are so simple that they are left as an
exercise to the reader. Indeed, the elements will be retrieved in ascending order
only if the keys have an order relationship defined and the set is represented by
a sorted array.
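A Python sketch of such an index-based iterator over a sorted array (0 marks a non-valid iterator and positions are 1-based, following the Ada convention; the class and method names are illustrative):

```python
class ArraySet:
    """Set as a sorted array of (key, item) pairs; iterator = index."""
    def __init__(self, pairs):
        self.a = sorted(pairs)

    def first(self):
        return 1 if self.a else 0      # 0 = non-valid iterator

    def next(self, it):
        if it == 0:
            raise ValueError("bad_use")
        return it + 1 if it < len(self.a) else 0

    def get(self, it):
        if it == 0:
            raise ValueError("bad_use")
        return self.a[it - 1]

s = ArraySet([(3, 'c'), (1, 'a'), (2, 'b')])
it, keys = s.first(), []
while it != 0:
    keys.append(s.get(it)[0])
    it = s.next(it)
print(keys)  # -> [1, 2, 3], i.e. ascending order
```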
4.12.4
Linked lists
If the sets are implemented by linked lists, the iterator can be just a pointer
to the current element, if any. The related algorithms are straightforward and
reproduced below. Indeed, the elements will be retrieved in ascending order
only if the keys have an order relationship defined and the set is represented by
a sorted list.
type iterator is
record
p: pcell;
end record;
4.12.5
Binary search trees
If a set is implemented as a binary search tree, the ordered retrieval of its
elements corresponds to the inorder traversal of the tree. However, the recursive
approach to such a traversal is inappropriate for two reasons. First, we want
to keep the external view of the set independent of any particular
implementation, so the external interface must remain unchanged with respect to any
other implementation. Second, some applications on sets require a coordinated
advance on several sets. For instance, the algorithms for computing the union
or the intersection of two sets can be best achieved by merging algorithms on
sorted sequences, as those studied for sequential files in sections 8.5 and 8.6
of vol. II and section 5.3 of vol. III. Such coordinated advancement over two
different sets cannot be achieved on the basis of the recursive algorithm.
Therefore, the information provided by the iterator must be equivalent to
the information passed from one step to the next in the iterative version of the
inorder traversal of the tree. Binary search trees only have pointers from a
parent to its children, so the operations parent and is_left cannot be implemented.
Therefore, the information passed from one step to the next is precisely the stack
used in the third version of the iterative traversal described in section 3.7.5 of
vol. III, and the resulting type declaration is:
with dstack;
generic
   : -- Exactly as in section 4.6.
package d_set is
   : -- Exactly as in section 4.12.1.
private
The body of the procedure first corresponds to the preparation phase previous
to the main loop of the third version of the iterative traversal described
in section 3.7.5 of vol. III. If the tree is empty, the stack is left empty as well.
Otherwise, the first element to be extracted is looked for, i.e. the lowest one. As
this element is in the leftmost node of the tree, this is achieved by traversing
its left spine. At each step, the parent node is pushed onto the stack because that
node is the next one to be visited after the complete traversal of its left child.
Finally, the leftmost child is pushed onto the stack as well, because the algorithm
expects to find the element to be retrieved on top of the stack.
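The left-spine descent of first, together with the companion advance of next, can be sketched in Python with an explicit stack (a sketch, not the book's Ada; a small Node class stands in for pnode, and a plain list for dstack):

```python
class Node:
    def __init__(self, k, lc=None, rc=None):
        self.k, self.lc, self.rc = k, lc, rc

def first(root):
    """Push the left spine; the top of the stack is the node to visit."""
    st, p = [], root
    while p is not None:
        st.append(p)
        p = p.lc
    return st

def next_step(st):
    """Pop the visited node, then push the left spine of its right child."""
    p = st.pop()
    q = p.rc
    while q is not None:
        st.append(q)
        q = q.lc

t = Node(2, Node(1), Node(3))
st, out = first(t), []
while st:                      # empty stack = non-valid iterator
    out.append(st[-1].k)
    next_step(st)
print(out)  # -> [1, 2, 3]
```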
The procedure next must obtain the next element to be retrieved. So, first
the current element is popped from the stack. If the popped element has a right
child, the next element to be visited is the lowest one (i.e. that in the leftmost
node) of its right subtree. This node is computed as in first, but starting at
the root of the right subtree of the popped element. Otherwise, i.e. if the current
element has no right child, the next element to be visited is already on top of
the stack. Indeed, if the iterator is not valid, i.e. the stack is empty, the attempt
to get its top will cause a dnodestack.bad_use exception to be raised. Such an
exception must be reconverted into d_set.bad_use.
procedure next(s: in set; it: in out iterator) is
   st: stack renames it.st;
   p: pnode;
begin
   p:= top(st); pop(st);
   if p.rc/=null then
      p:= p.rc;
      while p.lc/=null loop
         push(st, p); p:= p.lc;
      end loop;
      push(st, p);
   end if;
exception
   when dnodestack.bad_use => raise d_set.bad_use;
end next;
Checking the validity of the iterator is as simple as
function is_valid(it: in iterator) return boolean is
st: stack renames it.st;
begin
return not is_empty(st);
end is_valid;
The retrieval of the element pointed by the iterator is simple as well.
4.12.6
Overview
The complexity of the iterators for binary search trees can be reduced by
improving the representation of binary search trees. In a binary tree, each node
contains two pointers and is pointed to by a single one. Therefore, exactly one half
of the pointers in a non-empty binary search tree are null. This is somewhat
wasteful, to the extent that a single bit is sufficient to tell if a pointer is null.
A better way to use this memory space is to keep a boolean (in principle,
a single bit) to tell if a pointer is null (i.e. it does not address a child) and,
if so, use the pointer field for the right child to indicate the next node to be
visited in the inorder traversal of the tree. Likewise, the pointer field for the
left child can be used to point to the previous one. These latter pointers are
called threads. Threads allow the iterator type to be simply a pointer to a node.
For a node p, if p.rc is effectively a pointer to a right child, then the next node
to be visited is the leftmost one in this right child, which can be found by a
simple algorithm. Otherwise p.rc is a thread that links directly to the node to
be visited immediately after p.
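The two cases of the successor computation can be sketched in Python (a sketch, not the book's Ada: an explicit boolean flag marks a thread, where the Ada version will pack it into the sign of a cursor):

```python
class TNode:
    def __init__(self, k):
        self.k = k
        self.lc = self.rc = None            # child pointer or thread target
        self.lthread = self.rthread = True  # True: the field is a thread

def successor(p):
    """Next node in inorder: follow the right thread directly, or
    descend to the leftmost node of the right child."""
    if p.rthread:
        return p.rc
    q = p.rc
    while not q.lthread:
        q = q.lc
    return q

# tree 2(1, 3): 1's right field is a thread to 2, 2's children are real
a, b, c = TNode(1), TNode(2), TNode(3)
b.lc, b.lthread = a, False
b.rc, b.rthread = c, False
a.rc, a.rthread = b, True      # thread to the inorder successor
print([n.k for n in (a, successor(a), successor(successor(a)))])
```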
Threaded trees can be built either with cursors or pointers, although the use
of cursors makes it easier to pack the pointer and the boolean because the sign
bit can be used to make the distinction. In this case, pointers to children are
represented by positive values, whereas non positive values represent threads.
The type declaration
Binary search trees can be declared as:
generic
   : -- Exactly as in section 4.6.
package d_set is
   : -- Exactly as in section 4.12.1.
private
   max: constant integer:= 1000;
   type index is new integer range -max..max;
   type set is record root: index:= 0; end record;
   type iterator is record p: index; end record;
end d_set;
begin
   prep_mem_space;
end d_set;
function get_block return index is
   p: index;
begin
   if free=0 then raise space_overflow; end if;
   p:= free; free:= ms(free).rc;
   return p;
end get_block;
procedure release_block(p: in index) is
begin
   ms(p).rc:= free; free:= p;
end release_block;

procedure release_tree(p: in index) is
begin  -- Precondition: p>0
   if ms(p).lc > 0 then release_tree(ms(p).lc); end if;
   if ms(p).rc > 0 then release_tree(ms(p).rc); end if;
   release_block(p);
end release_tree;
where
procedure put
  (p: in out index; k: in key; x: in item; previn, nextin: in index) is
begin
   if p <= 0 then
      p:= get_block;
      ms(p):= (k, x, -previn, -nextin);
   else
      if k < ms(p).k then put(ms(p).lc, k, x, previn, p);
      elsif k > ms(p).k then put(ms(p).rc, k, x, p, nextin);
      else raise already_exists;
      end if;
   end if;
end put;
The operations get and update
These operations follow the same principles of binary search trees, the only
novelty in this case is that cursors are used instead of pointers. As usual, the
only difference between get aud update is that the assignment in update is
ms(p).x:= x instead of x:= ms(p).x and therefore only the algorithm for get is
reproduced here.
The operation remove
The removal of an element follows essentially the same steps as for normal binary
search trees: first the element to be removed has to be found, and then the
actual removal proceeds, which has the same subcases: no child, one child, two
children. However, some additional care is required to keep the threads to the
next and previous nodes in the inorder traversal of the tree. As in the case of
insertion, the main procedure remove defers its job to an auxiliary procedure
remove that has two additional parameters, namely the pointers to the previous
and next nodes in the inorder traversal of the tree that has as root the node
pointed by the first parameter:
procedure remove(s: in out set; k: in key) is
root: index renames s.root;
begin
remove(root, k, 0, 0);
end remove;
where
[Figure: the case in which p is the left child of nextin.]
The way to recognize that p is a left child is to check whether it is actually the
left child of nextin. If so, the field for the left child of nextin, i.e. p, has to be
replaced by the node that has to be visited immediately before it, which is
pointed by the thread in the field for the left child of p.
Otherwise, p is in the field for the right child of its parent, and this field has to
be replaced by the node that has to be visited immediately after it, which is
pointed by the thread in the field for the right child of p.
As the node pointed by p is going to be removed, the thread in the field for
the right child of the rightmost node in p's child has to be updated with the
value nextin.
Case 4:
In non-threaded trees, this removal is achieved by replacing the node to be removed
by the leftmost node of its right subtree. For threaded trees we shall do the same,
plus the adjustment of the threads. The figure below illustrates the situation:
where
procedure set_prev(p, prev: in index) is
   t: index;
begin
   if p > 0 then
      t:= p;
      while ms(t).lc > 0 loop t:= ms(t).lc; end loop;
      ms(t).lc:= -prev;
   end if;
end set_prev;
procedure set_next(p, next: in index) is
   t: index;
begin
   if p > 0 then
      t:= p;
      while ms(t).rc > 0 loop t:= ms(t).rc; end loop;
      ms(t).rc:= -next;
   end if;
end set_next;
The operations for iteration
The threads make the implementation of the type iterator and its related operations quite simple. As stated earlier, the iterator can be implemented by a
single index. The operation first has to make it point to the first element to be
visited, which is in the leftmost node of the tree, if there is any:
procedure first(s: in set; it: out iterator) is
   root: index renames s.root;
   p: index renames it.p;
begin
   p:= root;
   if p > 0 then
      while ms(p).lc > 0 loop p:= ms(p).lc; end loop;
   end if;
end first;
Making the iterator point to the next node to be visited in the inorder
traversal of the tree is easy too. If the current node has a right child, then the
next element to be retrieved is the leftmost one of the right child of the current
node. Otherwise, the thread found in the field for its right child points to the
next node to be visited directly. Note that if this thread is 0, the node pointed by
the iterator is the last one and has no successor.
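Following this description, the body of next can be sketched as below. This is a sketch, not the text's own listing: it assumes the cursor-based conventions above, where a non-positive field is a thread whose negation is the index of the threaded node (0 meaning none):

```ada
procedure next(s: in set; it: in out iterator) is
   p: index renames it.p;
begin
   if p = 0 then raise bad_use; end if;
   if ms(p).rc > 0 then
      -- A real right child: the next node is its leftmost descendant.
      p:= ms(p).rc;
      while ms(p).lc > 0 loop p:= ms(p).lc; end loop;
   else
      -- A thread: it points to the inorder successor (0 if none).
      p:= -ms(p).rc;
   end if;
end next;
```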
function is_valid(it: in iterator) return boolean is
   p: index renames it.p;
begin
   return p > 0;
end is_valid;
And retrieving the data associated to the iterator is immediate as well:
procedure get(s: in set; it: in iterator; k: out key; x: out item) is
p: index renames it.p;
begin
k:= ms(p).k; x:= ms(p).x;
end get;
4.12.7 B+ trees
The iteration on B+ trees is most easily solved by organizing the leaf nodes in a
doubly linked list, which can be achieved quite easily. From the point of view of
the type declaration, the leaf pages have to be extended by two additional
pointers, to the previous and to the next page in the list:
In the procedure put, in the place where a leaf was created at root level,
i.e. where the sentence p:= new node(leaf); appears, it must now
be complemented by setting the two pointers to null:
p:= new node(leaf); p.prev:= null; p.next:= null;
In the procedure put0, in the place where the tree grows in width at leaf
level, the sentence ps:= new node(leaf); must now be complemented by the
chaining of the node pointed by ps at the right of the node pointed by p:
ps:= new node(leaf);
pnext:= p.next;
ps.next:= pnext; ps.prev:= p; p.next:= ps;
if pnext/=null then pnext.prev:= ps; end if;
where pnext is a local auxiliary variable of type pnode. The figure below may
help to understand the changes in the links:
[Figure: the new leaf ps is chained between p and pnext.]
Concerning the deletions, the links have to be updated when the tree decreases
in width, i.e. in the procedure remove0, immediately before the call to
remove_child, the following sentences have to be added:
if pls.t=leaf then
   nextprs:= prs.next; pls.next:= nextprs;
   if nextprs/=null then nextprs.prev:= pls; end if;
end if;
Again, a figure may be helpful to understand the changes in the links:
[Figure: the page prs is unlinked from the list after its elements are merged into pls.]
With these links established, the implementation of the operations for iteration is straightforward. The type iterator is a record of a pointer to a leaf page
plus an index to some position in that page. Therefore we have:
type iterator is
record
p: pnode;
i: positive;
end record;
The operation first must go to the leftmost leaf, if any, and then point to the
first element in that page (alternatively, the type set might have been extended
with a pointer to the leftmost leaf):
procedure first(s: in set; it: out iterator) is
   root: pnode renames s.root;
   p: pnode renames it.p;
   i: positive renames it.i;
begin
   p:= root; i:= 1;
   if root/=null then
      while p.t=interior loop p:= p.tp(0); end loop;
   end if;
end first;
The subprograms next, is_valid and get are immediate and self-explanatory
too:
function is_valid(it: in iterator) return boolean is
   p: pnode renames it.p;
begin
   return p/=null;
end is_valid;
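The procedure next, not reproduced here, can be sketched as follows. The sketch assumes that each leaf page carries a field ne with its current number of elements (the same convention used by the extendible-hashing pages later in this section); the actual field name in the B+ tree declaration may differ:

```ada
procedure next(s: in set; it: in out iterator) is
   p: pnode renames it.p;
   i: positive renames it.i;
begin
   if p=null then raise bad_use; end if;
   i:= i+1;
   if i>p.ne then        -- past the last element of this page:
      p:= p.next; i:= 1; -- move to the next leaf, if any
   end if;
end next;
```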
procedure get(s: in set; it: in iterator; k: out key; x: out item) is
p: pnode renames it.p;
i: positive renames it.i;
begin
k:= p.te(i).k; x:= p.te(i).x;
exception
when constraint_error => raise bad_use;
end get;
4.12.8 Tries
An iterator for a trie has to keep the data required by an algorithm to step from
one leaf to the next. Specifically, these data are the pointers to the nodes in the
path from the root to the current node plus the choice taken at each branch,
i.e. the symbol used to choose the branch down, as the figure below shows.
[Figure: a trie; the iterator records the path from the root to the current leaf and the symbol chosen at each node.]
In principle, a stack might be used to keep such data. However, the sequence
of the choices is precisely the key to be retrieved. So, the stack to be used would
need an additional operation to retrieve all the stacked symbols, and it is more
practical to implement it directly by an array and an index. Consequently,
type path is array(keyindex) of pnode;
type iterator is
   record
      pth: path;
      k: key;
      i: keyindex;
   end record;
The procedure first is in charge of searching the leftmost branch up to the first
recorded key, if any. This is achieved by means of the procedure firstbranch,
which looks for the first non-null pointer in the node, if any.
procedure first(s: in set; it: out iterator) is
   root: pnode renames s.root;
   pth: path renames it.pth;
   k: key renames it.k;
   i: keyindex renames it.i;
   c: key_component;
   p: pnode;
   found: boolean;
begin
   p:= root; i:= i0;
   firstbranch(p, c, found);
   while found and c/=mk loop
      pth(i):= p; k(i):= c; i:= i+1;
      p:= p.all(c);
      firstbranch(p, c, found);
   end loop;
   pth(i):= p; k(i):= mk;
end first;
where
procedure firstbranch
(p: in pnode; c: out key_component; found: out boolean) is
begin
c:= mk; found:= (p.all(c)/=null);
while c<lastc and not found loop
c:= key_component'succ(c);
found:= (p.all(c)/=null);
end loop;
end firstbranch;
If s is the empty set, the procedure first provides as a result an iterator value
containing a key that is the empty string. In that case, the iterator will be
considered as non-valid.
The procedure next has to find the next recorded key in the trie, if any. It
has two phases. First, it goes upwards until it reaches a node that has a non-null
pointer for a symbol greater than the one that was chosen for the current key, if
any. Then it goes down the leftmost branch of the found subtree. The reader
should realize that if the first phase actually finds a subtree to go downwards,
that subtree contains at least one recorded key, because the algorithm that
inserts new keys never creates spurious branches (i.e. branches that do not end
in a node labelled with $) and the algorithm for remove described in section 4.10
prunes them out in case they appear.
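Under the conventions of first above (pth, k and i as declared, i0 being the first value of keyindex), a possible body for next is the following sketch:

```ada
procedure next(s: in set; it: in out iterator) is
   pth: path renames it.pth;
   k: key renames it.k;
   i: keyindex renames it.i;
   c: key_component;
   p: pnode;
   found: boolean;
begin
   -- Phase 1: go upwards until a branch to the right of the
   -- current choice exists.
   found:= false;
   while i > i0 and not found loop
      i:= i-1;
      p:= pth(i); c:= k(i);
      nextbranch(p, c, found);
   end loop;
   if not found then
      k(i0):= mk;   -- no further keys: the iterator becomes non-valid
   else
      -- Phase 2: go down the leftmost branch of the found subtree.
      while c/=mk loop
         pth(i):= p; k(i):= c; i:= i+1;
         p:= p.all(c);
         firstbranch(p, c, found);
      end loop;
      pth(i):= p; k(i):= mk;
   end if;
end next;
```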
The procedure nextbranch is identical to firstbranch, but for the fact that
the search starts from the symbol next to c, instead of the first possible value
of the type key_component:
procedure nextbranch
(p: in pnode; c: in out key_component; found: out boolean) is
begin
found:= false;
while c<lastc and not found loop
c:= key_component'succ(c);
found:= (p.all(c)/=null);
end loop;
end nextbranch;
As stated earlier, the non-valid value for the type iterator is characterized by
a key representing the empty string. So, checking the validity of the iterator is
immediate:
function is_valid(it: in iterator) return boolean is
k: key renames it.k;
begin
return k(i0)/=mk;
end is_valid;
4.12.9 Hashing: open
The retrieval of the elements from a set implemented by open hashing is rather
easy, although it is clear that it cannot be sorted. The type iterator is
represented by an index that indicates a slot and a pointer that addresses the
current node in that slot:
type iterator is
record
i: natural;
p: pnode;
end record;
The first element is the first cell of the first non empty slot, if any:
procedure first(s: in set; it: out iterator) is
   dt: dispersion_table renames s.dt;
   i: natural renames it.i;
   p: pnode renames it.p;
begin
   i:= 0;
   while i<b-1 and dt(i)=null loop i:= i+1; end loop;
   p:= dt(i);
end first;
If s is empty, all the lists are found empty and p is assigned the value null.
The procedure next is straightforward too. If further elements exist in the
current slot, then the next one is selected. Otherwise, the next non-empty slot
is searched and its first element selected.
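This description can be sketched as follows, assuming that each cell of a slot's list carries a field next pointing to the following cell (the actual field name in the open-hashing declaration may differ):

```ada
procedure next(s: in set; it: in out iterator) is
   dt: dispersion_table renames s.dt;
   i: natural renames it.i;
   p: pnode renames it.p;
begin
   if p=null then raise bad_use; end if;
   p:= p.next;                   -- further elements in the same slot
   while p=null and i<b-1 loop   -- otherwise, search the next
      i:= i+1; p:= dt(i);        -- non-empty slot
   end loop;
end next;
```

If no further element exists, p ends up being null, which characterizes the non-valid iterator.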
4.12.10 Hashing: closed
Iteration in closed hashing is yet simpler than in open hashing, and the
algorithms are the same no matter what the rehashing function is. Naturally,
the retrieval cannot be sorted either.
The type iterator can be a simple index to the dispersion table:
type iterator is
record
i: natural;
end record;
The first element is the first non-free component of the dispersion table,
if any. If s is the empty set, all the components of the dispersion table will
be found free and the value assigned to the iterator will be b, which will be
recognized as a non-valid iterator value:
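From this description, the body of first can be sketched directly (dt(i).st=free marks a free component, the same test used by next):

```ada
procedure first(s: in set; it: out iterator) is
   dt: dispersion_table renames s.dt;
   i: natural renames it.i;
begin
   i:= 0;
   -- Stop at the first occupied component, or at b if none exists.
   while i<b and then dt(i).st=free loop i:= i+1; end loop;
end first;
```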
The procedure next is identical to first, but starting at the position next to
the one currently pointed by the iterator:
procedure next(s: in set; it: in out iterator) is
   dt: dispersion_table renames s.dt;
   i: natural renames it.i;
begin
   if i=b then raise bad_use; end if;
   i:= i+1;
   while i<b and then dt(i).st=free loop i:= i+1; end loop;
end next;
The subprograms is_valid and get are simple and self-explanatory:
function is_valid(it: in iterator) return boolean is
i: natural renames it.i;
begin
return i<b;
end is_valid;
4.12.11 Hashing: extendible
type node(t: nodetype) is
   record
      case t is
         when disp_node =>
            -- Exactly as in section 4.11.5
         when data_node =>
            -- Exactly as in section 4.11.5
            prev, next: pnode;
      end case;
   end record;
type set is
   record
      -- Exactly as in section 4.11.5, plus:
      first: pnode; -- first data page in the linked list
   end record;
When a page, pointed by p, is merged with its split image, pointed by ps, the
elements from the latter are copied into the former and the latter is removed
from the structure. By construction, one is next to the other in the list. To be
precise, the one that corresponds to a 0 in the leading bit goes before the one
that corresponds to a 1. To make the appropriate removal from the linked list,
it is important that the page that goes first in the list plays the role of p and
the page that goes next plays the role of ps, so that the elements are copied
into the page that goes first and the page that goes next is removed. Therefore,
immediately before the sentence merge(p, ps); the following sentences have to
be added:
swap(p, ps);
nextps:= ps.next; p.next:= nextps;
if nextps/=null then nextps.prev:= p; end if;
where nextps is a local auxiliary variable of type pnode and the procedure
swap is:
procedure swap(p, ps: in out pnode) is
   paux: pnode;
begin
   if p.next/=ps then
      paux:= p; p:= ps; ps:= paux;
   end if;
end swap;
With these links established, the operations to iterate on the set are quite
straightforward, although slightly different from the case of B+ trees because
now it may happen (although it is very unlikely) that some data page is empty.
The declaration of the type iterator is, like for B+ trees, a pointer to a data
page plus an index to a component in that page:
type iterator is
record
p: pnode;
i: natural;
end record;
The procedure first has to look for the first non-empty page. Note that if
s.first/=null, at least one non-empty page exists.
procedure first(s: in set; it: out iterator) is
   p: pnode renames it.p;
   i: natural renames it.i;
begin
   p:= s.first;
   while p/=null and then p.ne=0 loop p:= p.next; end loop;
   i:= 1;
end first;
If additional elements remain in the current page, the procedure next selects
the next element in the current page. Otherwise, it looks for the next non-empty
page:
procedure next(s: in set; it: in out iterator) is
p: pnode renames it.p;
i: natural renames it.i;
begin
if p=null then raise bad_use; end if;
i:= i+1;
if i>p.ne then
p:= p.next;
while p/=null and then p.ne=0 loop p:= p.next; end loop;
i:= 1;
end if;
end next;
The subprograms is_valid and get are so simple that they are self-explanatory:
function is_valid(it: in iterator) return boolean is
p: pnode renames it.p;
begin
return p/=null;
end is_valid;
procedure get(s: in set; it: in iterator; k: out key; x: out item) is
p: pnode renames it.p;
i: natural renames it.i;
begin
k:= p.te(i).k; x:= p.te(i).x;
end get;
4.13
The most obvious advantage of the fact that the elements from a set can be
retrieved in order is the possibility of printing them out in a more attractive
and useful way.
Nevertheless, there are other important advantages, perhaps less obvious.
For instance, the fact that some sets can be retrieved in sorted order with
respect to the same index gives a database management system the chance to
implement queries involving joins in a more efficient manner. It is beyond the
scope of this book to describe the strategies for optimizing queries in database
systems; however, it is essential for a database designer to have a good
understanding of how B+ trees and extendible hashing work, so that he can decide
which approach he shall request the database system to implement a given index,
according to the expected queries and their respective frequencies. Indeed,
hashing provides a faster access to individual elements, whereas B+ trees provide
sorted retrieval, which can be the basis for more efficient implementations of
some kinds of queries.
Sorted retrieval is also important regarding the operations of union, difference
and intersection. If it is available, these operations can be implemented in
linear time by means of merging algorithms on sorted sequences, such as those
studied for ordered sequential files in section 8.5 of vol. II. Otherwise, they have
to be implemented by traversing one of the sets and, for each element retrieved
from the first set, checking if it is in the second set. The sets implemented by
means of boolean arrays are an exception, as for such an implementation these
operations are best implemented by means of bit-based operations. The
details are left as an exercise to the reader.
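As an illustration of the merging approach, the following sketch computes the union of two sorted sequences of keys in linear time. The array type and the procedure itself are not from the text; they are chosen only for the example, and c is assumed to be large enough and indexed from 1:

```ada
type key_array is array(positive range <>) of key;

procedure union(a, b: in key_array; c: out key_array; nc: out natural) is
   i: natural:= a'first;
   j: natural:= b'first;
begin
   nc:= 0;
   -- Advance in both sequences, copying the smaller key at each step.
   while i<=a'last and j<=b'last loop
      nc:= nc+1;
      if a(i)<b(j) then c(nc):= a(i); i:= i+1;
      elsif b(j)<a(i) then c(nc):= b(j); j:= j+1;
      else c(nc):= a(i); i:= i+1; j:= j+1; -- present in both: copy once
      end if;
   end loop;
   -- Copy the remainder of the non-exhausted sequence, if any.
   while i<=a'last loop nc:= nc+1; c(nc):= a(i); i:= i+1; end loop;
   while j<=b'last loop nc:= nc+1; c(nc):= b(j); j:= j+1; end loop;
end union;
```

Each element of a and b is examined exactly once, hence the linear execution time. Intersection and difference follow the same scheme, changing only which keys are copied.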
4.14
This long chapter has presented a rather complete repertoire of techniques for
implementing sets. A proficient programmer should be familiar with all of
them, so that he can choose the most appropriate one for the particular
characteristics of his application. The choice should be made according to
different factors:
1. The size of the sets. For small sets the simplest techniques are preferable,
and they can provide even lower execution times than the more sophisticated
ones. Instead, for large sets, with the simplest techniques, such as arrays
of elements or linked lists, the execution times of the operations become
unacceptably high.
2. The keys are of a discrete type. For keys of discrete types, the implementations based on arrays indexed by the keys themselves are the best
choice unless the cardinality of the type exceeds by far the sizes of the sets
that are expected to be managed.
3. The keys have a total order relationship defined. If not, techniques
such as binary search trees and B+ trees cannot be applied.
4. The values of the keys have a unique binary representation. If
not, hashing techniques have to be discarded.
5. The keys are arrays of symbols. If so, the trie approach should be
considered as a good choice.
6. The set has to be stored in secondary storage. If so, B+ trees
and extendible hashing are most appropriate techniques. In particular
circumstances, other techniques can also be adapted to secondary storage.
For instance, the nodes of a trie can be implemented as records in a direct
file.
9. The expected frequency of each operation. Indeed, the implementation
technique should be chosen so that it provides the highest efficiency
for the most frequent operations.
The table below summarizes the asymptotic execution times of the operations
of the abstract data type set according to the different implementation
techniques:
                          empty   get      update   put      remove   sorted     union/inters./
                                                                      retrieval  difference
Arr. of booleans          n       1        1        1        1        n          n
Arr. of elem. (unsort.)   1       n        n        n        n        -tt        n2
Arr. of elem. (sorted)    1       log n    log n    n        n        n          n
Linked list (unsort.)     1       n        n        n        n        -tt        n2
Linked list (sorted)      1       n        n        n        n        n          n
Binary search trees       1       log n t  log n t  log n t  log n t  n          n
Bal. bin. search trees    1       log n    log n    log n    log n    n          n
B+ trees                  1       log n    log n    log n    log n    n          n
Tries                     1       1        1        1        1        n          n
Hashing                   1       1 t      1 t      1 t      1 t      -ttt       n

t   These values are in average. There exist worst cases that make them O(n).
tt  A sorting algorithm is needed, that can be applied in the same structure.
ttt A sorting algorithm is needed, but must be applied in a different structure.
In all cases, there is the obvious requirement that the keys have the test for
equality defined. Additionally, the different implementations have the following
constraints regarding the properties of the keys:
Arr. of booleans   The keys belong to a discrete type of a limited range.
Tries              The keys are arrays whose elements belong to a discrete
                   type of a limited range.
Hashing            Either the keys have a specific hash function defined
                   or each key value has a unique binary representation.