
BIT 18 (1978), 184-201

DYNAMIC HASHING
PER-ÅKE LARSON

Abstract.
A new file organisation called dynamic hashing is presented. The organisation is based
on normal hashing, but the allocated storage space can easily be increased and decreased
without reorganising the file, according to the number of records actually stored in the file.
The expected storage utilisation is analysed and is shown to be approximately 69% all the
time. Algorithms for inserting and deleting a record are presented and analysed. Retrieval of
a record is fast, requiring only one access to secondary storage. There are no overflow
records. The proposed scheme necessitates maintenance of a relatively small index
structured as a forest of binary trees or slightly modified binary tries. The expected size of
the index is analysed and a compact representation of the index is suggested.
Keywords: file organisation, hashing, dynamic storage allocation, data structures, hash
trees, tries, information retrieval.

Received November 29, 1977. Revised February 16, 1978.

1. Introduction.
Hashing is a well-known technique for organising direct access files. The
method is simple and retrieval, insertion and deletion of records is normally very
fast. One of the major disadvantages is the static storage allocation. The size of the
file must be estimated in advance and physical storage space allocated for the
whole file. The amount of allocated storage is fixed and cannot be altered without
reorganising the whole file. Knuth [6] gives an excellent survey and analysis of
hashing. Another survey with an extensive bibliography can be found in Knott [3].
If the storage requirements for a hash file are underestimated, the number of
overflow records will be large, which slows down searching and updating. If the
storage requirements are overestimated, storage utilisation will be low and
valuable storage space wasted. In situations where the storage requirements vary
dynamically, i.e. the actual number of records in the file varies rapidly, it is very
difficult to determine how much storage to allocate. At certain times the amount
of storage is too low, and the access times increase. At other times the amount of
storage is unnecessarily high and the storage utilisation low. Especially in such
situations a file organisation is needed which allows the allocated storage space
to grow and shrink according to the actual number of records in the file, and
which enables fast access to a record, if possible requiring only one access to
secondary storage.
This paper presents and analyses a file organisation technique which achieves
the above goals. It is based on normal hashing, but the allocated storage can be
increased and decreased without reorganising the file. The expected storage
utilisation is approximately 69% at all times and there are no overflow records.
The price which has to be paid for this is the maintenance of a relatively small
index. If this index is available in main storage, only one access to secondary
storage is necessary when searching for a record.

2. File structure.
The file organisation method to be described employs a data structure
consisting of a data file in which the data records are stored, and an index to the
data file. The index is organised as a forest of binary hash trees. The structure of a
binary hash tree is explained below. The hash trees used here are similar to the
prefix hash trees introduced by Coffman and Eve [2]. They are also closely related
to binary tries. The relation is discussed in more detail in section four. An account
and analysis of tries is given by Knuth [6], section 6.3. The data file is nothing
more than a variable number of buckets of fixed size.
The set of records to be stored at a certain time is denoted by {Ri}, i = 1, 2, ..., n.
The number of records n is not fixed but may vary with time. A record Ri is
assumed to contain a unique key Ki. The set of keys is denoted by {Ki},
i = 1, 2, ..., n. To simplify matters the records are assumed to be of fixed length.
Each bucket in the data file has a capacity of c records.
The file is initialised in much the same way as a normal hash file. Secondary
storage space is allocated for m buckets. In the index m entries are initialised, one
entry for each bucket, each entry containing a pointer to a bucket in the data file.
Fig. 1 illustrates the initial situation for a file with m = 2.

Fig. 1. Initial structure of a dynamic hash file containing two buckets on level zero.

The initial buckets are said to be on level zero. A hashing function H0 for
distributing the records among the buckets is also needed. The function H0 is a
quite normal hashing function, and hence is a mapping from the set of keys {Ki}
into the set {1, 2, ..., m}, see e.g. [6]. The value H0(Ki) in this case is used to define
an entry point in the index and does not refer directly to a bucket. The bucket is
found by means of the pointer in the corresponding entry.

When the file is properly initialised we can start loading the file. This is also
done in approximately the same way as for a normal hash file, but using the index
to locate the buckets. Sooner or later a bucket will overflow, i.e. an attempt is made
to insert a record into a bucket that is already full. When this happens we split the
bucket into two. Storage space for a new bucket is allocated and the records are
distributed equally among the two buckets. At the same time the index is updated
to depict the new situation. Additional records that would be stored in the split
bucket are distributed between the two buckets. If, later, one or the other of the
two buckets becomes full, this in turn is split into two buckets, etc. Fig. 2 depicts
the structure of the file from Fig. 1 after three splits have occurred.

Fig. 2. Structure of a dynamic hash file after three splits.

The size of the file has increased to five buckets as a result of the three splits.
The index has grown to a forest consisting of two trees with three and five nodes.
Internal nodes are shown as circles and external nodes as squares.
The following has occurred: bucket 1 was split and the records in the bucket
were distributed among bucket 1 and bucket 3. At the same time node 1 became a
father and had two sons: nodes 10 and 11. Bucket 2 was split, the records were
moved to buckets 2 and 4, and node 2 gave birth to sons 20 and 21. Thereafter
bucket 4 split into buckets 4 and 5 and node 21 had sons 210 and 211. A similar
splitting technique is used in connection with several other file organisations,
among others IBM's Virtual Storage Access Method (VSAM) [7] and B-trees [1],
[6].
When the number of records stored decreases the allocated storage space can
also be decreased. When the number of records in two brother buckets becomes
less than or equal to the capacity of one bucket, the two brother buckets are
merged into one, and one bucket can be freed. Two buckets are brothers if the
corresponding external nodes have the same father node. At the same time the
corresponding search tree is updated. Fig. 3 depicts the structure of the example
file after the brother buckets 1 and 3 and the brother buckets 4 and 5 have been
merged and the buckets 3 and 5 freed.

Fig. 3. Structure of the file from fig. 2 after two merges.

The next problem is how to find a record in a file of the above structure. We
assume that one access to secondary storage is needed to read or write a bucket
and that the index is available entirely in main storage. Let us first assume that,
when a bucket is split, the first c/2 records are stored in the left bucket while the
c/2 last records go to the right bucket. Any additional record goes to the bucket
which contains fewer records. How many accesses to secondary storage are then
required to find a record?
It is a simple task to find a record or to establish the fact that a certain record is
not in the file. From the key the corresponding search tree is found by means of
Ho. By traversing this search tree, in any order, all the buckets are found in which
the record may be stored. However, the number of accesses to secondary storage
might be large. Finding a record stored in bucket 5 in Fig. 2 takes three accesses in
the worst case. An unsuccessful search in search tree 2 always results in three
accesses. A successful search on the average means making access to half of the
buckets in the search tree, while an unsuccessful search involves checking all the
buckets in the tree. This must be considered too time-consuming: can the number
of accesses be reduced?
To achieve a reduction the decision whether to put a record in the left or the
right bucket when splitting must be uniquely determined by the key of the record.
Furthermore, the decision must not be influenced by which other records happen
to be stored in the bucket when it is split.
We define a second hashing function B, which is a mapping from the set of keys
{Ki} into the set of infinite binary sequences:

(1) B(Ki) = (bi0, bi1, bi2, ...), bij ∈ {0, 1}, j = 0, 1, 2, ....



A binary sequence can be used to determine a unique path in a binary tree. For
For j = 0, 1, 2, ..., if bij = 0 take the left branch, otherwise take the right branch.
When inserting a record Ri with key Ki we first locate the root of the search tree
by computing H0(Ki). Then we scan down the path uniquely determined by B(Ki)
until we reach an external node. This node contains a pointer to the bucket in
which the record is to be stored. If the bucket is full already, the node is split into
two nodes, a new bucket is allocated to one of the nodes and the records are
divided into the left and the right bucket. Whether a record with key Kj in the
bucket goes left or right is uniquely determined by its corresponding binary
sequence B(Kj).
It is obvious that only one access to secondary storage is required when
searching for a record, provided that the index is available in main storage. This is
the case both for a successful and an unsuccessful search. In which bucket a record
with key Ki must be stored is found by traversing the unique path in the index
determined by H0(Ki) and B(Ki).
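
To illustrate, the retrieval step can be sketched in Python (an illustrative rendering, not part of the original scheme; the node fields and the helpers h0 and bits are hypothetical stand-ins for H0 and B):

class Node:
    """Index node: internal if sons are present, external otherwise."""
    def __init__(self, bucket=None):
        self.left = None      # left son, followed on bit 0
        self.right = None     # right son, followed on bit 1
        self.bucket = bucket  # bucket pointer, meaningful only in external nodes

def find_bucket(key, roots, h0, bits):
    """Locate the bucket for `key`: H0 selects a root on level zero,
    then the bit sequence B(K) steers the descent to an external node."""
    node = roots[h0(key)]          # entry point into the forest
    seq = bits(key)                # iterator over b0, b1, b2, ...
    while node.left is not None:   # internal node: branch on the next bit
        node = node.right if next(seq) else node.left
    return node.bucket             # one secondary-storage access from here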
It is suggested that the hashing function B is implemented by means of a
pseudo-random number generator. The generator is designed to return 0 or 1 with
probability 0.5 when called. The pseudo-random 0-1 sequence obtained by
successive calls is employed as the binary sequence (1). The generator must be of a
type where the generated sequence is uniquely determined by the seed. A
generator fulfilling these requirements is readily constructed, see e.g. [5]. To make
the generated binary sequence uniquely determined by the key of the record, the
seed for the generator is computed from the key. For this we need a hashing
function H1 defined on the set of keys {Ki} and having a range which is a subset of
the set of allowable seed points for the generator in use. The hashing functions H0
and H1 should be independent of each other; in no case may they be the same
function.
Implementing the hashing function B using a random number generator is
not necessary, from a strict point of view, but has one important advantage. If the
two initial transformations H0 and H1 cannot separate two records, the random
number generator will not be able to separate the records either, i.e. if two keys Ki
and Kj have H0(Ki) = H0(Kj) and H1(Ki) = H1(Kj) then B(Ki) = B(Kj). We want
the generated hash trees to be stochastically balanced. This means that the
expected number of records stored in the left and the right subtree of any node
should be equal. Balanced trees result in shorter expected path lengths and better
storage utilisation. To ensure this balance we want the elements bij, j ≥ 0, of each
sequence B(Ki) to be independent of each other and to be 0 or 1 with probability
0.5. To guarantee such a bit distribution from the transformation H1 seems to be a
difficult task. It is more easily achieved by using a random number generator;
indeed, they are designed precisely for that purpose.
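
A minimal sketch of this construction in Python, assuming some hash function h1 in the role of H1 (the particular generator is illustrative; any generator whose output is fully determined by its seed will do):

import random

def B(key, h1):
    """Yield the bit sequence B(K) of equation (1): a pseudo-random
    generator seeded with H1(K), so the same key always produces the
    same 0-1 sequence."""
    rng = random.Random(h1(key))   # seed determined solely by the key
    while True:
        yield rng.getrandbits(1)   # 0 or 1 with probability 0.5

Note that two keys colliding under both H0 and H1 receive identical sequences; this is exactly the failure case discussed in section 3.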
Let us briefly summarise the most important characteristics of the proposed file
organisation scheme. We have ended up with a method of roughly the same type
as normal hashing, but with the important difference that the allocated physical
storage space is easily increased or decreased as required by the actual number of
records stored. There is no overflow problem since overflow records do not occur.
Retrieval of a record requires only one access to secondary storage provided that
the forest of hashing trees is in main storage. In the subsequent sections the
method is analysed further.

3. Insertion and deletion algorithms.


The basic operations in connection with files are retrieval, insertion, and
deletion of a single record. The usefulness of a file organisation scheme is greatly
influenced by the complexity and speed of these operations. In this section
algorithms for deleting and inserting a record in a dynamic hash file are presented
and analysed. The algorithms are written in an Algol-like style. The first steps of
the deletion algorithm constitute a complete search algorithm.
The algorithms are developed assuming that all the search trees in the index
reside in main storage. It is also assumed that there exists a storage management
system from which storage space can be requested when more is needed and to
which free space can be returned. This is required both for main and secondary
storage. The hash trees grow and shrink and the number of buckets in the data
file fluctuates. The calls NEWBUCKET and NEWNODE are assumed to create a
new bucket or a new node and to return the location of the new bucket or node.
In order to keep the algorithms simple and not blur the main ideas with too
many details, we choose a straightforward organisation of the search trees.
Another alternative is discussed in section 5. The internal and external nodes
contain the following fields:

internal node (TAG = 0):  TAG | FATHER | LEFT | RIGHT

external node (TAG = 1):  TAG | FATHER | RCRDS | BKT

FATHER is a pointer to the father of the node, with FATHER = nil for nodes on
level zero. LEFT and RIGHT are pointers to the left and right son of the node.
BKT is a pointer to the bucket on secondary storage where the records from this
node are stored. RCRDS is the number of records in the bucket. The nodes are
assumed to be of the same size. This makes it possible to change an internal node
to an external, and vice versa, "in place".
The algorithms make use of two hashing functions H0 and H1, and a random
number generator called RAND with the characteristics discussed in section two.
There are a fixed number m of nodes on level zero and consequently a fixed
number of search trees. The roots of the search trees (the nodes on level zero) are
stored in fixed locations. The function H0 directly returns a pointer to the root of
the search tree. Two buffers, called BUFFER(i), i = 1, 2, are used, and c is the
capacity of a bucket measured in number of records.

procedure insertarecord;
begin B1 := 1; B2 := 2; P := H0(K); L := 0;
  initialise RAND using H1(K);
  {Scan down the tree}
  while TAG(P) = 0 do
    begin if RAND = 0 then P := LEFT(P)
                      else P := RIGHT(P);
      L := L + 1
    end;
  if RCRDS(P) > 0 then read in BKT(P) into BUFFER(B1)
                  else BKT(P) := NEWBUCKET;
  while RCRDS(P) = c do
    {Now there are c records in BUFFER(B1), which are to be distributed
     between two buckets. The records are designated R1, R2, ..., Rc with
     keys K1, K2, ..., Kc}
    begin X := RAND; LC := 0; RC := 0;
      for i := 1 until c do
        begin initialise RAND using H1(Ki);
          call RAND L times;
          if RAND = 1 then move Ri to BUFFER(B2), RC := RC + 1
                      else leave Ri in BUFFER(B1), LC := LC + 1
        end;
      {Create two new nodes}
      P1 := NEWNODE; P2 := NEWNODE;
      FATHER(P1) := P; FATHER(P2) := P;
      RCRDS(P1) := LC; RCRDS(P2) := RC;
      TAG(P1) := 1; TAG(P2) := 1;
      if X = 0 then begin BKT(P1) := BKT(P);
                      if RC > 0 then BKT(P2) := NEWBUCKET,
                                     write out BUFFER(B2) into BKT(P2)
                                else BKT(P2) := nil
                    end
               else begin BKT(P2) := BKT(P);
                      if LC > 0 then BKT(P1) := NEWBUCKET,
                                     write out BUFFER(B1) into BKT(P1)
                                else BKT(P1) := nil;
                      interchange B1 <-> B2
                    end;
      LEFT(P) := P1; RIGHT(P) := P2; TAG(P) := 0;
      if X = 0 then P := P1 else P := P2;
      L := L + 1; initialise RAND using H1(K);
      call RAND L times
    end {while-clause};
  insert R into BUFFER(B1); RCRDS(P) := RCRDS(P) + 1;
  write out BUFFER(B1) into BKT(P)
end {insertarecord};
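
The splitting logic of insertarecord can be condensed into the following executable Python sketch (illustrative only: in-memory lists stand in for the secondary-storage buckets, bit(k, j) stands for element b_j of B(k), and the new record is placed before splitting rather than buffered separately as in the procedure above):

C = 10  # bucket capacity c

class Node:
    def __init__(self):
        self.left = self.right = None   # sons; None in an external node
        self.records = []               # stands in for BKT and RCRDS

def insert(key, record, roots, h0, bit):
    node, level = roots[h0(key)], 0
    while node.left is not None:             # scan down the tree
        node = node.right if bit(key, level) else node.left
        level += 1
    node.records.append((key, record))
    while len(node.records) > C:             # split, repeatedly if need be
        node.left, node.right = Node(), Node()
        for k, r in node.records:            # redistribute on bit b_level
            side = node.right if bit(k, level) else node.left
            side.records.append((k, r))
        node.records = []
        # continue with the son that received the new record
        node = node.right if bit(key, level) else node.left
        level += 1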

Insertion of a record may require more than one split. This happens when the
c + 1 records involved in a split all go either left or right. Denote the number of
splits by h, given the bucket capacity c. Assume that a record goes left or right
with equal probability and that all records do so independently of each other.
Then it can easily be shown that the probability of h splits is:

(2) P(h) = (1 − 0.5^c)·0.5^{c(h−1)}, h = 1, 2, ....

The expected number of splits is then

(3) E(h) = 1/(1 − 0.5^c).


Denote the number of new nodes created by s. The number of new nodes is always
twice the number of splits and hence E(s) = 2·E(h). The number of new buckets
created is always one. We note that multiple splits are very rare when using large
buckets. When the buckets have a capacity of 10 records, multiple splits will occur
once in approximately a thousand splits. How often splitting takes place is analysed
in section 4.
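
A quick numerical check of (2) and (3) for c = 10 (illustrative):

c = 10
p_cont = 0.5 ** c            # all c+1 records fall on the same side
E_h = 1 / (1 - p_cont)       # expected number of splits, equation (3)
P_multi = p_cont             # P(h >= 2) = 1 - P(1), from equation (2)
print(E_h, P_multi)          # 1.000978..., 0.000977: about once per 1024 splits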
In certain circumstances the insertion can fail completely and the splitting may
continue without success. When there are more than c records which hash to the
same tree and which initiate the random number generator from the same point,
the generator will never be able to break up this block of records. Such a situation,
however, is extremely unlikely for properly designed hashing functions. The
system can easily be designed to keep an eye on its own performance and to warn
the user when the risk of failure rises above a certain level.

procedure deletearecord;
begin B1 := 1; B2 := 2; P := H0(K);
  initialise RAND using H1(K);
  {Scan down the tree}
  while TAG(P) = 0 do if RAND = 0 then P := LEFT(P)
                                  else P := RIGHT(P);
  if RCRDS(P) > 0 then
    begin read in bucket BKT(P) into BUFFER(B1);
      search BUFFER(B1) for a record with key K;
      if the record is found then
        begin delete the record from BUFFER(B1);
          RCRDS(P) := RCRDS(P) - 1;
          F := FATHER(P); TRYTOMERGE := true;
          {Try to merge the bucket found and its brother bucket}
          while TRYTOMERGE do
            begin
              if F <> nil then
                begin if LEFT(F) = P then T := RIGHT(F)
                                     else T := LEFT(F);
                  if TAG(T) = 1 and RCRDS(T) + RCRDS(P) <= c then
                    begin
                      if RCRDS(T) > 0 then
                        begin read in bucket BKT(T) into BUFFER(B2);
                          merge the records in BUFFER(B1) and
                            BUFFER(B2) into BUFFER(B1);
                          return bucket BKT(T)
                        end;
                      RCRDS(F) := RCRDS(T) + RCRDS(P);
                      BKT(F) := BKT(P); TAG(F) := 1;
                      return nodes P and T;
                      P := F; F := FATHER(P)
                    end
                  else TRYTOMERGE := false {the brother node is not a leaf
                                            or there are too many records}
                end
              else TRYTOMERGE := false {no father and no brother}
            end {while-clause};
          {All merging is done, write out}
          if RCRDS(P) = 0 then return bucket BKT(P), BKT(P) := nil
                          else write out BUFFER(B1) into BKT(P);
          report "record deleted"
        end
      else report "record not found" {the record was not in the bucket}
    end
  else report "record not found" {the bucket was empty}
end {deletearecord};

The number of accesses to secondary storage when inserting or deleting a
record is easily determined from the algorithms, see table 1. The figures in table 1
do not include possible accesses needed by other systems when allocating space
for a new bucket or when de-allocating a bucket.

Table 1. Accesses to secondary storage required when
inserting or deleting a record.

                                          Insertion    Deletion
Splitting or merging does not occur:
  1. the bucket is or becomes empty           1            1
  2. otherwise                                2            2
Splitting or merging occurs (once)            3            3

4. Analysis.
In this section the main interest will be in the storage utilisation of a dynamic
hash file. We shall start by analysing the number of nodes in the index. From this
the number of allocated buckets and the storage utilisation follow readily. In the
analysis m denotes the number of hash trees, n the number of records to be stored
in the file, and c the capacity of a bucket in number of records.
Each hash tree in the index is a complete binary tree. In a complete binary tree
every node has either two sons or none, and the number of internal nodes is
always one less than the number of external or end nodes. A forest of m complete
binary trees with a total of k internal nodes has k + m external nodes and hence a
total of 2k + m nodes.
As already noted, the hash trees in the index are closely related to binary tries,
cf. Knuth [6] section 6.3. Take any of the hash trees and the records which hash to
this tree. Consider the sequences obtained from the hashing function B as infinite-
precision real numbers in the interval [0, 1). If, using this new set of keys, we build
a binary trie, stopping whenever reaching a subfile of c or less records, the
resulting trie will correspond exactly to the internal nodes of the hash tree. In
other words, the hash tree is equivalent to the corresponding trie with external
nodes added to form a complete binary tree. If the trie has k nodes the hash tree
will have k internal nodes and k + 1 external nodes.
In the subsequent analysis we regard the index as a forest of m infinite binary
trees with certain nodes active and others inactive. Of course, only the active
nodes are actually stored in the index. The number of nodes on level r is m·2^r, r ≥ 0.
The probability that the search sequence H0(K), B(K) of a certain record with key
K passes through a given node on level r is

(4) p_r = 2^{-r}/m, r ≥ 0.

The probability that x out of n records pass through the node is then

(5) p_r(x) = C(n, x)·(2^{-r}/m)^x·(1 − 2^{-r}/m)^{n−x}, 0 ≤ x ≤ n.

A node will be active if and only if more than c records pass its father. A node will
be an active end node if and only if more than c records pass its father and at
most c records pass the node. The conditional probability that x records pass a
node if k records pass its father is

(6) C(k, x)·0.5^k, 0 ≤ x ≤ k.

We will say that an active node controls or owns the records which pass it.
Combining the above probabilities we obtain the probability that a node on level
r will be an active node controlling x records:

(7) q_0(x) = p_0(x); q_r(x) = Σ_{k=c+1}^{n} p_{r−1}(k)·C(k, x)·0.5^k, r > 0.

An active node will be an end node if it controls at most c records and an end node
will have a bucket allocated if it controls at least one record. End nodes
controlling no records are not allocated a bucket. The probability that a node on
level r is an active end node is then

(8) e_r = Σ_{j=0}^{c} q_r(j), r ≥ 0,

and the probability of an active bucket on level r is

(9) b_r = Σ_{j=1}^{c} q_r(j), r ≥ 0.

The expected number of allocated buckets on level r is therefore

(10) E(b_r) = m·2^r·b_r, r ≥ 0,



and the total number of allocated buckets

(11) E(b) = m·Σ_{r=0}^{∞} 2^r·b_r.

The probability that an allocated bucket on level r contains x records is simply

(12) d_r(x) = q_r(x)/b_r, r ≥ 0, 1 ≤ x ≤ c,

and the number of records trapped on level r is

(13) E(t_r) = E(d_r)·E(b_r), r ≥ 0.


The expected storage utilisation is defined as

(14) E(u) = (n/c)/E(b),

i.e. the number of buckets minimally required divided by the expected number of
allocated buckets. A useful characteristic is the number of records stored
compared with the total capacity on level zero. This ratio is called the load factor
and is defined as

(15) l = n/(mc).

Let us first study a numerical example in somewhat more detail, in order to gain
some insight into the behaviour of a dynamic hash file. We choose the parameter
combination n = 4000, m = 100, c = 10, i.e. the file contains 4000 records, there
are 100 buckets on level zero and each bucket has a capacity of 10 records. The
load factor is four. The expected characteristics of the file are shown in table 2.

Table 2. Expected number of allocated buckets, of records trapped (absolutely and
as a percentage) and of records per bucket on different levels for the example in
the text.

 r      E(b_r)     E(t_r)    E(t_r)%    E(d_r)

 0        0.0        0.0       0.0       9.70
 1        2.1       19.7       0.5       9.24
 2      228.9     1812.4      45.3       7.92
 3      322.6     2040.6      51.0       6.33
 4       21.7      125.7       3.1       5.79
 5        0.2        1.1       0.0       5.62
 6        0.0        0.0       0.0       5.56
 Σ      575.5     3999.5      99.9

The expected number of buckets allocated is 575.5 and the minimum number
required is 400. If all the buckets were completely full, all active buckets would be
on level two. Due to the stochastic behaviour of a hash file the records are spread
out on several levels, but 96.3% of the records are on levels two and three. The
expected storage utilisation is 69.5%.
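
The figures in table 2 can be reproduced by evaluating equations (5), (7) and (9)-(11) directly; the following Python sketch does so for the expected total number of buckets (an illustrative computation, not from the paper: binomials are evaluated in log space to avoid overflow, and truncating the level and record sums is an assumption of the sketch):

from math import exp, lgamma, log

def expected_buckets(n, m, c, max_level=12):
    """E(b) of equation (11), built from equations (5), (7) and (9)."""
    def log_comb(a, b):                      # log of the binomial coefficient
        return lgamma(a + 1) - lgamma(b + 1) - lgamma(a - b + 1)
    def p(r, x):                             # equation (5)
        q = 2.0 ** (-r) / m
        return exp(log_comb(n, x) + x * log(q) + (n - x) * log(1.0 - q))
    E_b = 0.0
    for r in range(max_level + 1):
        if r == 0:                           # equation (7), level zero
            q_r = [p(0, x) for x in range(c + 1)]
        else:                                # equation (7), r > 0
            q_r = [sum(p(r - 1, k) * exp(log_comb(k, x) + k * log(0.5))
                       for k in range(c + 1, n + 1))
                   for x in range(c + 1)]
        E_b += m * 2 ** r * sum(q_r[1:])     # equations (9)-(11)
    return E_b

E_b = expected_buckets(4000, 100, 10)
print(E_b, (4000 / 10) / E_b)                # about 575.5 and 0.695, cf. table 2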

Fig. 4. Probability distribution of the number of records controlled by a node on levels zero to four.

The probability distribution of the number of records controlled by a node on
different levels is plotted in fig. 4. The proportion of the distributions to the left of
x = 10 defines the probability distribution of the number of records in an active
bucket on different levels, except for the scale factor b_r, cf. equation (12). From the
figure it can be seen immediately why the expected number of records in an active
bucket decreases when r increases.
The parameter combination in the example resulted in a storage utilisation of
69.5%. The expected storage utilisation for different bucket capacities and load
factors has been computed and the results are plotted in fig. 5.

Fig. 5. Expected storage utilisation for a dynamic hash file.

The computations were carried out for an infinite number of buckets, i.e. letting
n, m → ∞, keeping n/m = c·l constant. In this case the binomial distribution in (5)
approaches a Poisson distribution with parameter c·l·2^{-r}.
The expected storage utilisation converges rapidly with increasing load factor
and has almost reached its limiting value for a load factor as low as one. The
limiting value depends on the bucket capacity. A low bucket capacity gives a
better expected storage utilisation. This is a consequence of the strategy of
allocating storage only to non-empty buckets.
Our next task is to find an asymptotic representation of the expected number of
allocated buckets and the expected number of nodes in the index. Because of the
close connection with tries we can directly apply results obtained by Knuth [6],
section 6.3, exercises 19 and 20. Knuth has shown that the expected number of
nodes in an M-ary trie containing N records, if branching is stopped whenever a
subfile of at most s records is reached, is

(16) N/(s·ln M) − N·g(N) + O(1),

where g(N) is a complicated periodic function which, for practical purposes, can be
ignored because its absolute value is very small. Applying this formula to each of
the m hash trees and summing would give an approximation of the order O(m). If
we wish a better approximation we must choose another approach. Let us assume
that we extend the index with a level - 1 which employs m-ary branching. If this
m-ary branching is implemented by means of a complete binary tree, we have to
add m - 1 nodes. The internal nodes of this extended index are then a binary trie
which has n/(c·ln 2) − n·g(n) + O(1) nodes. Because the index is a complete binary
tree we have one external node more than internal nodes. Ignoring the function
g(n), this gives us the following asymptotic approximations for the number of
internal and external nodes in the index

(17) E(i-nodes) ≈ n/(c·ln 2) − m,
     E(e-nodes) ≈ n/(c·ln 2).

How much storage space will actually be required for the index depends on the
size of the nodes and the way of representing the trees. The structure used in
section three is the most straightforward but requires the largest storage space.
Other ways of representing trees and forests are discussed in [4]. One alternative
is presented in the subsequent section.
The expected number of allocated buckets equals the expected number of
external nodes less the number of (non-allocated) empty buckets. The next task is
then to find an asymptotic approximation for the expected number of empty
buckets. It suffices to consider one of the hash trees. Let us assume that n' records
hash to the tree. Using equation (7) we obtain the following expression for the
expected number of empty buckets:

(18) E(b_e) = Σ_{r=1}^{∞} 2^r · Σ_{k=c+1}^{n'} C(n', k)·(2^{-r+1})^k·(1 − 2^{-r+1})^{n'−k}·2^{-k}

            = Σ_{k=c+1}^{n'} C(n', k)·2^{-k+1} · Σ_{r=0}^{∞} 2^{-r(k−1)}·(1 − 2^{-r})^{n'−k}.

Applying Euler's summation formula to the last sum and cleaning up reduces (18)
to the following approximation

(19) E(b_e) ≈ (n'/ln 2) · Σ_{k=c+1}^{n'} 2^{-k+1}/(k(k−1)).
The sum converges very rapidly and we can write approximately

(20) E(b_e) ≈ (n'/ln 2)·(1 − ln 2 − Σ_{k=2}^{c} 2^{-k+1}/(k(k−1))).
Even though (20) is only O(n') the error is very small and can safely be ignored for
practical purposes. Combining (17) and (20) gives the following asymptotic
approximation for the expected number of allocated buckets:

(21) E(b) ≈ (n/ln 2)·(1/c − 1 + ln 2 + Σ_{k=2}^{c} 2^{-k+1}/(k(k−1))).

Using (21) we obtain the following asymptotic expected storage utilisations for
different bucket sizes:

c = 2: 78.2%    c = 6: 69.6%
c = 3: 72.6%    c = 7: 69.4%
c = 4: 70.7%    c = 8: 69.4%
c = 5: 69.9%    c ≥ 9: 69.3%
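
These limits follow directly from (21); a short Python check (illustrative):

from math import log

def asymptotic_utilisation(c):
    """E(u) = (n/c)/E(b) with E(b) taken from equation (21)."""
    ln2 = log(2.0)
    s = sum(2.0 ** (1 - k) / (k * (k - 1)) for k in range(2, c + 1))
    b_per_record = (1.0 / c - 1.0 + ln2 + s) / ln2   # E(b)/n
    return (1.0 / c) / b_per_record

for c in (2, 3, 4, 5, 10, 50):
    print(c, round(100 * asymptotic_utilisation(c), 1))
# prints 78.2, 72.6, 70.7, 69.9, 69.3, 69.3 for c = 2, 3, 4, 5, 10, 50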
The probability distribution of the total number of buckets is complicated
because of the stochastic dependence between the number of allocated buckets
on different levels. Simulation experiments indicate that the distribution can be
approximated by means of a normal distribution with a standard deviation of
(22) Std(b) = 0.312·(n/c)^{1/2}.

This formula has been obtained experimentally and hence is only a first
approximation. It can be used for l > 1, c > 5 and a moderately large number of
buckets, say E(b) > 100.
The number of allocated buckets is surprisingly stable. In the above example
with n = 4000, c = 10 the standard deviation is only 6.24 buckets. Using the normal
approximation this means that the number of buckets lies between 563 and 588
and the storage utilisation between 71% and 68% with probability 0.95. If n is
increased to 100000, with an expected number of 14400 buckets, the standard
deviation is as low as 31 buckets.
Using equations (3) and (17) we can determine the approximate probability
that splitting will occur when inserting a record, or that merging will occur when
deleting a record:

(23) P(split/merge) ≈ (1 − 0.5^c)/(c·ln 2), l ≥ 1.

5. Refinements and variations.


In the preceding sections it has been assumed that the records are of fixed
length. This assumption is not in any respect crucial; the algorithms can easily be
modified to handle records of variable length. We just keep account of the free or
occupied space in each bucket instead of the number of records.
The simple scheme discussed in section 3 used explicit pointers to represent the
tree structure of the index. This makes the nodes large and leads to heavy storage
requirements for the index. It is possible to store the search trees in consecutive
storage without using the pointers FATHER, LEFT and RIGHT at all. The size
of an internal node is then reduced to one bit, the TAG field, and an external node
to the bits required for the fields RCRDS and BKT. If necessary the RCRDS field
can be deleted from the external nodes and stored in the buckets.
The m nodes on level zero are stored in locations 1, 2, ..., m. Let L be the
location of a node and L_l and L_r be the locations of its left and right son. If we set

(24) L_l = 2L + m − 1,
     L_r = 2L + m,

the nodes fill the space completely without "gaps". An example forest is shown in
fig. 6.

Fig. 6. Example of consecutive storage allocation for a forest consisting of three binary trees.
The father of a node is found in location

(25) L_f = ⌊(L − m + 1)/2⌋.

The information in a certain external node can be stored in the bits constituting
the sub-tree whose root is the external node. If the node in location 7 in fig. 6 is an
external node, the bits in its RCRDS and BKT fields are stored in locations 16, 17,
34, 35, 36, 37, 70, 71, 72, 73, ..., as many bits as required.
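
The addressing arithmetic of (24) and (25) is easily checked; a small Python sketch for the m = 3 forest of fig. 6 (illustrative):

m = 3  # number of trees in the forest of fig. 6

def left(L):   return 2 * L + m - 1     # equation (24)
def right(L):  return 2 * L + m
def father(L): return (L - m + 1) // 2  # equation (25); roots 1..m have no father

assert (left(7), right(7)) == (16, 17)
assert [father(x) for x in (16, 17, 34, 70)] == [7, 7, 16, 34]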
There is at least one difficulty with this storage allocation scheme for the index.
It works excellently as long as the trees and the forest are well-balanced. If this is
not the case valuable storage space is wasted. If, for example, node 21 in fig. 6 is
split, locations 22 to 43 are unused and there is a "gap". How common such
"gaps" are has not yet been analysed. If the storage efficiency turns out to be low,
one possible remedy could be to employ a combination of the two schemes.
Storage without explicit pointers is used down to a certain level; if there are nodes
below this level they are stored using explicit pointers. If a rough estimate of the
maximum number of buckets required in the file is available, we can calculate on
which level to establish the boundary. The consequences of this combined scheme
have not been studied yet.
The analysis in section 4 revealed that the expected storage utilisation is
approximately 69.3%. In certain instances this might be considered too low. One
method of improving the storage utilisation is to defer the splitting of a bucket
until both the bucket itself and its brother are full. In other words, if the "home"
bucket is full, we first try to store the record in its brother bucket. If this is full as
well, or has already been split, the "home" bucket is split. This modification
obviously leads to higher storage utilisation but requires more complicated
algorithms for searching, insertion and deletion. Searching is slower because it
may be necessary to search two buckets. The author has not analysed the effects
of this modification in depth, but only made a few simulation experiments.
It is evident from the experiments that deferred splitting leads to more unstable
storage utilisation. The expected storage utilisation oscillates as a function of the
load factor. This seems to be a consequence of the fact that all the buckets on a
certain level split more or less simultaneously when the load factor increases to a
certain point. This in turn causes a drop in the expected storage utilisation. In
table 3 the results of simulation runs with the modified scheme are compared with
the results for the original method. The improvements are not very dramatic.
Table 3. Comparison of the expected storage utilisation for
two different splitting strategies, l ≥ 2.

Expected storage utilisation as percentages

 Bucket       Deferred splitting      Original
 capacity       min        max         method

    2           83.3       83.3         78.2
    5           78.9       79.3         69.9
    8           76.2       78.5         69.4
   10           74.7       77.5         69.3
   15           71.7       77.9         69.3
   20           69.5       79.3         69.3

Acknowledgement.
The author wishes to thank the anonymous referee for pointing out the
connection with tries. This resulted in substantial improvements.

REFERENCES
1. R. Bayer, Symmetric binary B-trees: Data structure and maintenance algorithms, Acta Informatica 1
   (1972), 4, 290-306.
2. E. G. Coffman and J. Eve, File structures using hashing functions, Communications of the ACM 13
   (1970), 7, 427-436.
3. G. D. Knott, Hashing functions, The Computer Journal 18 (1975), 3, 265-278.
4. D. E. Knuth, The art of computer programming, Vol. 1: Fundamental algorithms, Addison-Wesley,
   Reading, Mass., 1968.
5. D. E. Knuth, The art of computer programming, Vol. 2: Semi-numerical algorithms, Addison-Wesley,
   Reading, Mass., 1969.
6. D. E. Knuth, The art of computer programming, Vol. 3: Sorting and searching, Addison-Wesley,
   Reading, Mass., 1973.
7. J. Martin, Computer data-base organization, Prentice-Hall, Englewood Cliffs, N.J., 1975.

INSTITUTIONEN FÖR INFORMATIONSBEHANDLING
ÅBO AKADEMI
FÄNRIKSGATAN 3
20500 ÅBO 50
FINLAND
