File Structures
An Object-Oriented Approach with C++

Michael J. Folk
University of Illinois

Bill Zoellick
CAP Ventures

Greg Riccardi
Florida State University

ADDISON-WESLEY
Addison-Wesley is an imprint of Addison Wesley Longman, Inc.
CHAPTER 11

Hashing
CHAPTER OUTLINE

11.1 Introduction
    11.1.1 What Is Hashing?
    11.1.2 Collisions
11.2 A Simple Hashing Algorithm
11.3 Hashing Functions and Record Distributions
    11.3.1 Distributing Records among Addresses
    11.3.2 Some Other Hashing Methods
    11.3.3 Predicting the Distribution of Records
    11.3.4 Predicting Collisions for a Full File
11.4 How Much Extra Memory Should Be Used?
    11.4.1 Packing Density
    11.4.2 Predicting Collisions for Different Packing Densities
11.5 Collision Resolution by Progressive Overflow
    11.5.1 How Progressive Overflow Works
    11.5.2 Search Length
11.6 Storing More Than One Record per Address: Buckets
    11.6.1 Effects of Buckets on Performance
11.1 Introduction

O(1) access to files means that no matter how big the file grows, access to a record always takes the same, small number of seeks. By contrast, sequential searching gives us O(N) access, wherein the number of seeks grows in proportion to the size of the file. As we saw in the preceding chapters, B-trees improve on this greatly, providing O(log_k N) access; the number of seeks increases as the logarithm to the base k of the number of records, where k is a measure of the leaf size. O(log_k N) access can provide
11.1.1 What Is Hashing?

A hash function is like a black box that produces an address every time you drop in a key. More formally, it is a function h(K) that transforms a key K into an address. The resulting address is used as the basis for storing and retrieving the record.
Figure 11.1 Hashing the key LOWELL to address 4, LOWELL's home address.
using the rightmost three digits of the result for the address. Table 11.1 shows how three names would produce three addresses. Note that even though the names are listed in alphabetical order, there is no apparent

[Table 11.1 Columns: name, ASCII code for first two letters, product, and home address.]
11.1.2 Collisions

Now suppose there is a key in the sample file with the name OLIVIER. Because the name OLIVIER starts with the same two letters as the name LOWELL, they produce the same address (004). There is a collision between the record for OLIVIER and the record for LOWELL. We refer to keys that hash to the same address as synonyms.

Collisions cause problems. We cannot put two records in the same space, so we must resolve collisions. We do this in two ways: by choosing hashing algorithms partly on the basis of how few collisions they are likely to produce and by playing some tricks with the way we store records.

The ideal solution to collisions is to find a transformation algorithm that avoids collisions altogether. Such an algorithm is called a perfect hashing algorithm. It turns out to be much more difficult to find a perfect hashing algorithm than one might expect. Suppose, for example, that you want to store 4000 records among 5000 available addresses. It can be shown (Hanson, 1982) that of the huge number of possible hashing algorithms
for doing this, only one out of 10^120,000 avoids collisions altogether. Hence, it is usually not worth trying.¹

A more practical solution is to reduce the number of collisions to an acceptable number. For example, if only one out of ten searches for a record results in a collision, then the average number of disk accesses required to retrieve a record remains quite low. There are several different ways to reduce the number of collisions, including the following:
■ Spread out the records. Collisions occur when two or more records compete for the same address. If we could find a hashing algorithm that distributes the records fairly randomly among the available addresses, then we would not have large numbers of records clustering around certain addresses. Our sample hash algorithm, which uses only two letters from the key, is not good on this account because certain
1. It is not unreasonable to try to generate perfect hashing functions for small (fewer than 500), stable sets of keys, such as might be used to look up reserved words in a programming language. But files generally contain more than a few hundred keys, or they contain sets of keys that change frequently, so they are not normally considered candidates for perfect hashing functions. See Knuth (1998), Sager (1985), Chang (1984), and Cichelli (1980) for more on perfect hashing functions.
called buckets.

In the following sections we elaborate on these collision-reducing approaches.

11.2 A Simple Hashing Algorithm
In this algorithm we use the entire key rather than just the first two letters. By using more parts of a key, we increase the likelihood that differences among the keys cause differences in addresses produced. The extra processing time required to do this is usually insignificant when compared

Folding and adding means chopping off pieces of the number and adding them together. In our algorithm we chop off pieces with two ASCII numbers each:

76 79 | 87 69 | 76 76 | 32 32 | 32 32 | 32 32

These number pairs can be thought of as integer variables (rather than character variables, which is how they started out) so we can do arithmetic on them. If we can treat them as integer variables, then we can add them. This is easy to do in C because C allows us to do arithmetic on characters. In Pascal, we can use the ord() function to obtain the integer position of a character within the computer's character set.
Before we add the numbers, we have to mention a problem caused by the fact that in most cases the sizes of numbers we can add together are limited. On some microcomputers, for example, integer values that exceed 32 767 (15 bits) cause overflow errors or become negative. For example, adding the first five of the foregoing numbers gives

7679 + 8769 + 7676 + 3232 + 3232 = 30 588
Adding in the last 3232 would, unfortunately, push the result over the maximum 32 767 (30 588 + 3232 = 33 820), causing an overflow error. Consequently, we need to make sure that each successive sum is less than 32 767. We can do this by first identifying the largest single value we will ever add in our summation and then making sure after each step that our intermediate result differs from 32 767 by at least that amount.

In our case, let us assume that keys consist only of blanks and uppercase alphabetic characters, so the largest addend is 9090, corresponding to ZZ. Suppose we choose 19 937 as our largest allowable intermediate result. This differs from 32 767 by much more than 9090, so we can be confident (in this example) that no new addition will cause overflow. We can ensure in our algorithm that no intermediate sum exceeds 19 937 by using the mod operator, which returns the remainder when one integer is divided by another:
7679 + 8769   → 16 448   → 16 448 mod 19 937 → 16 448
16 448 + 7676 → 24 124   → 24 124 mod 19 937 → 4 187
4 187 + 3232  → 7 419    → 7 419 mod 19 937  → 7 419
7 419 + 3232  → 10 651   → 10 651 mod 19 937 → 10 651
Why did we use 19 937 as our upper bound rather than, say, 20 000? Because the division and subtraction operations associated with the mod operator are more than just a way of keeping the number small; they are
to be the address of a record, we choose a number as close as possible to the desired size of the address space. This number determines the size of the address space. For a file with 75 records, a good choice might be 101,

If 101 is the size of the address space, the home address of the record in the example becomes

13 883 mod 101 = 46

Hence, the record whose key is LOWELL is assigned to record number 46 in the file.
This procedure can be carried out with the function Hash in Fig. 11.2. Function Hash takes two inputs: key, which must be an array of ASCII codes for at least twelve characters, and maxAddress, which has the maximum address value. The value returned by Hash is the address.

Figure 11.2 Function Hash uses folding and prime number division to compute a hash address for a twelve-character string.
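Figure 11.2 itself is not reproduced here. The following is a minimal C++ sketch of a fold-and-add hash along the lines just described; the function name and the use of std::string are illustrative rather than the book's actual code.

    #include <string>

    // A sketch of fold-and-add hashing as described above: pad the key to
    // twelve characters, fold it into six two-character pieces, sum the
    // pieces modulo 19 937, and take the result modulo maxAddress.
    int Hash(const std::string & key, int maxAddress)
    {
        std::string padded = key;
        padded.resize(12, ' ');              // pad with blanks to twelve characters
        int sum = 0;
        for (int j = 0; j < 12; j += 2)
        {
            // treat a pair of characters as one integer, e.g. 'L','O' -> 7679
            int piece = padded[j] * 100 + padded[j + 1];
            sum = (sum + piece) % 19937;     // keep each intermediate sum small
        }
        return sum % maxAddress;             // home address in 0 .. maxAddress - 1
    }

With these choices, Hash("LOWELL", 101) follows the arithmetic of the worked example above and should yield 46.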
11.3 Hashing Functions and Record Distributions

Of the two hash functions we have so far examined, one spreads out records pretty well, and one does not spread them out well at all. In this section we look at ways to describe distributions of records in files. Understanding distributions makes it easier to discuss other hashing methods.

the addresses. We pointed out earlier that completely uniform distributions are so hard to find it is generally not considered worth trying to find them.
P.
somewhatspread out, but with a few collisions. This is the mostlikely case
W
random, then for a given key every address has the samelikelihood of.
being chosen as every other address. The fact that acertain addressis
chosen for one key neither diminishes nor increasesthelikelihoodthat the
same address will be chosen for anotherkey.
It should be clear that if a random hash functionis used to generate a
large numberof addresses from a large numberof keys, thensimply by
chance someaddresses are going to be generated moreoften than others.If
you have, for example, a random hash function that generates addresses
between 0 and 99 and you give the function one hundred keys, you would
expect some of the one hundred addresses to be chosen more than once '
and some to be chosen notatall. |
Although a random distribution of records amongavailable addresses
is not ideal, it is an acceptable alternative given thatit is practically impos-
sible to find a function that allows a uniform distribution. Uniform distri-
https://hemanthrajhemu.github.io
Uniform distributions may be out of the question, but there are times when we can find distributions that are better than random in the sense that, while they do
has many small factors can result in many collisions. Research has shown that numbers with no divisors less than 19 generally avoid this

orderings among the keys. The next two methods should be tried when,
■ Square the key and take the middle. This popular method (often called the mid-square method) involves treating the key as a single large number, squaring the number, and extracting whatever number of digits is needed from the middle of the result. For example, suppose you want to generate addresses between 0 and 99. If the key is the number 453, its square is 205 209. Extracting the middle two digits yields a number between 0 and 99, in this case 52. As long as the keys do not contain many leading or trailing zeros, this method usually produces fairly random results. One unattractive feature of this method is that it often requires multiple precision arithmetic.
■ Radix transformation. This method involves converting the key to some number base other than the one you are working in, then taking the result modulo the maximum address as the hash address. For
Although there are no nice mathematical tools available for predicting collisions among distributions that are better than random, there are

(knowing that very likely it will be better than random), we can use these
2. This section develops a formula for predicting the ways in which records will be distributed among addresses in a file if a random hashing function is used. The discussion assumes knowledge of some elementary concepts of probability and combinatorics. You may want to skip the development and go straight to the formula, which is introduced in the next section.
p(B) = b = 1/N

since the address has one chance in N of being chosen, and

p(A) = a = (N − 1)/N = 1 − 1/N

since the address has N − 1 chances in N of not being chosen. If there are 10 addresses (N = 10), the probability of our address being chosen is b = 1/10 = 0.1, and the probability of its not being chosen is a = 1 − 0.1 = 0.9.
Now suppose two keys are hashed. What is the probability that both keys hash to our given address? Since the two applications of the hashing function are independent of one another, the probability that both will produce the given address is a product:

b × b = 0.1 × 0.1 = 0.01
ABAB   abab = b²a²   (0.1)²(0.9)² = 0.0036
AABB   aabb = b²a²   (0.1)²(0.9)² = 0.0036

The probability of two Bs and two As is the sum of the probabilities of the individual outcomes:
.V
In general, the event “r trials result in r~x As and x Bs” can happen in
as many ways as r~— x letters A can be distributed among places. The
probability of each such way is
grex px
https://hemanthrajhemu.github.io
p(x) = C × (1/N)^x × (1 − 1/N)^(r−x)

where C has the definition given previously.
What does this mean? It means that if, for example, x = 0, we can compute the probability that a given address will have 0 records assigned to it by the hashing function using the formula

p(0) = C × (1/N)^0 × (1 − 1/N)^(r−0)

If x = 1, this formula gives the probability that one record will be assigned to a given address:

p(1) = C × (1/N)^1 × (1 − 1/N)^(r−1)

imation for p(x) and is much easier to compute. It is called the Poisson function:

p(x) = ((r/N)^x × e^(−r/N)) / x!
where N, r, x, and p(x) have exactly the same meaning they have in the previous section. That is, if

N = the number of available addresses;
r = the number of records to be stored; and
x = the number of records assigned to a given address,

then p(x) gives the probability that a given address will have had x records assigned to it after the hashing function has been applied to all r records.

Suppose, for example, that there are 1000 addresses (N = 1000) and 1000 records whose keys are to be hashed to the addresses (r = 1000). Since
r/N = 1, the probability that a given address will have no keys hashed to it (x = 0) becomes

p(0) = (1^0 × e^−1) / 0! = 0.368

The probabilities that a given address will have exactly one, two, or three keys, respectively, hashed to it are

p(1) = (1^1 × e^−1) / 1! = 0.368

p(2) = (1^2 × e^−1) / 2! = 0.184

p(3) = (1^3 × e^−1) / 3! = 0.061
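A small helper that evaluates the Poisson function defined above can be written as follows; this is only a sketch, and the function name Poisson is an illustrative choice, not something from the text.

    #include <cmath>

    // p(x) = ((r/N)^x * e^(-r/N)) / x! : the probability that a given address
    // is assigned exactly x records when r records are hashed randomly
    // among N addresses.
    double Poisson(double N, double r, int x)
    {
        double mean = r / N;
        double result = std::exp(-mean);
        for (int i = 1; i <= x; ++i)
            result *= mean / i;          // build up mean^x / x! one factor at a time
        return result;
    }

For N = r = 1000, Poisson(1000, 1000, 0) through Poisson(1000, 1000, 3) reproduce the values 0.368, 0.368, 0.184, and 0.061 just computed.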
If we can use the Poisson function to estimate the probability that a given address will have a certain number of records, we can also use it to predict the number of addresses that will have a certain number of records assigned.

For example, suppose there are 1000 addresses (N = 1000) and 1000 records (r = 1000). Multiplying 1000 by the probability that a given address will have x records assigned to it gives the expected total number of addresses with x records assigned to them. That is, 1000 p(x) gives the expected number of such addresses. In general, the expected number of addresses with x records assigned to them is

N × p(x)

This suggests another way of thinking about p(x). Rather than thinking about p(x) as a measure of probability, we can think of p(x) as giving the proportion of addresses having x logical records assigned by hashing.
Now that we have a tool for predicting the expected proportion of addresses that will have zero, one, two, etc. records assigned to them by a random hashing function, we can apply this tool to predicting numbers of collisions.
addresses with two records apiece, however, represent potential trouble. If each such address has space only for one record, and two records are

into the addresses, but another 1839 will not fit. There will be 1839 overflow records.

Each of the 613 addresses with three records apiece has an even bigger problem. If each address has space for only one record, there will be
11.4 How Much Extra Memory Should Be Used?

length of fence. If there are ten tin cans and you throw a rock, there is a certain likelihood that you will hit a can. If there are twenty cans on the same length of fence, the fence has a higher packing density and your rock is more likely to hit a can. So it is with records in a file. The more records there are packed into a given file space, the more likely it is that a collision will occur.

We need to decide how much space we are willing to waste to reduce

3. We assume here that only one record can be stored at each address. In fact, that is not necessarily the case, as we see later.
You may have noted already that the formula for packing density (r/N) occurs twice in the Poisson formula

p(x) = ((r/N)^x × e^(−r/N)) / x!

Indeed, the numbers of records (r) and addresses (N) always occur together as the ratio r/N. They never occur independently. An obvious implication of this is that the way records are distributed depends partly on the ratio of the number of records to the number of available addresses, and not on the absolute numbers of records or addresses. The same behavior is exhibited by 500 records distributed among 1000 addresses as by 500 000 records distributed among 1 000 000 addresses.

Suppose that 1000 addresses are allocated to hold 500 records in a randomly hashed file, and that each address can hold one record. The packing density for the file is

r/N = 500/1000 = 0.5 = 50 percent
■ How many addresses should have exactly one record assigned to them (no synonyms)?

■ How many addresses should have one record plus one or more synonyms?

■ Assuming that only one record can be assigned to each home address, how many overflow records can be expected?

■ What percentage of records should be overflow records?
How Much Extra Memory Should Be Used? 483
3. How many addresses should have one record plus one or more
synonyms? The valuesof p(2), p(3), p(4), and so on give the propor-
tions of addresses with one, two, three, and so on synonyms
assigned to them. Hencethe sum
M
percent loaded, one would not expect very many keys to hash to
CO
any one address. Therefore, the numberof addresses with more
than about three keys hashed to them should be quite small. We
P.
insignificantly small:
LO
= 0.0902
.V
W
4. Assuming that only one record can be assigned to each home address, how many overflow records could be expected? For each of the addresses represented by p(2), one record can be stored at the address and one must be an overflow record. For each address represented by p(3), one record can be stored at the address, two are overflow records, and so on. Hence, the expected number of overflow records is given by

1 × N × p(2) + 2 × N × p(3) + 3 × N × p(4) + 4 × N × p(5)
  = N × [1 × p(2) + 2 × p(3) + 3 × p(4) + 4 × p(5)]
  = 1000 × [1 × 0.0758 + 2 × 0.0126 + 3 × 0.0016 + 4 × 0.0002]
  = 107
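The calculation in item 4 can also be expressed directly in terms of the hypothetical Poisson helper sketched earlier; the cutoff of twenty terms is an arbitrary choice that more than covers the terms that matter here.

    // Expected number of overflow records when each address holds one record:
    // N * [1*p(2) + 2*p(3) + 3*p(4) + ...]
    double ExpectedOverflow(double N, double r, int maxTerms = 20)
    {
        double overflow = 0.0;
        for (int x = 2; x <= maxTerms; ++x)
            overflow += (x - 1) * Poisson(N, r, x);
        return N * overflow;
    }

With N = 1000 and r = 500 this returns approximately 107, matching the result above.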
when the packing density is 10 percent looks very good until you realize that for every record in your file there will be nine unused spaces! The 36.8 percent that results from 100 percent usage looks good when

doesn't tell the whole story. If 36.8 percent of the records are not at their

Packing density (%)    Synonyms as % of records
50                     21.4
60                     24.8
70                     28.1
80                     31.2
90                     34.1
100                    36.8
11.5 Collision Resolution by Progressive Overflow

number of techniques for handling overflow records, and the search for ever better techniques continues to be a lively area of research. We examine several approaches, but we concentrate on a very simple one that often works well. The technique has various names, including progressive overflow.

11.5.1 How Progressive Overflow Works

In the example, we want to store the record whose key is York in the file. Unfortunately, the name York hashes to the same address as the name
Figure 11.4 Collision resolution with progressive overflow.

Since the file holds only 100 records, it is not possible to use 100 as the next address.

Since address 0 is not occupied in this case, Blue gets stored in address 0.
What happens if there is a search for a record but the record was never placed in the file? The search begins, as before, at the record's home address, then proceeds to look for it in successive locations. Two things can happen:

■ If an open address is encountered, the searching routine might assume this means that the record is not in the file; or

■ If the file is full, the search comes back to where it began. Only then is it clear that the record is not in the file. When this occurs, or even when we approach filling our file, searching can become intolerably slow, whether or not the record being sought is in the file.

The greatest strength of progressive overflow is its simplicity. In many cases, it is a perfectly adequate method. There are, however, collision-handling techniques that perform better than progressive overflow, and we examine some of them later in this chapter. Now let us look at the effect of progressive overflow on performance.
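The search logic just described can be sketched as follows for an in-memory table in which an empty string marks an open address; file I/O and record contents are omitted, and the names are illustrative.

    #include <string>
    #include <vector>

    // Search a table loaded with progressive overflow. Returns the slot that
    // holds the key, or -1 if an open address (or a full circuit of the table)
    // shows that the key is not present.
    int FindSlot(const std::vector<std::string> & table, int homeAddress,
                 const std::string & key)
    {
        int N = static_cast<int>(table.size());
        for (int probe = 0; probe < N; ++probe)
        {
            int address = (homeAddress + probe) % N;   // wrap around past the end
            if (table[address] == key) return address; // found the record
            if (table[address].empty()) return -1;     // open address: not in the file
        }
        return -1;                                     // table is full and key absent
    }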
Figure 11.5 Searching for an address beyond the end of a file.
The reason to avoid overflow is, of course, that extra searches (hence, extra disk accesses) have to occur when a record is not found in its home address. If there are a lot of collisions, there are going to be a lot of overflow records taking up spaces where they ought not to be. Clusters of records can form, resulting in the placement of records a long way from

Key      Home address
Adams    20
Bates    21
Cole     21
Dean     22
Evans    20
Actual address    Record       Home address    Number of accesses needed to retrieve
20                Adams...     20              1
21                Bates...     21              1
22                Cole...      21              2
23                Dean...      22              2
24                Evans...     20              5

As keys become clustered, the number of accesses required to access later keys can become large.
shows where each key is stored, together with information on how many

search length. The average search length is the average number of times you can expect to have to access the disk to retrieve a record. A rough estimate of average search length may be computed by finding the total search length (the sum of the search lengths of the individual records) and dividing this by the number of records:

Average search length = total search length / total number of records
For the five records above, the average search length is

(1 + 1 + 2 + 2 + 5) / 5 = 2.2
With no collisions at all, the average search length is 1, since only one access is needed to retrieve any record. (We indicated earlier that an algorithm that distributes records so evenly that no collisions occur is appropriately called a perfect hashing algorithm, and we mentioned that, unfortunately, such an algorithm is almost impossible to construct.) On the other hand, if a large number of the records in a file result in collisions, the average search length becomes quite long. There are ways to estimate the expected average search length, given various file specifications, and we discuss them in a later section.

It turns out that, using progressive overflow, the average search length increases very rapidly as the packing density increases. The curve in Fig. 11.7, adapted from Peterson (1957), illustrates the problem. If the packing density is kept as low as 60 percent, the average record takes fewer than two tries to access, but for a much more desirable packing density of 80 percent or more, it increases very rapidly.

[Figure 11.7 Average search length versus packing density, adapted from Peterson (1957). Horizontal axis: packing density, 20 to 100 percent; vertical axis: average search length.]
11.6 Storing More Than One Record per Address: Buckets

our hashing program. The change involves putting more than one record at a single address.

especially when those records are seen as sharing the same address. On sector-addressing disks, a bucket typically consists of one or more sectors; on block-addressing disks, a bucket might be a block.

Suppose we want to store the following keys, with the home addresses shown, in a hash file.
Key      Home address
Green    30
Hall     30
Jenks    32
King     33
Land     33
Marx     33
Nutt     33

Figure 11.8 illustrates part of a file into which the records with these keys are loaded. Each address in the file identifies a bucket capable of holding the records corresponding to three synonyms. Only the record corresponding to Nutt cannot be accommodated in a home address.

When a record is to be stored or retrieved, its home bucket address is determined by hashing. The entire bucket is loaded into primary memory. An in-memory search through successive records in the bucket can then be used to find the desired record. When a bucket is filled, we still have to
[Figure 11.8 Bucket contents: address 30 holds Green and Hall; address 32 holds Jenks; address 33 holds King, Land, and Marx; Nutt is an overflow record.]

worry about the record overflow problem (as in the case of Nutt), but this occurs much less often when buckets are used than when each address can

changed slightly since each bucket address can hold more than one record.
To compute how densely packed a file is, we need to consider both the number of addresses (buckets) and the number of records we can fit at each address (bucket size). If N is the number of addresses and b is the number of records that fit in a bucket, then bN is the number of available locations for records. If r is still the number of records in the file, then

Packing density = r / bN

■ We can store the 750 records among 1000 addresses, where each address can hold one record. The packing density is then

750 / 1000 = 0.75 = 75%
■ We can store the 750 records among 500 locations, where each location has a bucket size of 2. There are still 1000 places (2 × 500) to store the 750 records, so the packing density is still

r / bN = 0.75 = 75%

Since the packing density is not changed, we might at first not expect the use of buckets in this way to improve performance, but in fact it does improve performance dramatically. The key to the improvement is that, although there are fewer addresses, each individual address has more room for variation in the number of records assigned to it.

Let's calculate the difference in performance for these two ways of storing the same number of records in the same amount of space. The starting point for our calculations is the fundamental description of each file structure.

                           File without buckets    File with buckets
Number of records          r = 750                 r = 750
Number of addresses        N = 1000                N = 500
Bucket size                b = 1                   b = 2
Packing density (r/bN)     0.75                    0.75
Ratio r/N                  0.75                    1.5

case of each file, recall that when a random hashing function is used, the Poisson function

p(x) = ((r/N)^x × e^(−r/N)) / x!
Note that the bucket column in Table 11.3 is longer than the nonbucket column. Does this mean that there are more synonyms in the bucket case than in the nonbucket case? Indeed it does, but half of those synonyms do not result in overflow records because each bucket can hold two

exactly one record does not have any overflow. Any address with more than one record does have overflow. Recall that the expected number of overflow records is given by

N × [1 × p(2) + 2 × p(3) + 3 × p(4) + 4 × p(5) + ...]

which, for r/N = 0.75 and N = 1000, is approximately 222.
there are overflow records. For each address represented by p(3), two records can be stored at the address, and one must be an overflow record. Similarly, for each address represented by p(4), there are two overflow records, and so forth. Hence, the expected number of overflow records in the bucket file is

N × [1 × p(3) + 2 × p(4) + 3 × p(5) + 4 × p(6) + ...]

which, for r/N = 1.5 and N = 500, is approximately 140.
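The bucket-size generalization of these two formulas can be sketched with the hypothetical Poisson helper used earlier: an address with x records assigned to it produces x − b overflow records when the bucket holds b records. The function name and the cutoff of thirty terms are assumptions of this sketch.

    // Expected overflow records for a file of N bucket addresses, each holding
    // up to b records, with r records in all: N * sum over x > b of (x - b) * p(x).
    double ExpectedOverflowBuckets(double N, double r, int b, int maxTerms = 30)
    {
        double overflow = 0.0;
        for (int x = b + 1; x <= maxTerms; ++x)
            overflow += (x - b) * Poisson(N, r, x);
        return N * overflow;
    }

ExpectedOverflowBuckets(1000, 750, 1) gives roughly 222 overflow records (29.6 percent of the 750 records), while ExpectedOverflowBuckets(500, 750, 2) gives roughly 140 (the 18.7 percent referred to below).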
The packing density remains 75 percent, but the expected number of overflow records drops to 18.7 percent. That is about a 37 percent decrease in the number of times the program has to look elsewhere for a record. As the

packing densities and for different bucket sizes. We see from the table, for

It should be clear that the use of buckets can improve hashing performance
Packing density (%)    Percent of records that are overflow records, by bucket size
80                     31.2    20.4    11.3    5.2     0.1
90                     34.1    23.8    13.8    8.6     0.8

store five records per cluster, and let the remaining 24 bytes go unused.

cluster than it is to access a single record, the only losses from the use of buckets are the extra transmission time and the 24 unused bytes.
Packing density (%)    Bucket size:  1       2       3       10      100
95                                   10.50   5.60    2.70    1.80    1.10

Adapted from Donald Knuth, The Art of Computer Programming, Vol. 3, © 1973, Addison-Wesley.

In the early chapters of this text, we paid quite a bit of attention to issues
Bucket Structure

The only difference between a file with buckets and one in which each address can hold only one key is that with a bucket file each address has enough space to hold more than one logical record. All records that are housed in the same bucket share the same address. Suppose, for example, that we want to store as many as five names in one bucket. Here are three such buckets with different numbers of records.

A full bucket:    5 | JONES | ARNSWORTH | STOCKTON | BRICE | THROOP

it has stored in it. Collisions can occur only when the addition of a new record causes the counter to exceed the number of records a bucket can hold.
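A bucket of this kind might be laid out as a fixed-length structure with a count field, along the following lines; the sizes and names here are illustrative only.

    const int BucketSize = 5;    // maximum number of records per bucket
    const int KeySize   = 12;    // fixed-size keys, padded with blanks

    // A bucket as it might be written to the file: a counter followed by
    // space for BucketSize fixed-length records. A count of 0 marks a
    // bucket that is still empty.
    struct Bucket
    {
        int  count;                          // number of slots currently in use
        char keys[BucketSize][KeySize + 1];  // the records (here, just their keys)
    };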
The counter tells us how many data records are stored in a bucket, but it does not tell us which slots are used and which are not. We need a way to
Loading a Hash File

A program that loads a hash file is similar in many ways to earlier programs we used for populating fixed-length record files, with two differences. First, the program uses the function hash to produce a home address for each key. Second, the program looks for a free space for the record by starting with the bucket stored at its home address and then, if the home bucket is full, continuing to look at successive buckets until one is found that is not full. The new record is inserted in this bucket, and the bucket is rewritten to the file at the location from which it was loaded.

If, as it searches for an empty bucket, a loading program passes the maximum allowable address, it must wrap around to the beginning address. A potential problem occurs in loading a hash file when so many records have been loaded into the file that there are no empty spaces left. A naive search for an open slot can easily result in an infinite loop. Obviously, we want to prevent this from occurring by having the program make sure that there is space available for each new record somewhere in the file.

If there is a danger of duplicate keys occurring, and duplicate keys are not allowed in the file, some mechanism must be found for dealing with this problem.
11.7 Making Deletions

Record    Home address    Actual address
Adams     5               5
Jones     6               6
Morris    6               7
Smith     5               8

in alphabetical order using progressive overflow for collisions, they are stored in the locations shown in Fig. 11.9.

successively looks for Smith at addresses 6, 7, and 8, then finds Smith at 8. Now
[Figure 11.10 The file of Fig. 11.9 with Morris deleted: address 7 now holds the marker ######.]

does not halt at the empty record number 7. Instead, it uses the ###### as

For example, suppose in the preceding example that the record for Smith is
shown in Fig. 11.11. Now suppose you want a program to insert Smith into the file. If the program simply searches until it encounters a ######, it never notices that Smith is already in the file. We almost certainly don't want to put a second Smith record into the file, since doing so means that later searches would never find the older Smith record. To prevent this from occurring, the program must examine the entire cluster of contiguous keys and tombstones to ensure that no duplicate key exists, then go back and insert the record in the first available tombstone, if there is one.
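The insertion rule just described can be sketched as follows, again for an in-memory table in which "#" stands for a tombstone and an empty string for a slot that has never been used; the names are illustrative.

    #include <string>
    #include <vector>

    const std::string Tombstone = "#";   // marker left behind by a deletion

    // Insert with progressive overflow in the presence of tombstones: scan the
    // whole cluster for a duplicate first, remembering the first tombstone seen,
    // then place the record there (or in the open slot that ends the cluster).
    bool Insert(std::vector<std::string> & table, int homeAddress,
                const std::string & key)
    {
        int N = static_cast<int>(table.size());
        int firstFree = -1;
        for (int probe = 0; probe < N; ++probe)
        {
            int address = (homeAddress + probe) % N;
            if (table[address] == key) return false;        // duplicate key
            if (table[address] == Tombstone)
            {
                if (firstFree == -1) firstFree = address;    // reusable slot
            }
            else if (table[address].empty())
            {
                if (firstFree == -1) firstFree = address;    // open slot ends the cluster
                break;
            }
        }
        if (firstFree == -1) return false;                   // file is full
        table[firstFree] = key;
        return true;
    }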
The use of tombstones enables our search algorithms to work and helpsin
storage recovery, but one can still expect some deterioration in perfor-
mance after a numberof deletions and additions occur within file.
Consider, for example, our little four-record file of Adams, Jones,
M
Smith, and Morris. After deleting Morris, Smithis one slotfurther from its
CO
homeaddress than it needs to be. If the tombstone is never to be used to
store another record, every retrieval of Smith requires one more access
P.
deletions, one can expect to find many tombstones occupying places that
LO
age search lengthis aslikely to get betteras it is to get worse (Bradley, 1982;
Peterson, 1957). By this time, however, search performancehas deteriorat-
ed to the point at which the average recordis three times as far (in terms
of accesses) from its home addressas it would be after initial loading. This
means, for example, that if after original loading the average search length
is 1.2, it will be about 1.6 after the point of equilibrium is reached.
There are three types of solutions to the problemof deteriorating
average search lengths. One involves doing a bit of local reorganizing every
time a deletion occurs. For example, the deletion algorithm might exam-
ine the records that follow a tombstone to see if the search length can be
shortened by moving the record backward toward its home address.
Another solution involves completely reorganizing the file after the aver-
age search length reaches an unacceptable value. A third type ofsolution
involves using an altogether differentcollision resolution algorithm.
https://hemanthrajhemu.github.io
11.8 Other Collision Resolution Techniques

store overflow records a long way from their home addresses by double hashing. With double hashing, when a collision occurs, a second hash function is applied to the key to produce a number c that is relatively prime to the number of addresses.⁴

records some distance from their home addresses, increasing the likelihood that the disk will need extra time to get to the new overflow address. If the file covers more than one cylinder, this could require an expensive extra head movement. Double hashing programs can solve this problem if they are able to generate overflow addresses in such a way that overflow records are kept on the same cylinder as home records.

4. If N is the number of addresses, then c and N are relatively prime if they have no common divisors.
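A sketch of a double-hashing probe sequence is shown below. Consistent with footnote 4, the step c produced by the second hash function must be relatively prime to the number of addresses N; choosing N prime and c between 1 and N − 1, as the hypothetical SecondHash below does, is one simple way to guarantee that.

    // Double hashing: when a collision occurs, probe at steps of c rather than 1.
    // Because c and N are relatively prime, the probe sequence visits every address.
    int Probe(int homeAddress, int c, int i, int N)
    {
        return (homeAddress + i * c) % N;    // i-th address tried for this key
    }

    // One possible second hash function when N is prime: c = 1 + (key mod (N - 1)).
    int SecondHash(long key, int N)
    {
        return 1 + static_cast<int>(key % (N - 1));
    }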
record with the same home address. The next record in turn contains a pointer to the following record with the same home address, and so forth. The net effect of this is that for each set of synonyms there is a linked list connecting their records, and it is this list that is searched when a record is sought.

The advantage of chained progressive overflow over simple progressive overflow is that only records with keys that are synonyms need to be accessed in any given search. Suppose, for example, that the set of keys shown in Fig. 11.12 is to be loaded in the order shown into a hash file with bucket size 1, and progressive overflow is used. A search for Cole involves

worst case, requires six accesses, only two of which involve synonyms.
forms a linked list connecting these three names, with Adams at the head of the list. Since Bates and Dean are also synonyms, they form a second list. This arrangement is illustrated in Fig. 11.13. The average search length decreases from 2.5 to

(1 + 1 + 2 + 2 + 1 + 3) / 6 = 1.7
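With a link field in each record, the chain of synonyms can be followed as in the sketch below; the assumption made here (and discussed next in the text) is that the record stored at a home address really does belong there, so that the head of each chain is a home record.

    #include <string>
    #include <vector>

    struct Slot
    {
        std::string key;
        int         next;   // address of the next record with the same home address, or -1
    };

    // Search the chain of synonyms that starts at the home address.
    int FindInChain(const std::vector<Slot> & table, int homeAddress,
                    const std::string & key)
    {
        for (int address = homeAddress; address != -1; address = table[address].next)
            if (table[address].key == key) return address;
        return -1;          // reached the end of the chain without finding the key
    }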
The use of chained progressive overflow requires that we attend to some details that are not required for simple progressive overflow. First, a link field must be added to each record, requiring the use of a little more storage. Second, a chaining algorithm must guarantee that it is possible to get to any synonym by starting at its home address. This second requirement is not a trivial one, as the following example shows.

Suppose that in the example Dean's home address is 22 instead of 21. Since, by the time Dean is loaded, address 22 is already occupied by Cole,
Home address   Actual address   Record      Next synonym   Search length
20             20               Adams...    22             1
21             21               Bates...    23             1
20             22               Cole...     25             2
21             23               Dean...     -1             2
24             24               Evans...    -1             1
20             25               Flint...    -1             3

Figure 11.13 Hashing with chained progressive overflow. Adams, Cole, and Flint are synonyms; Bates and Dean are synonyms.
O
LO
Deanstill ends up at address 23. Does this mean that Cole’s pointer should
point to 23 (Dean’s actual address) or to 25 (the address of Cole’s synonym
TU
Flint)? If the pointeris 25, the linked list joining Adams, Cole, and Flintis
.V
The problem here is that a certain address (22) that should be‘occu-
W
https://hemanthrajhemu.github.io
called the overflow area. The advantage of this approach is that it keeps all unused but potential home addresses free for later additions.

In terms of the file we examined in the preceding section, the records for Cole, Dean, and Flint could have been stored in a separate overflow

home address has room, it is stored there. If not, it is moved to the overflow file, where it is added to the linked list that starts at the home address.

Figure 11.14 Chaining to a separate overflow area. Adams, Cole, and Flint are synonyms; Bates and Dean are synonyms.
If the bucket size for the primary file is large enough to prevent excessive numbers of overflow records, the overflow file can be a simple entry-sequenced file with a bucket size of 1. Space can be allocated for overflow records only when it is needed.

The use of a separate overflow area simplifies processing somewhat and would seem to improve performance, especially when many additions and deletions occur. However, this is not always the case. If the separate overflow area is on a different cylinder than is the home address, every search for an overflow record will involve a very costly head movement. Studies show that access time is generally worse when overflow records are stored in a separate overflow area than when they are stored in the prime overflow area (Lum, 1971).

One situation in which a separate overflow area is required occurs when the packing density is greater than one: there are more records than home addresses. If, for example, it is anticipated that a file will grow beyond the capacity of the initial set of home addresses and that rehashing the file with a larger address space is not reasonable, then a separate overflow area must be used.

rather than by some other method. The term scatter table (Severance,

Figure 11.15 Example of a scatter table structure. Because the hashed part is an index, the data file may be organized in any way that is appropriate.
11.9 Patterns of Record Access

Twenty percent of the fishermen catch 80 percent of the fish.
Twenty percent of the burglars steal 80 percent of the loot.
L. M. Boyd

The use of different collision resolution techniques is not the only, nor necessarily the best, way to improve performance in a hashed file. If we

grocery items and you have on your computer a hashed inventory file with a record for each of the 10 000 items that your company handles. Every time an item is purchased, the record that corresponds to that item must be accessed. Since the file is hashed, it is reasonable to assume that the 10 000 records are distributed randomly among the available addresses that make up the file. Is it equally reasonable to assume that accesses to the records in the inventory are randomly distributed? Probably not. Milk, for example, will be retrieved very frequently, brie seldom.

There is a principle used by economists called the Pareto Principle, or The Concept of the Vital Few and the Trivial Many, which in file terms says that a small percentage of the records in a file account for a large percentage of the accesses. A popular version of the Pareto Principle is the 80/20 Rule of Thumb: 80 percent of the accesses are performed on 20 percent of the records. In our groceries file, milk would be among the 20 percent high-activity items, brie among the rest.
have been accessed. When the sorted file is rehashed and reloaded, the first records to be loaded are the ones that, according to the previous month's experience, are most likely to be accessed. Since they are the first ones loaded, they are also the ones most likely to be loaded into their home addresses. If reasonably sized buckets are used, there will be very few, if any, high-activity items that are not in their home addresses and therefore retrievable in one access.
SUMMARY

There are three major modes for accessing files: sequentially, which provides O(N) performance, through tree structures, which can produce O(log_k N) performance, and directly. Direct access provides O(1) performance, which means that the number of accesses required to retrieve a record is constant and independent of the size of the file. Hashing is the primary form of organization used to provide direct access.

Hashing can provide faster access than most of the other organizations we study, usually with very little storage overhead, and it is adaptable to most types of primary keys. Ideally, hashing makes it possible to find any record with only one disk access, but this ideal is rarely achieved. The primary disadvantage of hashing is that hashed files may not be sorted by key.
approaches to reducing the number of collisions:

■ Spreading out the records;

■ Using extra memory;

■ Using buckets.
the file designer. The number of records that can be stored at a given address, called bucket size, determines the point at which records assigned to the address will overflow. The Poisson function can be used to explore

Buckets, combined with a low packing density, can result in very small average search lengths.

home address in order until one is found to hold the new record. If a record is sought and is not found in its home address, successive addresses are searched until either the record is found or an empty address is encountered.

Progressive overflow is simple and sometimes works very well. However, progressive overflow creates long search lengths when the packing density is high and the bucket size is low. It also sometimes produces clusters of records, creating very long search lengths for new records whose home addresses are in the clusters.

Three problems associated with record deletion in hashed files are

1. The possibility that empty slots created by deletions will hinder later searches for overflow records;
records so far from home that they require extra seeks.

2. Chained progressive overflow reduces search lengths by requiring that

record for some record in the file must hold a home record. Mechanisms for making sure that this occurs are discussed.

ly and has the advantage that the overflow area may be organized in
KEY TERMS

Average search length. We define average search length as the sum of the number of accesses required for each record in the file divided by the number of records in the file. This definition does not take into account the number of accesses required for unsuccessful searches, nor does it account for the fact that some records are likely to be accessed more often than others. See 80/20 rule of thumb.

Better-than-random. This term is applied to distributions in which the records are spread out more uniformly than they would be if the hash function distributed them randomly. Normally, the distribution produced by a hash function is a little bit better than random.

Bucket. An area of space on the file that is treated as a physical record for storage and retrieval purposes but is capable of storing several logical records. By storing and retrieving logical records in buckets rather than individually, access times can, in many cases, be improved substantially.
direct access to records in a file, but hashing is also often used to access items in arrays in memory. In indexing, for example, an index might be organized for hashing rather than for binary search if extremely fast searching of the index is desired.

Home address. The address generated by a hash function for a given key. If a record is stored at its home address, then the search length for the record is 1 because only one access is required to retrieve the record. A record not at its home address requires more than one access to retrieve or store.

Indexed hash. Instead of using the results of a hash to produce the address of a record, the hash can be used to identify a location in an index that in turn points to the address of the record. Although this approach requires one extra access for every search, it makes it possible to organize the data records in a way that facilitates other types of processing, such as sequential processing.

Mid-square method. A hashing method in which a representation of the key is squared and some digits from the middle of the result are used to produce the address.
Synonyms. Two or more different keys that hash to the same address. When each file address can hold only one record, synonyms always

ated with the deletion of records: the freed space does not break a sequential search for a record, and the freed space is easily recognized

FURTHER READINGS

There are a number of good surveys of hashing and issues related to hashing generally, including Knuth (1998), Severance (1974), Maurer (1975), and Sorenson, Tremblay, and Deutscher (1978). Textbooks concerned with file design generally contain substantial amounts of material on hashing, and they often provide extensive references for further study. Loomis (1989) also covers hashing generally, with additional emphasis on pro-
EXERCISES

1. Use the function hash(KEY, MAXAD) described in the text to answer the following questions.
   a. What is the value of hash("Jacobs", 101)?
   b. Find two different words of more than four characters that are synonyms.
   c. It is assumed in the text that the function hash does not need to

lem if we have a file with addresses larger than 19 937. Suggest some
   a. How many unique keys are possible? (Hint: If K were one uppercase letter rather than five, there would be 26 possible unique keys.)
   b. How are n and r related?
   c. How are r and M related?
   d. If the function h were a minimum perfect hashing function, how would n, r, and M be related?

The following table shows distributions of keys resulting from three different hash functions on a file with 6000 records and 6000 addresses.

         Function A    Function B    Function C
d(0)     0.71          0.25          0.40
d(1)     0.05          0.50          0.36
d(2)     0.05          0.25          0.15
d(3)     0.05          0.00          0.05
d(4)     0.05          0.00          0.02
d(5)     0.04          0.00          0.01
   d. The expected number of addresses with one record plus one or more synonyms;
   e. The expected number of overflow records; and
   f. The expected percentage of overflow records.

6. Consider the file described in the preceding exercise. What is the expected number of overflow records if the 10 000 locations are reorganized as
   a. 5000 two-record buckets; and
   b. 1000 ten-record buckets?

7. Make a table showing Poisson function values for r/N = 0.1, 0.5, 0.8, 1, 2, 5, and 11. Examine the table and discuss any features and patterns that provide useful information about hashing.

8. There is an overflow handling technique called count-key progressive overflow (Bradley, 1982) that works on block-addressable disks as follows. Instead of generating a relative record number from a key, the hash function generates an address consisting of three values: a cylinder, a track, and a block number. The corresponding three numbers

   I/O processor can direct the disk drive to search a track for the desired record. It can even direct the disk to search for an empty record slot if

   a. What is it about this technique that makes it superior to progressive overflow techniques that might be implemented on sector-organized drives?
   b. The main disadvantage of this technique is that it can be used only with a bucket size of 1. Why is this the case, and why is it a disadvantage?

9. In discussing implementation issues, we suggest initializing the data file by creating real records that are marked empty before loading the file with data. There are some good reasons for doing this. However, there might be some reasons not to do it this way. For example, suppose you want a hash file with a very low packing density and cannot afford to have the unused space allocated. How might a file management system be designed to work with a very large logical file but allocate space only for those blocks in the file that contain data?
Add Finch    0
Add Gates    2
Del Alan
Add Hart     3

two-pass loading?

11. Suppose you have a file in which 20 percent of the records account for 80 percent of the accesses and that you want to store the file with a packing density of 0 and a bucket size of 5. When the file is loaded, you load the active 20 percent of the records first. After the active 20 percent of the records are loaded and before the other records are loaded, what is the packing density of the partially filled file? Using this packing density, compute the percentage of the active 20 percent that would be overflow records. Comment on the results.

12. In our computations of average search lengths, we consider only the time it takes for successful searches. If our hashed file were to be used in such a way that searches were often made for items that are not in the file, it would be useful to have statistics on average search length for an unsuccessful search. If a large percentage of searches to a hashed file are unsuccessful, how do you expect this to affect overall performance if overflow is handled by
    a. Progressive overflow; or
    b. Chaining to a separate overflow area?
    (See Knuth, 1973b, pp. 535-539 for a treatment of these differences.)

13. Although hashed files are not generally designed to support access to records in any sorted order, there may be times when batches of transactions need to be performed on a hashed data file. If the data file is sorted (rather than hashed), these transactions are normally carried out by some sort of cosequential process, which means that the transaction file also has to be sorted. If the data file is hashed, the transaction file might also be presorted, but on the basis of the home addresses of its records rather than some more "natural" criterion.

    Suppose you have a file whose records are usually accessed directly but is periodically updated from a transaction file. List the factors you would have to consider in deciding between using an indexed sequential organization and hashing. (See Hanson, 1982, pp. 280-285, for a discussion of these issues.)

14. We assume throughout this chapter that a hashing program should

    address. If this were not so, there would be times when we would

    disastrous result. But consider what Doug McIlroy did in 1978 when he was designing a spelling checker program. He found that by letting his program allow one out of every four thousand misspelled words
PROGRAMMING EXERCISES

15. Implement and test a version of the function hash.

16. Create a hashed file with one record for every city in California. The key in each record is to be the name of the corresponding city. (For the purposes of this exercise, there need be no fields other than the key field.) Begin by creating a sorted list of the names of all of the cities and towns in California. (If time or space is limited, just make a list of names starting with the letter S.)
    a. Examine the sorted list. What patterns do you notice that might affect your choice of a hash function?
    b. Implement the function hash in such a way that you can alter the number of characters that are folded. Assuming a packing density of 1, hash the entire file several times, each time folding a different number of characters and producing the following statistics for each run:
       • The number of collisions; and
       records.

       Discuss the results of your experiment in terms of the effects of

17. Using a set of keys, such as the names of California towns, do the following:
    a. Write and test a program for loading the keys into three different hash files using bucket sizes of 1, 2, and 5, respectively, and a packing density of 0.8. Use progressive overflow for handling collisions.
    b. Have your program maintain statistics on the average search length, the maximum search length, and the percentage of records that are overflow records.
    c. Assuming a Poisson distribution, compare your results with the expected values for average search length and the percentage of records that are overflow records.

18. Repeat exercise 17, but use double hashing to handle overflow.
19. Repeat exercise 17, but handle overflow using chained overflow into a separate overflow area. Assume that the packing density is the ratio of the number of keys to available home addresses.

20. Write a program that can perform insertions and deletions in the file created in the previous problem using a bucket size of 5. Have the program keep running statistics on average search length. (You might also implement a mechanism to indicate when search length has deteriorated to a point where the file should be reorganized.) Discuss in detail the issues you have to confront in deciding how to handle insertions and deletions.
CHAPTER 12

Extendible Hashing
CHAPTER OUTLINE

12.1 Introduction
12.2 How Extendible Hashing Works
    12.2.1 Tries
    12.2.2 Turning the Trie into a Directory
    12.2.3 Splitting to Handle Overflow
12.3 Implementation
    12.3.1 Creating the Addresses
    12.3.2 Classes for Representing Bucket and Directory Objects
    12.3.3 Bucket and Directory Operations
    12.3.4 Implementation Summary
12.4 Deletion
    12.4.1 Overview of the Deletion Process
    12.4.2 A Procedure for Finding Buddy Buckets
    12.4.3 Collapsing the Directory
    12.4.4 Implementing the Deletion Operations
    12.4.5 Summary of the Deletion Operation
12.5 Extendible Hashing Performance

12.1 Introduction
How Extendible Hashing Works 525
M
after the initial burst of research and design work revolving around B-trees was over, a number of researchers began to work on finding ways to modify hashing so that it, too, could be self-adjusting as files grow and shrink.
12.2.1 Tries
Figure 12.1 Radix 26 trie that indexes names (able, abrahms, anderson, andrews, ...) according to the letters of the alphabet.
used the digits 0-9 as our search alphabet rather than the letters a-z, the radix of the search would be reduced to 10. A search tree using digits might look like the one shown in Fig. 12.2.
Figure 12.2 Radix 10 trie that indexes numbers (such as 7263 and 7268) according to the digits they contain.
are faced once again with all of the problems associated with storing trees
hashing.
Figure 12.4 The trie from Fig. 12.3 transformed first into a complete binary tree, then flattened into a directory to the buckets.
the same level as shown in Fig. 12.4(a). Even though the initial 0 is enough to select bucket A, the new form of the tree also uses the second address bit so both alternatives lead to the same bucket. Once we have extended the tree this way, we can collapse it into the directory structure shown in Fig. 12.4(b). Now we have a structure that provides the kind of direct access associated with hashing: given an address beginning with the bits 10, the directory cell with index 10 (binary) takes us directly to the correct bucket.
Let's consider a more complex case. Starting once again with the directory and buckets in Fig. 12.4(b), suppose that bucket B overflows. How do we split bucket B and where do we attach the new bucket after the split? Unlike our previous example, we do not have additional, unused bits of address space that we can press into duty as we split the bucket. We now need to use 3 bits of the hash address in order to divide up the records that hash to bucket B. The trie illustrated in Fig. 12.6(a) makes this clear. Figure 12.6(b) shows what this trie looks like once it is extended into a completely full binary tree with all leaves at the same level, and Fig. 12.6(c) shows the collapsed, directory form of the tree. The ability to extend (or shrink) the address space gracefully is what extendible hashing is all about.
We have been concentrating on the contribution that tries make to extendible hashing; one might well ask where the hashing comes into play. Why not just use the tries on the bits in the key, splitting buckets and extending the address space as necessary? The answer to this question grows out of hashing's most fundamental characteristic: a good hash function produces a nearly uniform distribution of keys across an address space. Notice that the trie shown in Fig. 12.6 is poorly balanced, resulting in a directory that is twice as big as it needs to be. If we had an uneven distribution of addresses that placed even more records in buckets B and D without using other parts of the address space, the situation would get even worse. By using a good hash function to create addresses with a nearly uniform distribution, we avoid this problem.
Figure 12.6 The results of an overflow of bucket B in Fig. 12.4(b), represented first as a
trie, then as a complete binary tree, and finally as a directory.
12.3 Implementation
Figure 12.7 Function Hash (key) returns an integer hash value for key for a 15-bit address space.
full class definitions and method bodies for extendible hashing. The place to start our discussion of the implementation is with the functions that create the addresses, since the notion of an extendible address underlies all other extendible hashing operations.
The function Hash, given in Fig. 12.7 and in file hash.cpp of the appendix, is essentially the same folding hash function we used in Chapter 11. The only difference is that we do not conclude the operation by producing an address that falls within a fixed address space; we don't have a fixed address space, and instead we use as much of the hashed address as we need. The reason for taking the sum of the folded character values modulo 19 937 is to make sure that the character summation stays within the range of a signed 16-bit integer. For machines that use 32-bit integers, we could divide by a larger number and create an even larger initial address.
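Based on this description, the fold-and-add function of Fig. 12.7 can be sketched as follows. This is only a sketch; the exact code is in file hash.cpp, and details such as the treatment of an odd-length key are assumptions here rather than quotations from that file.

   #include <cstring>

   int Hash (char * key)
   {
      int sum = 0;
      int len = strlen(key);
      if (len % 2 == 1) len++;   // for an odd-length key, fold the trailing '\0' as well
      for (int j = 0; j < len; j += 2)
         // fold two characters at a time; the modulus keeps the sum
         // within the range of a signed 16-bit integer
         sum = (sum + 100 * key[j] + key[j+1]) % 19937;
      return sum;
   }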
Because extendible hashing uses more bits of the hashed address as they are needed to distinguish between buckets, we need a function MakeAddress that extracts just a portion of the full hashed address. We also use MakeAddress to reverse the order of the bits in the hashed address, making the lowest-order bit of the hash address the highest-order bit of the value used in extendible hashing. To see why this reversal of bit order is desirable, look at Fig. 12.8, which is a set of keys and binary hash addresses produced by our hash function. Even a quick scan of these addresses reveals that the distribution of the least significant bits of these integer values tends to have more variation than the high-order bits. This is because many of the addresses do not make use of the upper reaches of our address space; the high-order bits often turn out to be 0.
Figure 12.8 Output from the hash function for a number of keys.
By reversing the bit order, working from right to left, we take advantage of the greater variability of low-order bit values. For example, given a 4-bit address space, we want to avoid having the addresses of bill, lee, and pauline turn out to be 0000, 0000, and 0000. If we work from right to left, starting with the low-order bit in each address, we get 0011 for bill, 0001 for lee, and 1010 for pauline, which is a much more useful result.
int MakeAddress (char * key, int depth)
{
   int retval = 0;
   int hashVal = Hash(key);
   // use depth bits of the hashed value, reversing the bit order
   for (int j = 0; j < depth; j++)
   {
      retval = retval << 1;          // make room for the next bit
      int lowbit = hashVal & 1;      // take the lowest-order bit of the hashed value
      retval = retval | lowbit;
      hashVal = hashVal >> 1;
   }
   return retval;
}
records: add a key-reference pair to a bucket, search for a key and return its reference, and remove a key. Hence, we have chosen to make class Bucket a derived class of the class TextIndex from Chapter 5 and Appendix F. The definition of class Bucket is given in Fig. 12.10 and file bucket.h.
   int BucketAddr; // address in file
   friend class Directory;
   friend class BucketBuffer;
};
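Pulling together the operations that the rest of this chapter invokes on a Bucket (Insert, Remove, Split, NewRange, Redistribute, FindBuddy, TryCombine, and Combine), the declaration is roughly as follows. This is only a sketch assembled from the names used in the surrounding discussion; the exact declaration is in bucket.h.

   class Bucket: public TextIndex
   {protected:
      // only class Directory creates and manipulates buckets
      Bucket (Directory & dir, int maxKeys);
      int Insert (char * key, int recAddr);        // add the pair, splitting if full
      int Remove (char * key);                     // delete the key, then try to combine
      Bucket * Split ();                           // divide keys between this and a new bucket
      int NewRange (int & newStart, int & newEnd); // directory cells for the new bucket
      int Redistribute (Bucket & newBucket);       // move keys that belong in the new bucket
      int FindBuddy ();                            // directory index of the buddy, or -1
      int TryCombine ();                           // combine with buddy if everything fits
      int Combine (Bucket * buddy, int buddyIndex);
      Directory & Dir;     // directory that contains this bucket
      int Depth;           // number of address bits the bucket uses
      int MaxKeys;         // maximum number of keys the bucket can hold
      int BucketAddr;      // address in file
      friend class Directory;
      friend class BucketBuffer;
   };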
attached to a file for the directory and one for the buckets. Figure 12.12 shows a simple test program, tsthash.cpp, that exercises class Directory by inserting a sequence of key-reference pairs.
class Directory
{public:
   Directory (int maxBucketKeys = -1);
   ~Directory ();
   int Open (char * name);
   int Create (char * name);
   int Close ();
   int Insert (char * key, int recAddr);
   int Delete (char * key, int recAddr = -1);
   int Search (char * key); // return RecAddr for key
   ostream & Print (ostream & stream);
protected:
   int Depth; // depth of directory
   int NumCells; // number of cells, = 2**Depth
   int * BucketAddr; // array of bucket addresses
   // protected methods
   int DoubleSize (); // double the size of the directory
   int MaxBucketKeys;
   BufferFile * DirectoryFile;
   LengthFieldBuffer * DirectoryBuffer;
   Bucket * CurrentBucket; // object to hold one bucket
   BucketBuffer * theBucketBuffer; // buffer for buckets
   BufferFile * BucketFile;
   int Pack () const;
   int Unpack ();
   Bucket * PrintBucket; // object to hold one bucket for printing
   friend class Bucket;
};
main ()
{
   int result;
   Directory Dir (4);
   result = Dir . Create ("hashfile");
   if (result == 0) {return 0;} // unable to create files
   char * keys[] = {"bill", "lee", "pauline", "alan", "julie",
      "mike", "elizabeth", "mark", "ann", "peter",
      "christina", "john", "charles", "mary", "emily"};
   const int numkeys = 15;
   for (int i = 0; i < numkeys; i++)
   {
      result = Dir . Insert (keys[i], 100 + i);
      if (result == 0)
         cout << "insert for " << keys[i] << " failed" << endl;
      Dir . Print (cout);
   }
   return 1;
}
Figure 12.12 Test program tsthash.cpp inserts a sequence of key-reference pairs into
a directory.
   // and add it to the directory and the bucket file
   int result;
   char * directoryName, * bucketName;
   result = BucketFile->Create(bucketName, ios::in|ios::out);
   if (!result) return 0;
   // store the empty bucket in the BucketFile; add to Directory
   return result;
shown in Fig. 12.14. The Insert method first searches for the key. Search arranges for the CurrentBucket member to contain the proper bucket for the key. If the key is not already in the bucket, then the Bucket::Insert method is called to perform the insertion. In method Directory::Search, as in most search functions we have seen, the Find method determines where the key would be if it were in the structure. In this case, Find determines which bucket is associated with the key. As noted previously, MakeAddress finds the array index of the directory cell that contains the file address of the appropriate bucket.
Figure 12.14 Methods Insert, Search, and Find of class Directory.
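The code in Fig. 12.14 follows directly from this description. A sketch of what the three methods look like is given below; the convention that Search returns -1 when the key is absent, and the exact signatures of the helpers LoadBucket and Find, are assumptions here rather than quotations from the figure.

   int Directory::Insert (char * key, int recAddr)
   {
      int found = Search (key);              // also leaves CurrentBucket loaded
      if (found == -1)                       // key is not yet in the directory
         return CurrentBucket->Insert (key, recAddr);
      return 0;                              // key already present
   }

   int Directory::Search (char * key)
   // return the reference for key; leave CurrentBucket holding the key's bucket
   {
      int bucketAddr = Find (key);           // file address of the proper bucket
      LoadBucket (CurrentBucket, bucketAddr);
      return CurrentBucket->Search (key);
   }

   int Directory::Find (char * key)
   // return the file address of the bucket associated with key
   {
      return BucketAddr[MakeAddress (key, Depth)];
   }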
int Bucket::Insert (char * key, int recAddr)
{
   if (NumKeys < MaxKeys)
   {
      int result = TextIndex::Insert (key, recAddr); // add the pair using the base class index
      Dir.StoreBucket (this);                        // write the modified bucket back to the file
      return result;
   }
   else // bucket is full
   {
      Split ();
      return Dir.Insert (key, recAddr);
   }
}
Figure 12.15 Method Insert of class Bucket adds the key to the existing bucket if there is room. If the bucket is full, it splits it and then adds the key.
the key-reference pair to the bucket and stores the bucket in the file. A full bucket, however, requires a split, which is where things start to get interesting. After the split is done, Directory::Insert is called (recursively) to try again to insert the key-reference pair.
What we do when we split a bucket depends on the relationship between the number of address bits used in the bucket and the number used in the directory as a whole. The two numbers are often not the same. To see this, look at Fig. 12.6(a). The directory uses 3 bits to define its address space (8 cells). The keys in bucket A are distinguished from keys in other buckets by having an initial 0 bit. All the other bits in the hashed key values in bucket A can be any value; it is only the first bit that matters. Bucket A is using only 1 bit and has depth 1.
The keys in bucket C all share a common first 2 bits; they all begin with 11. The keys in buckets B and D use 3 bits to establish their identities and, therefore, their bucket locations. If you look at Fig. 12.6(c), you can see how using more or fewer address bits changes the relationship between the directory and the bucket. Buckets that do not use as many address bits as the directory have more than one directory cell pointing to them.
If we split one of the buckets that is using fewer address bits than the directory, we can use half of the directory cells to point to the new bucket after the split. Suppose, for example, that we split bucket A in Fig. 12.6(c). Before the split only 1 bit, the initial 0, is used to identify keys that belong in bucket A. After the split, keys starting with 00 (directory cells 000 and 001) go in bucket A; keys starting with 01 (directory cells 010 and 011) go in the new bucket.
Bucket * Bucket::Split ()
{// split this into two buckets, store the new bucket, and
 // return (memory) address of new bucket
   int newStart, newEnd;
   if (Depth == Dir.Depth) // no room to split this bucket
      Dir.DoubleSize();    // increase depth of directory
   Bucket * newBucket = new Bucket (Dir, MaxKeys);
   Dir.StoreBucket (newBucket); // append to file
   NewRange (newStart, newEnd); // determine directory addresses
   Dir.InsertBucket (newBucket->BucketAddr, newStart, newEnd);
   Depth ++; // increment depth of this
   newBucket->Depth = Depth;
   Redistribute (*newBucket); // move some keys into new bucket
   Dir.StoreBucket (this);
   Dir.StoreBucket (newBucket);
   return newBucket;
}

Figure 12.16 Method Split of class Bucket divides keys between an existing bucket and a new bucket. If necessary, it doubles the size of the directory to accommodate the new bucket.
Next we create the new bucket that we need for the split. Then we find the range of directory addresses that we will use for the new bucket. We attach the new bucket to the directory over this range, adjust the bucket address depth information in both buckets to reflect the use of an additional address bit, and then redistribute the keys from the original bucket across the two buckets.
The most complicated operation supporting the Split method is NewRange, which finds the range of directory cells that should point to the new bucket instead of the old one after the split. It is given in Fig. 12.17. To see how it works, return, once again, to Fig. 12.6(c). Assume that we need to split bucket A, putting some of the keys into a new bucket E. Before the split, any address beginning with a 0 leads to A. In other words, the shared address of the keys in bucket A is 0.
When we split bucket A we add another address bit to the path leading to the keys; addresses leading to bucket A now share an initial 00 while those leading to E share an initial 01. So, the range of addresses for the new bucket is all directory addresses beginning with 01.
Figure 12.17 Method NewRange of class Bucket finds the start and end directory addresses for the new bucket by using information from the old bucket.
Since the directory addresses use 3 bits, the new bucket is attached to the directory cells starting with 010 and ending with 011. Suppose, instead, that the directory used a 5-bit address. Then the range for the new bucket would start with 01000 and end with 01111. This range covers all 5-bit addresses that share 01 as the first 2 bits. The logic for finding the range of directory addresses for the new bucket, then, starts by finding the shared address bits for the new bucket. It then fills the address out with 0s until we have the number of bits used in the directory. This is the start of the range. Filling the address out with 1s produces the end of the range.
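Following this logic, NewRange can be sketched as shown below. The shared address is taken from any key already in the bucket; the exact code is in Fig. 12.17 and may differ in detail.

   int Bucket::NewRange (int & newStart, int & newEnd)
   {// find the directory cells that should point to the new bucket
      // all keys in this bucket share their first Depth bits; after the split
      // the new bucket's keys will share those bits followed by a 1
      int sharedAddr = MakeAddress (Keys[0], Depth);
      int bitsToFill = Dir.Depth - (Depth + 1);
      newStart = (sharedAddr << 1) | 1;   // the shared bits plus the new 1 bit
      newEnd = newStart;
      for (int j = 0; j < bitsToFill; j++)
      {
         newStart = newStart << 1;        // fill out with 0s: start of the range
         newEnd = (newEnd << 1) | 1;      // fill out with 1s: end of the range
      }
      return 1;
   }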
The directory operations required to support Split are easy to implement. They are given in Fig. 12.18. The first, Directory::DoubleSize, simply calculates the new directory size, allocates the required memory, and writes the information from each old directory cell into two successive cells in the new directory. It finishes by freeing the old space associated with member BucketAddr, renaming the new space as the BucketAddr, and increasing the Depth to reflect the fact that the directory is now using an additional address bit.
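A sketch of DoubleSize consistent with this description follows; the exact code is in Fig. 12.18.

   int Directory::DoubleSize ()
   {// double the size of the directory
      int newSize = 2 * NumCells;
      int * newAddrs = new int [newSize];
      for (int i = 0; i < NumCells; i++)
      {  // each old cell is copied into two successive cells of the new array
         newAddrs[2*i] = BucketAddr[i];
         newAddrs[2*i + 1] = BucketAddr[i];
      }
      delete [] BucketAddr;     // free the old space
      BucketAddr = newAddrs;    // the new space becomes the BucketAddr array
      NumCells = newSize;
      Depth ++;                 // the directory now uses one more address bit
      return 1;
   }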
Method InsertBucket, used to attach a bucket address across a range of directory cells, is simply a loop that works through the cells to make the change.
https://hemanthrajhemu.github.io
542 Chapter 12 Extendible Hashing
int Directory::InsertBucket (int bucketAddr, int first, int last)
{
   for (int i = first; i <= last; i++)  // work through the range of cells
      BucketAddr[i] = bucketAddr;
   return 1;
}
12.4 Deletion
When we delete a key, we need a way to see if we can decrease the size of the file system by combining buckets and, if possible, decreasing the size of the directory. The only buckets that can be combined are buddy buckets, the pairs of buckets that result from a split.
Look again at the trie in Fig. 12.6(b). Which buckets could be combined?
If the directory is down to a single bucket, there cannot be a buddy. The next test compares the number of bits used by the bucket with the number of bits used in the directory address space. A pair of buddy buckets is a set of buckets that are immediate descendants of the same node in the trie. They are, in fact, pairwise siblings resulting from a split. Going back to Fig. 12.6(b), we see that asking whether the bucket uses all the address bits in the directory is another way of asking whether the bucket is at the lowest level of the trie. It is only when a bucket is at the outer edge of the trie that it can have a single parent and a single buddy.
Once we determine that there is a buddy bucket, we need to find its address.
Figure 12.19 Method FindBuddy of class Bucket returns a buddy bucket or -1 if none is found.
First we find the address used to find the bucket we have at hand; this is the shared address of the keys in the bucket. Since we know that the buddy bucket is the other bucket that was formed from a split, we know that the buddy has the same address in all regards except for the last bit. Once again, this relationship is illustrated by buckets B and D in Fig. 12.6(b). So, to get the buddy address, we flip the last bit with an exclusive or. We return the directory address of the buddy bucket.
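Putting these tests together gives a sketch of FindBuddy; the exact code is in Fig. 12.19.

   int Bucket::FindBuddy ()
   {// return the directory index of this bucket's buddy, or -1 if there is none
      if (Dir.Depth == 0) return -1;      // a single-bucket directory has no buddies
      if (Depth < Dir.Depth) return -1;   // bucket must use all the address bits in the directory
      int sharedAddr = MakeAddress (Keys[0], Depth); // shared address of the keys in this bucket
      return sharedAddr ^ 1;              // flip the last bit to get the buddy's address
   }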
The Collapse method of class Directory begins by making sure that we are not at the lower limit of directory size. By treating the special case of a directory with a single cell here, at the start of the function, we simplify subsequent processing: with the exception of this case, every directory size is a multiple of 2, so the directory cells can be examined in pairs.
Figure 12.20 Method Collapse of class Directory reduces the size of the directory, if possible.
As soon as we find a pair of cells that refer to different buckets, we know that we cannot collapse the directory and the method returns. If we get all the way through the directory without encountering such a pair, then we can collapse the directory.
The collapsing operation consists of allocating space for a new array of bucket addresses that is half the size of the original and then copying the bucket references shared by each cell pair to a single cell in the new directory.
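A sketch of Collapse along these lines is shown below; the exact code is in Fig. 12.20.

   int Directory::Collapse ()
   {// halve the directory if every pair of cells refers to the same bucket
      if (Depth == 0) return 0;              // already at the lower limit of directory size
      for (int i = 0; i < NumCells; i += 2)
         if (BucketAddr[i] != BucketAddr[i+1])
            return 0;                        // a pair differs: cannot collapse
      int newSize = NumCells / 2;
      int * newAddrs = new int [newSize];
      for (int j = 0; j < newSize; j++)
         newAddrs[j] = BucketAddr[2*j];      // shared reference of each cell pair
      delete [] BucketAddr;
      BucketAddr = newAddrs;
      NumCells = newSize;
      Depth --;                              // the directory uses one less address bit
      return 1;
   }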
Method Directory::Remove first searches for the key; if it is not found, we return failure; if we find it, we call Bucket::Remove to remove the key from the bucket. We return the value reported back from that method. Figure 12.21 gives the implementation of these two methods.
The first step in Bucket::Remove, removing the key from the bucket, is accomplished through the call to TextIndex::Remove, the base class Remove method. The second step is to see whether the now-smaller bucket can be combined with its buddy.
   Bucket * buddyBucket = new Bucket (Dir, MaxKeys);
   Dir . LoadBucket (buddyBucket, buddyAddr);
   return 1;
}

int Bucket::Combine (Bucket * buddy, int buddyIndex)
{// collapse this bucket and its buddy into a single bucket
 // (signature reconstructed from the way buddy and buddyIndex are used below)
   int result;
   // move keys from buddy to this
   for (int i = 0; i < buddy->NumKeys; i++)
   {// insert the key of the buddy into this
      result = Insert (buddy->Keys[i], buddy->RecAddrs[i]);
      if (!result) return 0; // this should not happen
   }
   Depth --; // reduce the depth of the bucket
   Dir . RemoveBucket (buddyIndex, Depth); // delete buddy bucket
   return 1;
}
The Bucket::Remove method deletes the key, then passes the bucket on to Bucket::TryCombine to see if the smaller size of the bucket makes a combination possible. If the total number of keys in the bucket and its buddy is less than or equal to the size of a single bucket, we combine the buckets.
ing performance. Both the analysis and simulation show that the space utilization is strongly periodic, fluctuating between values of 0.53 and 0.94. The analysis portion of their paper suggests that for a given number of records r and a bucket size of b, the expected number of buckets N is approximated by

   N ≈ r / (b ln 2)

It follows that the average space utilization, r/(bN), is approximately ln 2, or about 69 percent.
space utilization. It turns out that if we have keys with randomly distributed addresses, the buckets in the extendible hashing table tend to fill up at about the same time and therefore tend to split at the same time. This explains the large fluctuations in space utilization. As the buckets fill up, space utilization can reach past 90 percent. This is followed by a concentrated series of splits that reduce the utilization to below 50 percent. As these now nearly half-full buckets fill up again, the cycle repeats itself.
Flajolet (1983) addressed this question in a lengthy, carefully developed paper that produces a number of different ways to estimate the directory size. Table 12.1, which is taken from Flajolet's paper, shows the expected value for the directory size for different numbers of keys and different bucket sizes. Flajolet also provides a formula for making rough estimates of the directory size for values that are not in this table. He notes that this formula tends to overestimate directory size by a factor of 2 to 4.
Table 12.1 Expected directory size for a given bucket size b (5, 10, 20, 50, 100, or 200) and total number of records r. 1K = 10^3, 1M = 10^6. From Flajolet, 1983.
buckets are addressed through a forest of tries that have been seeded out of the original static address space.
Let's look at an example. Figure 12.23(a) shows an initial address space of four and four buckets descending from the four addresses in the directory. In Fig. 12.23(b) we have split the bucket at address 4. We address the two buckets resulting from the split as 40 and 41. We change the shape of the directory node at address 4 from a square (an external node) to a circle (an internal node). In Fig. 12.23(c) we have split the bucket addressed by node 2, creating the new external nodes 20 and 21. We also split the bucket addressed by 41, extending the trie downward to include 410 and 411. Because the directory node 41 is now an internal node rather than an external one, it changes from a square to a circle. As we continue to add keys and split buckets, these directory tries continue to grow.
Finding a key in a dynamic hashing scheme can involve the use of two hash functions rather than just one. First, there is the hash function that covers the original address space. If you find that the directory node is an external node and therefore points to a bucket, the search is complete. However, if the directory node is an internal node, then you need additional address information to guide you through the 1s and 0s that form the trie. Larson suggests using a second hash function on the key and using the result of this hashing as the seed for a random-number generator that produces a sequence of 1s and 0s for the key. This sequence describes the path through the trie.
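The search just described might be sketched as follows. The node structure, both hash functions, and the use of a library generator for the 0/1 sequence are illustrative assumptions here rather than details taken from Larson's paper.

   #include <random>

   unsigned HashOne (const char * key);   // covers the original, static address space
   unsigned HashTwo (const char * key);   // seeds the 0/1 sequence that guides trie traversal

   struct DirNode
   {
      bool external;        // external nodes reference a bucket on disk
      int bucketAddr;       // valid only when external is true
      DirNode * child[2];   // valid only for internal nodes
   };

   int FindBucket (const char * key, DirNode * forest[], int originalCells)
   {
      DirNode * node = forest[HashOne(key) % originalCells]; // first hash: original space
      std::mt19937 bits (HashTwo(key));     // second hash seeds the random-number generator
      while (!node->external)
         node = node->child[bits() & 1];    // one generated bit per level of the trie
      return node->bucketAddr;              // the bucket where the search for key concludes
   }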
Figure 12.23 The directory of dynamic hashing grows as a forest of tries seeded out of the original address space: (a) the original four addresses, each referencing a bucket; (b) after splitting the bucket at address 4 into buckets 40 and 41; (c) after further splits create the external nodes 20, 21, 410, and 411.
that while both schemes extend the hash function locally, as a binary trie, extendible hashing flattens the trie into a directory array whereas dynamic hashing keeps it as a linked structure. Because each node in that linked structure must hold pointers to children, the size of a node in dynamic hashing is larger than a directory cell in extendible hashing, probably by at least a factor of 2. So, the directory for dynamic hashing will usually require more space in memory. Moreover, if the directory becomes so large that it requires use of virtual memory, extendible hashing offers the advantage of being able to access the directory with no more than a single page fault. Since dynamic hashing uses a linked structure for the directory, it may be necessary to incur more than one page fault to move through the directory.
more than one directory node can point to the same bucket. However, the directory adds an additional layer of indirection which, if the directory must be stored on disk, can result in an additional seek. Linear hashing, introduced by Litwin (1980), does away with the directory; a readable survey of this and related methods is Enbody and Du (1988).
Linear hashing adds new buckets one at a time, in sequence, as the address space grows. The example begins (Fig. 12.24(a)) with an address space of four, which means that we are using an address function that produces addresses with two bits of depth. In terms of the operations we developed earlier in this chapter, this address function is equivalent to MakeAddress(key, 2).
Figure 12.24 Growth of the address space in linear hashing. Starting from the four original buckets a, b, c, and d, new buckets A, B, C, and D are added one at a time as the address space is extended, until the original four two-bit addresses have become the eight three-bit addresses 000 through 111.
   else
      address = h_{d+1}(k);
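The fragment above is the second half of the usual linear hashing address computation, which also involves a pointer p to the next bucket to be split. A sketch of the whole computation, written in terms of a d-bit hash function h(k, d) such as MakeAddress, looks like this; it is the standard formulation rather than a quotation from the text, and the function names are illustrative.

   int h (char * key, int depth);   // a depth-bit hash function, e.g. MakeAddress

   // p is the address of the next bucket to be split; buckets below p have
   // already been split this round, so their records need one more address bit
   int LinearHashAddress (char * k, int d, int p)
   {
      int address;
      if (h(k, d) >= p)
         address = h(k, d);       // bucket not yet split in this round
      else
         address = h(k, d + 1);   // bucket already split: use d+1 bits
      return address;
   }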
Because we extend the address space through splitting every time there is overflow, the overflow chains do not become very large. Given a bucket size of 50, the average number of disk accesses per search approaches very close to one.
Litwin (1980) suggests using the overall load factor of the file as an alternative triggering event. Suppose we let the buckets overflow until the space utilization reaches some desired figure, such as 75 percent. Every time the utilization exceeds that figure, we split a bucket and extend the address space. Litwin simulated this kind of system and found that for load factors of 75 percent and even 85 percent, the average number of accesses for successful and unsuccessful searches still stays below 2.
We can also use overflow buckets to defer splitting and increase space utilization for dynamic hashing and extendible hashing. For these methods, which use directories to the buckets, deferring splitting has the additional attraction of keeping the directory size down. For extendible hashing it is particularly advantageous to chain to an overflow bucket and therefore avoid a split when the split would cause the directory to double in size. Consider the example that we used early in this chapter, where we split the bucket B in Fig. 12.4(b), producing the expanded directory and bucket structure shown in Fig. 12.6(c). If we had allowed bucket B to overflow instead, we could have retained the smaller directory. Depending on how much space we allocated for the overflow buckets, we might also have been able to keep the overflow chains short. Larson considered deferred splitting in his original paper on dynamic hashing but found the results of some
SUMMARY
Conventional, static hashing does not adapt well to file structures that are dynamic, that grow and shrink over time. Extendible hashing is one of several hashing systems that allow the address space for hashing to grow and shrink along with the file. Because the size of the address space can grow as the file grows, it is possible for extendible hashing to provide hashed access without the need for overflow handling, even as files grow many times beyond their original expected size.
The key to extendible hashing is using more bits of the hashed value as we need to cover more address space. The model for extending the use of the hashed value is the trie: every time we use another bit of the hashed value, we have added another level to the depth of a trie with a radix of 2.
In extendible hashing we fill out all the leaves of the trie until we have a perfect tree, then we collapse that tree into a one-dimensional array. The array forms a directory to the buckets, kept on disk, that hold the keys and records. The directory is managed in memory, if possible.
When a bucket fills, we split the bucket. We use 1 additional bit from the hash values for the keys in the bucket to divide the keys between the old bucket and the new one. If the address space represented in the directory can cover the use of this new bit, no more changes are necessary. If, however, the address space is using fewer bits than are needed by our splitting buckets, then we double the size of the directory. When we delete keys, we can combine the records for two buckets only if they are buddy buckets, which is to say that they are the pair of buckets that resulted from a split.
Access performance for extendible hashing is a single seek if the directory can be kept in memory. If the directory must be paged off to disk, worst-case performance is two seeks. Space utilization for the buckets is approximately 69 percent. Tables and an approximation formula developed by Flajolet (1983) permit estimation of the probable directory size, given a bucket size and total number of records.
There are a number of other approaches to the problem solved by extendible hashing. Dynamic hashing uses a very similar approach but expresses the directory as a linked structure rather than as an array. The linked structure is more cumbersome but grows more smoothly. Space utilization and seek performance for dynamic hashing are the same as for extendible hashing.
Linear hashing does away with the directory entirely, extending the address space by adding new buckets in a linear sequence. Although the overflow of a bucket can be used to trigger extension of the address space in linear hashing, typically the bucket that overflows is not the one that is split and extended. Consequently, linear hashing implies maintaining overflow chains and a consequent degradation in seek performance. The degradation is slight, since the chains typically do not grow to be very long before they are pulled into a new bucket. Space utilization is about 60 percent.
Space utilization for extendible, dynamic, and linear hashing can be improved by postponing the splitting of buckets. This is easy to implement for linear hashing, since there are already overflow buckets. Using deferred splitting, it is possible to increase space utilization for any of the hashing schemes described here to 80 percent or better while still maintaining search performance averaging less than two seeks. Overflow handling for these approaches can use the sharing of overflow buckets.
KEY TERMS
buddy buckets. The pair of buckets that results from a split. When keys are deleted, only buddy buckets can be combined. If the shared address of a bucket is y, the address of its buddy is z = y XOR 1.
extendible hashing. The approach to hashing, proposed by Fagin, Nievergelt, Pippenger, and Strong (1979), that is the subject of this chapter. Their proposal is for a system that uses a directory to represent the address space. Access to buckets containing the records is through the directory. The directory is handled as an array; the size of the array changes as the number of buckets changes.
linear hashing. Unlike extendible hashing and dynamic hashing, linear hashing does not use a directory. Instead, the address space is extended one bucket at a time as buckets overflow.
FURTHER READINGS
For information about hashing for dynamic files that goes beyond what we present here, you must turn to journal articles. The best summary of the different approaches is Enbody and Du's Computing Surveys article titled "Dynamic Hashing Schemes," which appeared in 1988.
The original paper on extendible hashing is "Extendible Hashing—A Fast Access Method for Dynamic Files" by Fagin, Nievergelt, Pippenger, and Strong (1979). Larson (1978) introduces dynamic hashing in an article titled "Dynamic Hashing." Litwin's initial paper on linear hashing is titled "Linear Hashing: A New Tool for File and Table Addressing" (1980). All three of these introductory articles are quite readable; Larson's paper and Fagin, Nievergelt, Pippenger, and Strong are especially recommended.
Michel Scholl's 1981 paper titled "New File Organizations Based on Dynamic Hashing" provides another readable introduction to dynamic hashing. It also investigates implementations that defer splitting by allowing buckets to overflow. Analyses of dynamic hashing and extendible hashing often derive results that apply to either of the two methods. Flajolet (1983) is the source of the directory-size estimates and the approximation formula discussed in this chapter.
EXERCISES
1. Briefly describe the differences between extendible hashing, dynamic hashing, and linear hashing. What are the strengths and weaknesses of each approach?
2. The tries that are the basis for the extendible hashing procedure described in this chapter have a radix of 2. How does performance change if we use a larger radix?
3. In the MakeAddress function, what would happen if we did not reverse the order of the bits but just extracted the required number of low-order bits in the same left-to-right order that they occur in the address? Think about the way the directory location would change as we extend the implicit trie structure to use yet another bit.
4. If the language that you are using to implement the MakeAddress function does not support bit shifting and masking operations, how could you achieve the same ends, even if less elegantly and clearly?
5. When a bucket splits, its keys are redistributed between the original bucket and a new one. How do you decide whether a key belongs in the new bucket or the original one?
6. A bucket split does not always result in moving any keys into the new bucket. Under what conditions can this happen?
13. In section 12.6.3 we described an approach to linear hashing that controls splitting. For a load factor of 85 percent, the average number
14. Because linear hashing splits one bucket at a time, in order, until it has reached the end of the sequence, the overflow chains for the last buckets in the sequence can become much longer than those for the earlier buckets in the sequence.
PROGRAMMING EXERCISES
16. Write a version of the MakeAddress function that prints out the input key, the hash value, and the extracted, reversed address. Build a driver that allows you to enter keys interactively for this function and see the results. Study the operation of the function on different keys.
18. Implement method Directory::Delete. Write a driver program to verify that your implementation is correct. Experiment with the program to see how deletion works. Try deleting all the keys. Try to create situations in which the directory will recursively collapse over more than one level.
19. Design and implement a class HashedFile patterned after class TextIndexedFile of Chapter 7 and Appendix G. A HashedFile object is a data file and an extendible hash directory. The class should have methods Create, Open, Close, Read (read record that matches key), Append, and Update.
PROGRAMMING PROJECT
This is the last part of the programming project. We create a hashed index of the student record files and the course registration files from the earlier parts of the project.
20. Use class HashedFile to create a hashed index of a student record file with student identifier as key. Note that the student identifier field is not unique in a student registration file. Write a driver program to create a hashed file from an existing student record file.
21. Use class HashedFile to create a hashed index of a course registration record file with student identifier as key. Write a driver program to create a hashed file from an existing course registration record file.
22. Write a program that opens a hashed student file and a hashed course registration file and retrieves information on demand. Prompt the user for a student identifier and print all objects that match it.