Linear Hashing: Historical Background

Linear Hashing

Donghui Zhang¹, Yannis Manolopoulos², Yannis Theodoridis³, and Vassilis J. Tsotras⁴
¹ Paradigm4, Inc., Waltham, MA, USA
² Aristotle University, Thessaloniki, Greece
³ University of Piraeus, Piraeus, Greece
⁴ University of California-Riverside, Riverside, CA, USA

Definition

Linear Hashing is a dynamically updateable disk-based index structure which implements a hashing scheme and which grows or shrinks one bucket at a time. The index is used to support exact match queries, i.e., find the record with a given key. Compared with the B+-tree index, which also supports exact match queries (in a logarithmic number of I/Os), Linear Hashing has a better expected query cost of O(1) I/Os. Compared with Extendible Hashing, Linear Hashing does not use a bucket directory, and when an overflow occurs, it is not always the overflown bucket that is split. The name Linear Hashing is used because the number of buckets grows or shrinks in a linear fashion. Overflows are handled by creating a chain of pages under the overflown bucket. The hashing function changes dynamically, and at any given instant at most two hashing functions are used by the scheme.

Historical Background

A hash table is an in-memory data structure that associates keys with values. The primary operation it supports efficiently is a lookup: given a key, find the corresponding value. It works by transforming the key using a hash function into a hash, a number that is used as an index in an array to locate the bucket where the value should be. Multiple keys may be hashed to the same bucket, and all keys in a bucket should be searched upon a query. Hash tables are often used to implement associative arrays, sets and caches. Like arrays, hash tables have O(1) lookup cost on average.

Foundations

The Linear Hashing scheme was introduced by [2].

Initial Layout

The Linear Hashing scheme has m initial buckets labeled 0 through m - 1, an initial hashing function h_0(k) = f(k) % m that is used to map any key k into one of the m buckets (for simplicity assume h_0(k) = k % m), and a pointer p which points to the bucket to be split next whenever an overflow page is generated (initially p = 0). An example is shown in Fig. 1.

© Springer Science+Business Media LLC 2017


L. Liu, M.T. Özsu (eds.), Encyclopedia of Database Systems,
DOI 10.1007/978-1-4899-7993-3_742-2
Linear Hashing, Fig. 1  An initial Linear Hashing. Here m = 4, p = 0, h_0(k) = k % 4

Linear Hashing, Fig. 2  The Linear Hashing after inserting 11 into Fig. 1. Here p = 1, h_0(k) = k % 4, h_1(k) = k % 8

Bucket Split

When the first overflow occurs (it can occur in any bucket), bucket 0, which is pointed to by p, is split (rehashed) into two buckets: the original bucket 0 and a new bucket m. A new empty page is also added to the overflown bucket to accommodate the overflow. The search values originally mapped into bucket 0 (using function h_0) are now distributed between buckets 0 and m using a new hashing function h_1.

As an example, Fig. 2 shows the layout of the Linear Hashing of Fig. 1 after inserting a new record with key 11. The circled records are the existing records that are moved to the new bucket. In more detail, the record is inserted into bucket 11 % 4 = 3. The bucket overflows and an overflow page is introduced to accommodate the new record. Bucket 0 is split and the records originally in bucket 0 are distributed between bucket 0 and bucket 4, using a new hash function h_1(k) = k % 8.

The next bucket overflow, such as one triggered by inserting two records in bucket 2 or four records in bucket 3 in Fig. 2, will cause a new split that attaches a new bucket m + 1, and the contents of bucket 1 will be distributed using h_1 between buckets 1 and m + 1. A crucial property of h_1 is that search values that were originally mapped by h_0 to some bucket j must be remapped either to bucket j or to bucket j + m. This is a necessary property for Linear Hashing to work. An example of such a hashing function is h_1(k) = k % 2m.

Further bucket overflows will cause additional bucket splits in a linear bucket-number order (increasing p by one for every split).

Round and Hash Function Advancement

After enough overflows, all original m buckets will be split. This marks the end of splitting-round 0. During round 0, p went successively from bucket 0 to bucket m - 1. At the end of round 0 the Linear Hashing scheme has a total of 2m buckets. Hashing function h_0 is no longer needed, as all 2m buckets can be addressed by hashing function h_1. Variable p is reset to 0 and a new round, namely splitting-round 1, starts. A new hash function h_2 will start to be used.

In general, the Linear Hashing scheme involves a family of hash functions h_0, h_1, h_2, and so on. Let the initial function be h_0(k) = f(k) % m; then any later hash function is h_i(k) = f(k) % (2^i m). This way, it is guaranteed that if h_i hashes a key to bucket j ∈ [0..2^i m - 1], h_{i+1} will hash the same key to either bucket j or bucket j + 2^i m. At any time, two hash functions h_i and h_{i+1} are used.

Figures 3 and 4 illustrate the cases at the end of splitting-round 0 and at the beginning of splitting-round 1. In general, in splitting round i, the hash functions h_i and h_{i+1} are used. At the beginning of round i, p = 0 and there are 2^i m buckets. When all of these buckets are split, splitting round i + 1 starts: p goes back to 0, the number of buckets becomes 2^{i+1} m, and hash functions h_{i+1} and h_{i+2} start to be used.
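
The split and round-advancement rules can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not the implementation from [2]: the class name `LinearHashFile`, its method names, and the `capacity` parameter (records per page) are hypothetical, Python lists stand in for disk pages and overflow chains, keys are integers, and f(k) = k. Splits are uncontrolled: one split is performed each time an insertion would start a new overflow page. The `address` method uses h_{i+1} for buckets that have already been split in the current round and h_i otherwise.

```python
class LinearHashFile:
    """Toy in-memory model of Linear Hashing with uncontrolled splits."""

    def __init__(self, m=4, capacity=4):
        self.m = m                # number of initial buckets
        self.capacity = capacity  # records per page (assumed parameter)
        self.i = 0                # current splitting round
        self.p = 0                # bucket to be split next
        self.buckets = [[] for _ in range(m)]

    def h(self, level, k):
        # h_level(k) = k % (2^level * m), assuming f(k) = k
        return k % ((2 ** level) * self.m)

    def address(self, k):
        j = self.h(self.i, k)
        if j < self.p:            # bucket j was already split in this round
            j = self.h(self.i + 1, k)
        return j

    def insert(self, k):
        bucket = self.buckets[self.address(k)]
        bucket.append(k)
        n = len(bucket)
        # A new overflow page is needed when the record does not fit
        # into the pages the bucket already has.
        if n > self.capacity and (n - 1) % self.capacity == 0:
            self.split()

    def split(self):
        # Split bucket p, which need not be the bucket that overflowed.
        keep, move = [], []
        for k in self.buckets[self.p]:
            if self.h(self.i + 1, k) == self.p:
                keep.append(k)
            else:
                move.append(k)    # goes to the split image p + 2^i * m
        self.buckets[self.p] = keep
        self.buckets.append(move) # new bucket has index 2^i * m + p
        self.p += 1
        if self.p == (2 ** self.i) * self.m:
            # Every bucket of round i has been split: start round i + 1.
            self.i += 1
            self.p = 0
```

With m = 4 and a capacity of 4, filling bucket 3 and then inserting key 11 splits bucket 0 rather than bucket 3 and advances p to 1. In that state `address(5)` returns 1 and `address(4)` returns 4.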
Linear Hashing, Fig. 3  The Linear Hashing at the end of round 0. Here p = 3, h_0(k) = k % m, h_1(k) = k % 2^1 m

Linear Hashing, Fig. 4  The Linear Hashing at the beginning of round 1. Here p = 0, h_1(k) = k % 2^1 m, h_2(k) = k % 2^2 m

Component Summary and Search Scheme

In summary, at any time a Linear Hashing scheme has the following components:

• A value i which indicates the current splitting round.
• A variable p ∈ [0..2^i m - 1] which indicates the bucket to be split next.
• A total of 2^i m + p buckets, each of which consists of a primary page and possibly some overflow pages.
• Two hash functions h_i and h_{i+1}.

A search scheme is needed to map a key k to a bucket, either when searching for an existing record or when inserting a new record. The search scheme works as follows:

1. If h_i(k) ≥ p, choose bucket h_i(k), since the bucket has not been split yet in the current round.
2. If h_i(k) < p, choose bucket h_{i+1}(k), which can be either h_i(k) or its split image h_i(k) + 2^i m.

For example, in Fig. 2, p = 1. To search for record 5, since h_0(5) = 1 ≥ p, one directly goes to bucket 1 to find the record. But to search for record 4, since h_0(4) = 0 < p, one needs to use h_1 to decide the actual bucket. In this case, the record should be searched for in bucket h_1(4) = 4.

Variations

A split performed whenever a bucket overflow occurs is an uncontrolled split. Let l denote the Linear Hashing scheme’s load factor, i.e., l = S/b where S is the total number of records and b is the number of buckets used. The load factor achieved by uncontrolled splits is usually between 50% and 70%, depending on the page size and the search value distribution [2]. In practice, higher storage utilization is achieved if a split is triggered not by an overflow, but when the load factor l becomes greater than some upper threshold. This is called a controlled split and can typically achieve 95% utilization. Other controlled schemes exist where a split is delayed until both the threshold condition holds and an overflow occurs.

Deletions will cause the hashing scheme to shrink. Buckets that have been split can be recombined if the load factor falls below some lower threshold. Then two buckets are merged together; this operation is the reverse of splitting and occurs in reverse linear order. Practical values for the lower and upper thresholds are 0.7 and 0.9, respectively.
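
The controlled split and merge policy can be sketched as a small decision function. The sketch below is illustrative only. The names `load_factor` and `resize_action` and the parameters `capacity` (records per page) and `initial_buckets` (the value m, below which the file is assumed never to shrink) are hypothetical. The load factor is also normalised by page capacity here, l = S / (b * capacity), which is an assumption made so that it can be compared with fractional thresholds such as 0.7 and 0.9.

```python
def load_factor(num_records, num_buckets, capacity):
    """Fraction of primary-page slots in use (assumed normalisation)."""
    return num_records / (num_buckets * capacity)


def resize_action(num_records, num_buckets, capacity,
                  initial_buckets=1, lower=0.7, upper=0.9):
    """Return 'split', 'merge', or 'none' for a controlled scheme."""
    l = load_factor(num_records, num_buckets, capacity)
    if l > upper:
        return "split"   # split the bucket pointed to by p
    if l < lower and num_buckets > initial_buckets:
        return "merge"   # undo the most recent split (reverse linear order)
    return "none"
```

A caller would evaluate `resize_action` after each insertion or deletion and perform at most one split or merge in response, so the file still grows or shrinks one bucket at a time.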
Linear Hashing has been further investigated in an effort to design more efficient variations. In [3] a performance comparison study of four Linear Hashing variations is reported.

Key Applications

Linear Hashing has been implemented in commercial database systems. It is used in applications where exact match query is the most important query, such as hash join [4]. It has been adopted in the Icon language [1].

Cross-References

Extendible Hashing
Hashing
Hash-based Indexing

Recommended Reading

1. Griswold WG, Townsend GM. The design and implementation of dynamic hashing for sets and tables in Icon. Softw Pract Ex. 1993;23(4):351–67.
2. Litwin W. Linear hashing: a new tool for file and table addressing. In: Proceedings of the Sixth International Conference on Very Large Databases; 1980. p. 212–23.
3. Manolopoulos Y, Lorentzos N. Performance of linear hashing schemes for primary key retrieval. Inf Syst. 1994;19(5):433–46.
4. Schneider DA, DeWitt DJ. Tradeoffs in processing complex join queries via hashing in multiprocessor database machines. In: Proceedings of the 16th International Conference on Very Large Databases; 1990. p. 469–80.