Hashing on the Disk
Keys are stored in disk pages ("buckets")
Retrieval:
find address of page
bring page into main memory
searching within the page comes for free
E.G.M. Petrakis
Hashing
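[Figure: a hash function maps the key space to the data pages 0, 1, 2, ..., m-1]
Space utilization: $u = \frac{\#\,\text{stored records}}{\#\,\text{pages} \cdot b}$, where b is the page capacity in records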
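A minimal sketch of the mapping from keys to pages and of the utilization u. It assumes integer keys, pages modelled as in-memory lists, and a hypothetical hash function key mod M; the values of M and B are chosen only for illustration, and overflow handling is left out.

```python
# Sketch only: Python lists stand in for disk pages.
M = 5   # number of pages (assumed value)
B = 4   # page capacity in records (assumed value)

pages = [[] for _ in range(M)]

def h(key: int) -> int:
    """Assumed hash function: maps a key to a page address in 0..M-1."""
    return key % M

def insert(key: int) -> None:
    # Collisions beyond the page capacity are not handled in this sketch.
    pages[h(key)].append(key)

def lookup(key: int) -> bool:
    page = pages[h(key)]   # find the page address and fetch the page
    return key in page     # searching within the fetched page is free

def utilization() -> float:
    """u = (# stored records) / (# pages * b)."""
    stored = sum(len(p) for p in pages)
    return stored / (M * B)

for k in (16, 711, 613, 303, 4, 319):
    insert(k)
```

With six stored records in 5 pages of capacity 4, `utilization()` evaluates 6 / 20.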
Collisions
Keys that hash to the same address
[Figure: a full data page with an overflow page attached]
Access Time
Goal: find key in one disk access
Access time is proportional to the number of disk accesses
Large u: good space utilization, but more overflows and therefore more disk accesses
Categories of Methods
Static: require file reorganization
open addressing, separate chaining
Dynamic: the file grows or shrinks without total reorganization
dynamic hashing, extendible hashing, linear hashing, spiral storage
Typically 1-3 disk accesses to access a key
Access time and u are a typical tradeoff
u between 50-100% (typically 69%)
Complicated implementation
[Figure: address space and data space]
(Larson 1980)
Spiral Storage (Martin 1979)
Hash Functions
Support for a shrinking or growing file
shrinking or growing address space: the hash function must adapt
$h_i$ uses one more bit of the key than $h_{i-1}$:
$h_i(key) = h_{i-1}(key)$ or $h_i(key) = h_{i-1}(key) + 2^{i-1}$
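A small sketch of such a family of hash functions, assuming integer keys and that h_i takes the i low-order bits of the key (taking the leading bits instead would be equally valid). The function name `h` and the sample key are illustrative only.

```python
def h(i: int, key: int) -> int:
    """h_i: the i low-order bits of the key, i.e. key mod 2**i."""
    return key & ((1 << i) - 1)

key = 0b101101  # sample key (45)
for i in range(1, 7):
    prev, cur = h(i - 1, key), h(i, key)
    # The extra bit is either 0 (address unchanged) or 1 (address + 2**(i-1)).
    assert cur in (prev, prev + 2 ** (i - 1))
```

Growing the address space by one bit therefore either leaves a key where it is or moves it exactly 2^(i-1) addresses further on.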
[Figure: two-level index. h1(k) selects an entry (1, 2, 3, 4, ...) of the 1st level; h2(k) selects the path through the 2nd level down to the data pages]
Index
Fixed (static): h1(key) = key mod m
Dynamic behavior on secondary index
h2(key) uses i bits of key
the bit sequence of h2 is $b_{i-1} \dots b_2 b_1 b_0$
[Figure: 1st-level index with entries 0-5 selected by h1(k); 2nd-level binary trees selected by h2(k) point to data pages of capacity b. Pages are labeled h1=1, h2=0; h1=1, h2=01; h1=1, h2=11; h1=5, h2=any]
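The two-level addressing can be sketched as follows. This is only an illustration under stated assumptions: integer keys, an index size M = 6 (matching the entries 0-5 in the figure), and h2 bits taken from key // M so that they are independent of h1. The slides only say that h2 uses i bits of the key, so that choice is an assumption of the sketch.

```python
M = 6  # size of the fixed 1st-level index (assumed value)

def h1(key: int) -> int:
    """Static 1st-level address: key mod M."""
    return key % M

def h2_bits(key: int, i: int) -> str:
    """The i bits b_{i-1} ... b_1 b_0 used at the 2nd level, as a string.

    Assumption: the bits come from key // M, not from the key itself.
    """
    if i == 0:
        return ""  # "h2 = any": no bits are used yet
    return format((key // M) & ((1 << i) - 1), f"0{i}b")
```

For example, a key with `h1(key) == 3` and `h2_bits(key, 2) == "11"` lives in the page labeled h1=3, h2=11, provided that page has been split twice.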
Insertions
Initially: fixed-size primary index and no data
[Figure: insertion sequence with a 4-entry primary index (entries 0-3) and data pages of capacity b in storage. A page starts as "h2=any" under its h1 entry and splits on overflow: h1=1, h2=any splits into h1=1, h2=0 and h1=1, h2=1; h1=0, h2=any splits into h1=0, h2=0 and h1=0, h2=1; h1=0, h2=1 splits further into h1=0, h2=01 and h1=0, h2=11; h1=3, h2=any splits into h1=3, h2=0 and h1=3, h2=1]
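The insertion sequence can be sketched in code. This is a simplified illustration, not the original algorithm: it assumes distinct integer keys, a page capacity B = 2, a primary index of M = 4 entries, and h2 bits taken from key // M. The class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

M = 4   # fixed primary index size (assumed, entries 0-3)
B = 2   # page capacity in records (assumed)

@dataclass
class Page:
    keys: List[int] = field(default_factory=list)

@dataclass
class Node:  # internal node of a 2nd-level binary tree
    zero: Union["Node", Page]
    one: Union["Node", Page]

# Initially a fixed-size primary index and no data.
index: List[Optional[Union[Node, Page]]] = [None] * M

def h1(key: int) -> int:
    return key % M

def bit(key: int, depth: int) -> int:
    """Bit b_depth of h2 (assumption: bits are taken from key // M)."""
    return ((key // M) >> depth) & 1

def _insert(node, key, depth):
    if isinstance(node, Node):           # descend one level of the tree
        if bit(key, depth):
            node.one = _insert(node.one, key, depth + 1)
        else:
            node.zero = _insert(node.zero, key, depth + 1)
        return node
    if len(node.keys) < B:               # room in the page
        node.keys.append(key)
        return node
    # Overflow: split the page on the next bit of h2, then retry.
    new = Node(Page(), Page())
    for k in node.keys:
        (new.one if bit(k, depth) else new.zero).keys.append(k)
    return _insert(new, key, depth)

def insert(key: int) -> None:
    slot = h1(key)
    if index[slot] is None:
        index[slot] = Page()             # page "h1=slot, h2=any"
    index[slot] = _insert(index[slot], key, 0)

def lookup(key: int) -> bool:
    node, depth = index[h1(key)], 0
    while isinstance(node, Node):
        node = node.one if bit(key, depth) else node.zero
        depth += 1
    return node is not None and key in node.keys

for k in (1, 5, 9, 13):
    insert(k)
```

All four sample keys hash to entry 1 of the primary index, so the third insertion overflows the page and splits it on bit b_0 of h2. Duplicate keys would make a split loop forever, which is why the sketch assumes distinct keys.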
Deletions
Find record to be deleted using h1, h2
Delete record
Check sibling page: merge the two pages if their records fit in one page
[Figure: deletion of a record followed by merging of two sibling pages]
[Figure: dynamic hashing compared with dynamic hashing with all binary trees at the same level; in the second case the index entries 00, 01, 10, 11 are determined by the number of address bits]
Insertions
Initially 1 index and 1 data page
0 address bits
insert records in data page
global depth d: size of index is $2^d$
local depth l: number of address bits used by a page
[Figure: an index with a single entry pointing to one data page of capacity b in storage]
Page 0 Overflows
d: global depth = 1
l: local depth = 1
[Figure: the index doubles to the entries 0 and 1, each pointing to its own data page in storage]
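[Figure: index with d = 2 and entries 00, 01, 10, 11; the two pages produced by the split have local depth 2; a page with l < d is shared by several index entries and contains records with same 1st bit of key]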
Page 01 Overflows
[Figure: index with d = 3 and entries 000 to 111; the pages produced by a split have local depth 3 (old local depth + 1), the remaining pages keep local depth 2]
Insertion Algorithm
If l < d: split the overflowed page (1 extra page)
If l = d: double the index and split the page
d is increased by 1 => 1 more bit for addressing
update pointers (either way):
a) if d prefix bits are used for addressing
d = d + 1;
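The insertion steps can be sketched as a small in-memory model of extendible hashing. The sketch makes several assumptions that the slides do not fix: keys are distinct integers, the d low-order (suffix) bits of the key are used for addressing, and the class and method names are hypothetical.

```python
class Page:
    def __init__(self, local_depth: int):
        self.l = local_depth   # local depth
        self.keys = []

class ExtendibleHash:
    def __init__(self, b: int):
        self.b = b                 # page capacity in records
        self.d = 0                 # global depth: the index has 2**d entries
        self.index = [Page(0)]     # initially 1 index entry and 1 data page

    def _addr(self, key: int) -> int:
        # Assumption: the d low-order (suffix) bits address the index.
        return key & ((1 << self.d) - 1)

    def lookup(self, key: int) -> bool:
        return key in self.index[self._addr(key)].keys

    def insert(self, key: int) -> None:
        page = self.index[self._addr(key)]
        if len(page.keys) < self.b:
            page.keys.append(key)
            return
        if page.l == self.d:
            # l = d: double the index; entry i + 2**d shares the page of entry i.
            self.index = self.index + self.index
            self.d += 1
        # Now l < d: split the overflowed page (1 extra page).
        page.l += 1
        sibling = Page(page.l)
        high_bit = 1 << (page.l - 1)
        old_keys, page.keys = page.keys, []
        for k in old_keys:
            (sibling if k & high_bit else page).keys.append(k)
        # Update pointers: entries with the new bit set now point to the sibling.
        for i, p in enumerate(self.index):
            if p is page and i & high_bit:
                self.index[i] = sibling
        # Retry; this splits again if all keys fell on one side.
        self.insert(key)

eh = ExtendibleHash(b=2)
for k in (4, 12, 1, 6, 9, 3, 8):
    eh.insert(k)
```

With suffix bits, doubling the index is a plain concatenation of the old index with itself. With prefix bits the pointers would have to be redistributed differently, which is the case distinction the slide starts to make.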
Deletion Algorithm
Find and delete record
Check sibling page
If both pages together hold no more than b records
merge pages and free empty page
decrease local depth l by 1 (records in the merged page share one address bit fewer)
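A hedged sketch of the sibling check, assuming that the l low-order bits of the key address a page and that the merge condition is "both pages fit into one page of capacity b". The helper names are hypothetical and pages are plain lists of keys.

```python
def sibling_entry(entry: int, l: int) -> int:
    """Index entry of the sibling page: flip the l-th address bit (l >= 1)."""
    if l < 1:
        raise ValueError("a page with local depth 0 has no sibling")
    return entry ^ (1 << (l - 1))

def can_merge(page_keys, sibling_keys, page_l: int, sibling_l: int, b: int) -> bool:
    """Siblings merge only at equal local depth and if all records fit in one page."""
    return page_l == sibling_l and len(page_keys) + len(sibling_keys) <= b

def merge(page_keys, sibling_keys, l: int):
    """Return the merged page contents and its decreased local depth."""
    return page_keys + sibling_keys, l - 1
```

For instance, the page reached through entry 101 with local depth 3 has its sibling at entry 001.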
[Figure: delete with merging. Before the deletion the index has d = 3 (entries 000 to 111); after merging, every page has l < d, and the index shrinks to d = 2 (entries 00, 01, 10, 11) with all local depths equal to 2]
Observations
A page splits and there are more than b
Performance
For n records and page size b
expected size of index (Flajolet):
$\frac{e}{b \ln 2}\, n^{(1 + \frac{1}{b})} \approx \frac{3.92}{b}\, n^{(1 + \frac{1}{b})}$
1 disk access/retrieval when index in main memory
2 disk accesses when index is on disk
overflows increase number of disk accesses
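As a worked illustration of the index size estimate, the approximation (3.92 / b) * n^(1 + 1/b) can be evaluated directly. The function name and the sample values of n and b are assumptions made for this example.

```python
import math

def expected_index_size(n: int, b: int) -> float:
    """Approximate expected number of index entries for n records, page size b."""
    return (3.92 / b) * n ** (1 + 1 / b)

# The constant 3.92 is e / ln 2 rounded to two decimals.
constant = math.e / math.log(2)

size = expected_index_size(1000, 10)
```

Because the exponent 1 + 1/b is larger than 1, the index grows slightly faster than linearly in n, and larger pages reduce both the constant factor and the exponent.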
[Figure: a full page with b records before splitting; two half-full pages after splitting]
After splitting: $u = \frac{b}{2b} = 50\%$
File Growing
A page splits whenever the splitting criterion is satisfied
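[Figure: the initial file of 5 pages holding the keys 125, 320, 90, 435, 16, 711, 402, 27, 737, 712, 613, 303, 4, 319, 215, 522; the new element 438 is inserted. Figure labels: "u = 17/22", "> 80%: split"]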
$b = b_{page} = 4$, $b_{overflow} = 1$
initially n = 5 pages
hash function h0 = k mod 5
splitting criterion: u > A%
alternatively, split when an overflow page overflows, etc.
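[Figure: the file after the first page is split. The first page keeps 320, 90 and the new page at the end of the file holds 125, 435, 215; these two pages are addressed with h1. The other pages (16, 711; 402, 27, 737, 712, 522; 613, 303, 438; 4, 319) are still addressed with h0. Figure label: "u = 18/25 < 80%"]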
Hash Functions
Initially h0=key mod n
As new pages are added at end of file, h0 alone becomes insufficient
The file will eventually double its size
In that case use h1 = key mod 2n
In the meantime
use h0 for pages not yet split
use h1 for pages that have already split
Split contents of page pointed to by p based on h1
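The interplay of h0, h1 and the split pointer p can be sketched as below. The sketch rests on assumptions the slides leave open: utilization is counted over primary page slots only, an overflow chain is modelled by letting a page list grow beyond b, and the class and method names are hypothetical.

```python
class LinearHashFile:
    def __init__(self, n: int = 5, b: int = 4, threshold: float = 0.8):
        self.n, self.b, self.threshold = n, b, threshold
        self.p = 0                             # next page to split
        self.pages = [[] for _ in range(n)]    # a list longer than b models overflow
        self.count = 0

    def _addr(self, key: int) -> int:
        a = key % self.n                       # h0
        if a < self.p:                         # page already split: use h1
            a = key % (2 * self.n)
        return a

    def utilization(self) -> float:
        # Assumption: only primary page slots are counted.
        return self.count / (len(self.pages) * self.b)

    def lookup(self, key: int) -> bool:
        return key in self.pages[self._addr(key)]

    def insert(self, key: int) -> None:
        self.pages[self._addr(key)].append(key)
        self.count += 1
        if self.utilization() > self.threshold:   # splitting criterion
            self._split()

    def _split(self) -> None:
        old, self.pages[self.p] = self.pages[self.p], []
        self.pages.append([])                  # new page p + n at end of file
        for k in old:                          # redistribute based on h1
            self.pages[k % (2 * self.n)].append(k)
        self.p += 1
        if self.p == self.n:                   # the file has doubled its size
            self.n, self.p = 2 * self.n, 0

f = LinearHashFile()
for k in (125, 320, 90, 435, 16, 711, 402, 27, 737, 712,
          613, 303, 4, 319, 215, 522, 438):
    f.insert(k)
```

Note that the page that splits is the one pointed to by p, not necessarily the page that received the new key. Here the insertion of 438 triggers the split of the first page.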
Hash Functions
Initially n pages and $0 \le h_0(k) < n$
Series of hash functions:
$h_{i+1}(k) = h_i(k)$ or $h_{i+1}(k) = h_i(k) + n 2^i$
Selection of hash function:
if $h_i(k) \ge p$ then use $h_i(k)$
else use $h_{i+1}(k)$
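The selection rule can be written as a single address function. The sketch assumes h_i(k) = k mod (n * 2^i), which agrees with h0 = key mod n and h1 = key mod 2n; the function name and parameter order are illustrative.

```python
def address(key: int, n: int, i: int, p: int) -> int:
    """Page address at level i with split pointer p (0 <= p < n * 2**i)."""
    a = key % (n * 2 ** i)             # h_i(k)
    if a >= p:
        return a                       # page not yet split: use h_i
    return key % (n * 2 ** (i + 1))    # page already split: use h_{i+1}

# h_{i+1}(k) is either h_i(k) or h_i(k) + n * 2**i.
n = 5
for i in range(3):
    for k in range(200):
        lo, hi = k % (n * 2 ** i), k % (n * 2 ** (i + 1))
        assert hi in (lo, lo + n * 2 ** i)
```

With n = 5, i = 0 and p = 1, key 125 is sent to page 5 through h1, while key 438 stays on page 3 through h0.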
[Figure: linear hashing with partial expansions on a file of 2n pages, with 2 pointers to pages of the same group. 1st expansion: the file grows from 2n to 3n pages. 2nd expansion: the file grows to 4n pages]
[Figure: disk accesses per retrieval (vertical axis from 1 to 1.6) for Linear Hashing and Linear Hashing with 2 partial expansions, b = 5]

Disk accesses per operation (b = 5, u = 0.85)

             Linear Hashing    Linear Hashing    Linear Hashing
                               2 part. exp.      3 part. exp.
retrieval    1.17              1.12              1.09
insertion    3.57              3.21              3.31
deletion     4.04              3.53              3.56