LINEAR HASHING

E0 261

Jayant Haritsa
Computer Science and Automation
Indian Institute of Science

JAN 2021 LINEAR-HASHING Slide 1


Indexing Techniques
• Comparison-based indexing (B+-trees)
• Computation-based indexing (Hashing)
• Main difference: search cost in the former is a function of database size; in the latter it is constant (in the absence of long overflow chains).
• But the former can support range searches, whereas the latter cannot.



Hash-Based Indexes



Static Hashing
• # primary pages N fixed, allocated sequentially,
never de-allocated; overflow pages if needed.
• h(k) mod N = bucket to which data entry with
key k belongs.

[Figure: key -> h -> h(key) mod N selects one of the primary bucket pages 0 .. N-1; each primary bucket page may have a chain of overflow pages.]

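The static scheme above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the class name `StaticHashFile`, the page capacity, and the use of the identity function in place of h are all assumptions made for the example.

```python
class StaticHashFile:
    """Static hashing: N primary pages fixed up front, overflow pages chained on demand."""

    def __init__(self, n_buckets, page_capacity):
        self.n = n_buckets
        self.capacity = page_capacity
        # Each bucket is a chain of pages: page 0 is the primary page,
        # any further pages are overflow pages.
        self.buckets = [[[]] for _ in range(n_buckets)]

    def _bucket(self, key):
        # h(k) mod N; the identity on integers stands in for h (an assumption).
        return key % self.n

    def insert(self, key):
        chain = self.buckets[self._bucket(key)]
        if len(chain[-1]) == self.capacity:
            chain.append([])  # last page is full: allocate an overflow page
        chain[-1].append(key)

    def search(self, key):
        # Reads the primary page, then every overflow page in the chain.
        return any(key in page for page in self.buckets[self._bucket(key)])


f = StaticHashFile(n_buckets=4, page_capacity=2)
for k in [4, 8, 12, 5, 7]:
    f.insert(k)
```

Keys 4, 8 and 12 all map to bucket 0, so the third one lands on an overflow page. N never changes, so such chains can only grow as the file grows.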
Extendible Hashing
• Situation: Bucket (primary page) becomes full.
Why not re-organize file by doubling # of buckets?
– Reading and writing all pages is expensive!
– Idea: Use directory of pointers to buckets, double # of
buckets by doubling the directory, splitting just the
bucket that overflowed.
– Directory much smaller than file, so doubling it is much
cheaper. Only one page of data entries is split. No
overflow page.
– Trick lies in how hash function is adjusted.



Example
• Directory is array of size 4.
• To find bucket for k, take last 'global depth' # bits of h(k)
 – If h(k) = 5 = binary 101, it is in bucket pointed to by 01.

GLOBAL DEPTH = 2

DIRECTORY   DATA PAGES                  LOCAL DEPTH
00          Bucket A: 4* 12* 32* 16*    2
01          Bucket B: 1* 5* 21* 13*     2
10          Bucket C: 10*               2
11          Bucket D: 15* 7* 19*        2

Insert: If bucket is full, split it (allocate new page, re-distribute). If necessary, double the directory. (Splitting a bucket does not always require doubling; we can tell by comparing global depth with local depth for the split bucket.)

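The directory lookup in this example amounts to masking off the last 'global depth' bits. A small sketch, using the bucket contents from the example; the names `directory`, `buckets` and `find_bucket` are made up for illustration, and the stored numbers are treated as the hash values h(k).

```python
GLOBAL_DEPTH = 2

# Directory slot -> bucket name, indexed by the last GLOBAL_DEPTH bits of h(k).
directory = ["A", "B", "C", "D"]  # slots 00, 01, 10, 11

buckets = {
    "A": [4, 12, 32, 16],
    "B": [1, 5, 21, 13],
    "C": [10],
    "D": [15, 7, 19],
}


def find_bucket(hash_value, global_depth=GLOBAL_DEPTH):
    slot = hash_value & ((1 << global_depth) - 1)  # keep the last global_depth bits
    return directory[slot]


# Every stored value should sit in the bucket its last two bits point to.
placed_ok = all(
    find_bucket(v) == name for name, values in buckets.items() for v in values
)
```

For instance, `find_bucket(5)` masks binary 101 down to 01 and returns bucket B.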
Insert h(k)=20* in Bucket A (Causes Doubling)

After splitting Bucket A, before doubling the directory (GLOBAL DEPTH = 2):

DIRECTORY   DATA PAGES                                            LOCAL DEPTH
00          Bucket A1: 32* 16*                                    2
01          Bucket B: 1* 5* 21* 13*                               2
10          Bucket C: 10*                                         2
11          Bucket D: 15* 7* 19*                                  2
--          Bucket A2 ('split image' of Bucket A1): 4* 12* 20*    2

After doubling the directory (GLOBAL DEPTH = 3):

DIRECTORY   DATA PAGES                                            LOCAL DEPTH
000         Bucket A1: 32* 16*                                    3
001         Bucket B: 1* 5* 21* 13*                               2
010         Bucket C: 10*                                         2
011         Bucket D: 15* 7* 19*                                  2
100         Bucket A2 ('split image' of Bucket A1): 4* 12* 20*    3
101         Bucket B
110         Bucket C
111         Bucket D

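The split-and-double step can be sketched as follows. This is a simplified in-memory model under stated assumptions: the class names, the capacity of 4 entries per bucket, and storing hash values directly are choices made for the example, not details given in the slides.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = []


class ExtendibleHash:
    def __init__(self, global_depth=2, capacity=4):
        self.global_depth = global_depth
        self.capacity = capacity
        self.directory = [Bucket(global_depth) for _ in range(1 << global_depth)]

    def _slot(self, h):
        return h & ((1 << self.global_depth) - 1)  # last global_depth bits

    def insert(self, h):
        bucket = self.directory[self._slot(h)]
        if len(bucket.items) < self.capacity:
            bucket.items.append(h)
            return
        # Bucket is full. The directory must double only when the bucket
        # already uses as many bits as the directory does.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        # Split: the next bit (position local_depth) separates bucket and image.
        bit = 1 << bucket.local_depth
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        old_items, bucket.items = bucket.items, []
        for slot, b in enumerate(self.directory):
            if b is bucket and slot & bit:
                self.directory[slot] = image
        for item in old_items:
            self.directory[self._slot(item)].items.append(item)
        self.insert(h)  # retry; splits again if the entries did not separate


eh = ExtendibleHash()
for v in [4, 12, 32, 16, 1, 5, 21, 13, 10, 15, 7, 19]:
    eh.insert(v)
eh.insert(20)  # Bucket A is full and local depth == global depth: doubling
```

After the last insert the directory has 8 slots, slot 000 holds 32 and 16, and slot 100 holds 4, 12 and 20. Note that the retry never terminates if more than `capacity` entries share one hash value; the sketch keeps splitting and doubling, which is how duplicates blow up the directory.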
Comments on Extendible Hashing
• If directory fits in memory, equality search
answered with one disk access; else two.
• Directory grows in spurts, and, if the distribution of
hash values is skewed, directory can grow large.
• Multiple entries with the same hash value cause problems
 – Unfixable issue for duplicates: no number of hash bits can separate them
 – Results in EH directory exploding!



Linear Hashing



Linear Hashing
• This is another dynamic hashing scheme, an
alternative to Extendible Hashing.
• LH handles the problem of long overflow chains
without using a directory, and handles duplicates.
• Idea: Use a family of hash functions h_0, h_1, h_2, ...
 – h_i(key) = h(key) mod (2^i N); N = initial # buckets
 – h is some hash function (range is 0 to 2^|MachineBitLength| - 1)
 – If N = 2^d0, for some d0, h_i consists of applying h and looking at the last d_i bits, where d_i = d0 + i.
 – h_{i+1} doubles the range of h_i (similar to directory doubling)
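A quick sketch of the family of hash functions, assuming N = 4 (so d0 = 2) and using the identity function as a stand-in for h; both choices are assumptions made for the example.

```python
N = 4   # initial number of buckets, N = 2**D0
D0 = 2


def h(key):
    # Placeholder base hash (an assumption): identity on non-negative integers.
    return key


def h_i(i, key):
    return h(key) % (2**i * N)


def last_bits(i, key):
    d_i = D0 + i
    return h(key) & ((1 << d_i) - 1)  # the last d_i bits of h(key)


key = 44  # binary 101100
values = [h_i(i, key) for i in range(3)]  # h_0, h_1, h_2 applied to 44
```

Here `values` is `[0, 4, 12]`: each step looks at one more bit of h(key), so h_{i+1}(key) is either h_i(key) or h_i(key) + 2^i N.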



Linear Hashing (Contd.)
• Directory avoided in LH by using overflow pages,
and choosing bucket to split round-robin.
 – Splitting proceeds in 'rounds'. Round ends when all N_R initial (for round R) buckets are split.
 – Buckets 0 to Next-1 have been split; Next to N_R - 1 yet to be split.
 – Current round number is Level.



Overview of LH File

• In the middle of a round.

[Figure: bucket layout of the file, top to bottom]
• Buckets 0 to Next-1: buckets split in this round. If h_{Level-1}(search key value) is in this range, must use h_Level(search key value) to decide if entry is in 'split image' bucket.
• Bucket Next: bucket to be split.
• Buckets 0 to N_R - 1: buckets that existed at the beginning of this round; this is the range of h_{Level-1}.
• Buckets N_R to M-1: 'split image' buckets, created (through splitting of other buckets) in this round.


LH Search

To find bucket for data entry k, find h_{Level-1}(k):

• If h_{Level-1}(k) is in the range Next to N_R - 1, k belongs here.
• Else, k could belong to bucket h_{Level-1}(k) or bucket h_{Level-1}(k) + N_R; must apply h_Level(k) to find out.
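The two-step search rule can be written as a short function. A sketch under the slides' convention that Level starts at 1, so that N_R = N * 2^(Level-1); the function name and the use of raw hash values as input are assumptions for the example.

```python
def lh_bucket(hash_value, level, next_, n_initial):
    """Bucket number for a hash value in a linear hashed file."""
    n_r = n_initial * 2 ** (level - 1)  # buckets at the start of this round
    b = hash_value % n_r                # h_{Level-1}
    if b >= next_:
        return b                        # bucket not yet split in this round
    return hash_value % (2 * n_r)       # h_Level: either b or b + N_R


# State assumed for illustration: Level=1, Next=1, N=4 (bucket 0 already split).
found = {k: lh_bucket(k, level=1, next_=1, n_initial=4) for k in (32, 44, 9, 43)}
```

With bucket 0 already split, 32 stays in bucket 0 while 44 is found in its split image, bucket 4; 9 and 43 fall in unsplit buckets 1 and 3.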



Simpler Formulation

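Bucket range        Hash function that addresses it
0 to Next-1         h_level
Next to N_R - 1     h_{level-1}
N_R to M-1          h_level
M to 2N_R - 1       (not yet allocated)

M = N_R + Next is the current # of buckets.

• m = h_level(k)
  if m >= M, m = m - N_R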
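The one-test formulation translates directly into code. A sketch assuming Level starts at 1, so N_R = N * 2^(Level-1), and that M = N_R + Next is the number of buckets currently allocated; the function name is made up for the example.

```python
def lh_bucket_simple(hash_value, level, next_, n_initial):
    n_r = n_initial * 2 ** (level - 1)  # buckets at the start of the round
    m_total = n_r + next_               # M: buckets that currently exist
    m = hash_value % (2 * n_r)          # h_level
    if m >= m_total:
        m -= n_r                        # that split image does not exist yet
    return m


# State assumed for illustration: Level=1, Next=3, N=4, so N_R=4 and M=7.
addresses = {k: lh_bucket_simple(k, 1, 3, 4) for k in (50, 31, 44, 14, 9)}
```

For key 31, h_level gives 7, which is not allocated yet (M = 7), so the entry lives in bucket 7 - 4 = 3. Key 14 hashes to 6, which exists, so no adjustment is needed.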



LH Insert

• Find bucket by applying h_{Level-1} / h_Level:
 – If bucket to insert into is full:
   • Add overflow page and insert data entry.
   • (Maybe) Split Next bucket and increment Next.
• Can choose any criterion to 'trigger' split.
• Since buckets are split round-robin, long overflow chains don't develop!
• Doubling of directory in Extendible Hashing is similar; switching of hash functions is implicit in how the # of bits examined is increased.

Example of Linear Hashing
• On split, h_Level is used to re-distribute entries.

Before: Level=1, N=4, Next=0

h1    h0    PRIMARY PAGES
000   00    32* 44* 36*
001   01    9* 25* 5*
010   10    14* 18* 10* 30*
011   11    31* 35* 7* 11*

(5* is a data entry k with h(k)=5. Each row is a primary bucket page. The PAGES columns are the actual contents of the linear hashed file.)

After Add 43*: Level=1, Next=1

h1    h0    PRIMARY PAGES      OVERFLOW PAGES
000   00    32*
001   01    9* 25* 5*
010   10    14* 18* 10* 30*
011   11    31* 35* 7* 11*     43*
100   00    44* 36*
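The insertion of 43* can be replayed with a small in-memory model. This is a sketch under stated assumptions: the class name, a primary page capacity of 4 entries, "split whenever an insert lands on a full bucket" as the trigger, and storing each bucket as one list (entries beyond the capacity stand for the overflow page) are all choices made for the example.

```python
class LinearHashFile:
    def __init__(self, n_initial=4, capacity=4):
        self.n = n_initial
        self.capacity = capacity  # entries per primary page
        self.level = 1            # the slides' convention: rounds start at 1
        self.next = 0
        self.buckets = [[] for _ in range(n_initial)]

    def _n_r(self):
        return self.n * 2 ** (self.level - 1)

    def _address(self, key):
        b = key % self._n_r()            # h_{Level-1}
        if b < self.next:
            b = key % (2 * self._n_r())  # h_Level
        return b

    def insert(self, key):
        bucket = self.buckets[self._address(key)]
        overflowed = len(bucket) >= self.capacity
        bucket.append(key)  # entries past `capacity` live on overflow pages
        if overflowed:
            self._split()   # assumed trigger: any insert into a full bucket

    def _split(self):
        n_r = self._n_r()
        old = self.buckets[self.next]
        # The split image is always bucket Next + N_R, the next free number.
        self.buckets.append([k for k in old if k % (2 * n_r) != self.next])
        self.buckets[self.next] = [k for k in old if k % (2 * n_r) == self.next]
        self.next += 1
        if self.next == n_r:  # all buckets of this round are split
            self.level += 1
            self.next = 0


f = LinearHashFile()
f.buckets = [[32, 44, 36], [9, 25, 5], [14, 18, 10, 30], [31, 35, 7, 11]]
f.insert(43)
```

43 overflows bucket 3 (binary 11), yet the bucket that gets split is bucket Next = 0: 32 stays, while 44 and 36 move to the new bucket 4 (binary 100).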


Example: End of a Round

Before: Level=1, Next=3

h1    h0    PRIMARY PAGES      OVERFLOW PAGES
000   00    32*
001   01    9* 25*
010   10    66* 18* 10* 34*
011   11    31* 35* 7* 11*     43*
100   00    44* 36*
101   01    5* 37* 29*
110   10    14* 30* 22*

After Add 50*: Level=2, Next=0

h1    h0    PRIMARY PAGES      OVERFLOW PAGES
000   00    32*
001   01    9* 25*
010   10    66* 18* 10* 34*    50*
011   11    43* 35* 11*
100   00    44* 36*
101   01    5* 37* 29*
110   10    14* 30* 22*
111   11    31* 7*
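The last split of the round can be checked by hand with a few lines of arithmetic. A sketch assuming N = 4 and Level = 1, so entries are re-distributed with h_1(k) = k mod 8; the variable names are made up for the example.

```python
n_r, next_, level = 4, 3, 1
bucket_3 = [31, 35, 7, 11, 43]  # primary page plus the overflow entry 43

# Re-distribute with h_Level (mod 2*N_R): entries either stay in bucket Next
# or move to its split image, bucket Next + N_R.
stay = [k for k in bucket_3 if k % (2 * n_r) == next_]
image = [k for k in bucket_3 if k % (2 * n_r) == next_ + n_r]

next_ += 1
if next_ == n_r:  # every bucket that existed at the start of the round is split
    level, next_ = level + 1, 0
```

35, 11 and 43 end in binary 011 and stay in bucket 3, which no longer needs an overflow page; 31 and 7 end in 111 and move to bucket 7. Next would become 4 = N_R, so the round ends with Level=2, Next=0.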


Physical Address Computing

• Logical bucket address given by hashing must be converted into the physical address of the bucket on disk.
• Simple solution: contiguous allocation on disk, but not feasible in general
• In practice, keep giving larger and larger contiguous chunks as the file size grows (Equation 6 doubles chunk size with K=1)



Points to Note

• Hash Function:
 – h_i(k) = [(A k) mod w] mod (2^i N)
   where A = 6125423371 and w = 2^32
   (A is relatively prime to w)
• In LH, a split has only local impact, whereas in the B-tree, a split is “global”
• In LH, duplicates don’t cause a problem because of the presence of overflow buckets
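The multiplicative hash function can be tried out directly. A sketch assuming integer keys and N = 4 initial buckets (the default `n_initial` is an assumption); A and w are the constants given above.

```python
from math import gcd

A = 6125423371
W = 2**32


def h_i(i, key, n_initial=4):
    return ((A * key) % W) % (2**i * n_initial)


coprime = gcd(A, W) == 1  # A is odd, so it shares no factor with 2**32

# Because A is relatively prime to w, k -> (A * k) mod w maps distinct keys
# below w to distinct values.
scrambled = {(A * k) % W for k in range(1000)}
```

All 1000 scrambled values are distinct, and for any key h_{i+1}(key) is either h_i(key) or h_i(key) + 2^i N, as the scheme requires.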



Problem
• For small cardinality domains (e.g. gender), neither hashing nor tree-based indexing works well
• Solution: Bitmap Indexes
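A rough sketch of the bitmap idea: keep one bit vector per distinct value, with bit i set when row i holds that value. The sample column and all names are hypothetical, and a Python integer stands in for the bit vector.

```python
rows = ["F", "M", "M", "F", "F"]  # hypothetical low-cardinality column

# One bitmap per distinct value; bit `pos` is set when row `pos` has the value.
bitmaps = {}
for pos, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << pos)


def matching_rows(value):
    bits = bitmaps.get(value, 0)
    return [pos for pos in range(len(rows)) if (bits >> pos) & 1]
```

Here `matching_rows("F")` returns rows 0, 3 and 4. Only as many bitmaps are needed as there are distinct values, which is why the approach suits small domains.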



END LINEAR-HASHING

E0 361
