Unit-5 B+Trees & Hashing
B+ Tree
The B+ tree is a dynamic structure that adjusts to
changes in the file gracefully.
It is the most widely used index structure because it
adjusts well to changes and supports both equality
and range queries efficiently.
It avoids overflow pages.
The B+ tree is a balanced tree in which the internal
nodes direct the search and the leaf nodes contain
the data entries.
In order to retrieve all leaf pages efficiently, leaf
pages are linked using a doubly linked list. Range
queries can be efficiently answered by just
retrieving the sequence of leaf pages.
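The leaf chain can be illustrated with a minimal sketch. The `Leaf` class and the `link` and `range_scan` helpers are hypothetical names, keys stand in for data entries, and the scan is assumed to start at a leaf already located by a search (here simply the leftmost leaf).

```python
class Leaf:
    """A leaf page: sorted keys plus prev/next links to neighbouring leaves."""
    def __init__(self, keys):
        self.keys = list(keys)
        self.prev = None
        self.next = None

def link(leaves):
    """Chain the leaves left to right into a doubly linked list."""
    for left, right in zip(leaves, leaves[1:]):
        left.next, right.prev = right, left
    return leaves

def range_scan(start_leaf, low, high):
    """Collect keys in [low, high] by following next-pointers from start_leaf."""
    out = []
    leaf = start_leaf
    while leaf is not None:
        for k in leaf.keys:
            if k > high:          # keys are sorted, so the scan can stop here
                return out
            if k >= low:
                out.append(k)
        leaf = leaf.next
    return out

leaves = link([Leaf([2, 3, 5, 7]), Leaf([14, 16]), Leaf([19, 20, 22]),
               Leaf([24, 27, 29]), Leaf([33, 34, 38, 39])])
print(range_scan(leaves[0], 15, 27))   # [16, 19, 20, 22, 24, 27]
```

Only leaf pages are touched once the starting leaf is known; the index levels are not revisited.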
B+ Tree (cont.)
[Figure: index entries (direct search) on top of the data entries (the "sequence set")]
Characteristics of B+ tree:
Operations (insert/delete) on the tree keep it balanced.
A minimum occupancy of 50% is guaranteed for each node
except the root node. Each non-root node contains m entries,
where d <= m <= 2d; the root needs only 1 <= m <= 2d.
The parameter d is called the order of the tree.
Searching for a record requires just a traversal from the root
to the appropriate leaf.
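A minimal sketch of the root-to-leaf traversal and of the occupancy rule, assuming an in-memory tree. The `Node` class and the function names are hypothetical, and the order d = 2 is an assumption chosen to fit the sample keys.

```python
from bisect import bisect_right

class Node:
    """Hypothetical page: index nodes hold keys and children, leaves hold keys only."""
    def __init__(self, keys, children=None):
        self.keys = keys
        self.children = children   # None marks a leaf

def find_leaf(node, key):
    """Follow one pointer per level; return (leaf, number of levels visited)."""
    levels = 1
    while node.children is not None:
        # child i covers the keys k with keys[i-1] <= k < keys[i]
        node = node.children[bisect_right(node.keys, key)]
        levels += 1
    return node, levels

def contains(root, key):
    leaf, _ = find_leaf(root, key)
    return key in leaf.keys

def occupancy_ok(node, d, is_root=True):
    """Check d <= m <= 2d for non-root nodes and 1 <= m <= 2d for the root."""
    m = len(node.keys)
    if m > 2 * d or m < (1 if is_root else d):
        return False
    return all(occupancy_ok(c, d, False) for c in (node.children or []))

leaves = [Node([2, 3]), Node([5, 7, 8]), Node([14, 16]),
          Node([19, 20, 22]), Node([24, 27, 29]), Node([33, 34, 38, 39])]
root = Node([17], [Node([5, 13], leaves[:3]), Node([24, 30], leaves[3:])])

print(contains(root, 8))        # True
print(contains(root, 15))       # False
print(find_leaf(root, 8)[1])    # 3 levels: root, index node, leaf
print(occupancy_ok(root, d=2))  # True
```

The number of pages visited equals the height of the tree, which is the same for every key because the tree is balanced.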
B+ Tree (cont.)
[Figure: non-leaf pages on top; leaf pages below, sorted by search key]
Leaf pages contain data entries, and are chained (prev & next)
Non-leaf pages have index entries; only used to direct searches:
Index entry layout: P0 | K1 | P1 | K2 | P2 | ... | Km | Pm
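[Figure: example B+ tree. Root: 13 17 24 30. Leaves: 2* 3* 5* 7* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*]
[Figure: intermediate steps of inserting 8*. Leaf entries: 2* 3* 5* 7* 8*. Index keys: 5 13 24 30]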
B+ Tree (cont.)
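[Figure: B+ tree after inserting 8*. Root: 17. Index nodes: 5 13 | 24 30. Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 19* 20* 22* | 24* 27* 29* | 33* 34* 38* 39*]
[Figure: B+ tree after deleting 19* and 20*. Root: 17. Index nodes: 5 13 | 27 30. Leaves: 2* 3* | 5* 7* 8* | 14* 16* | 22* 24* | 27* 29* | 33* 34* 38* 39*]
[Figure: B+ tree with root 5 13 17 30]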
[Figure: static hashing. h(key) mod N maps a key to one of the N primary bucket pages (0 to N-1); each primary page may have a chain of overflow pages]
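The mapping h(key) mod N onto primary pages with overflow chains can be sketched as follows. This is an illustration under assumptions: keys are non-negative integers, h is the identity, and the class `StaticHashFile` and its methods are hypothetical names.

```python
class StaticHashFile:
    """N fixed primary bucket pages; a full bucket grows a chain of overflow pages."""
    def __init__(self, n_buckets, page_capacity):
        self.n = n_buckets
        self.capacity = page_capacity
        # each bucket is a list of pages; page 0 is the primary page
        self.buckets = [[[]] for _ in range(n_buckets)]

    def _pages(self, key):
        return self.buckets[key % self.n]     # h(key) mod N, with h = identity

    def insert(self, key):
        pages = self._pages(key)
        if len(pages[-1]) == self.capacity:
            pages.append([])                  # allocate an overflow page
        pages[-1].append(key)

    def search(self, key):
        """Return (found, number of pages read)."""
        pages = self._pages(key)
        for i, page in enumerate(pages, start=1):
            if key in page:
                return True, i
        return False, len(pages)

f = StaticHashFile(n_buckets=4, page_capacity=2)
for k in [1, 5, 9, 13, 2]:
    f.insert(k)
print(f.search(2))    # (True, 1): found on the primary page
print(f.search(13))   # (True, 2): found on an overflow page
```

Because N is fixed, a bucket that keeps receiving keys can only grow its overflow chain, which is the cost that the dynamic schemes try to avoid.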
[Figure: extendible hashing. A directory of global depth 2 (elements 00, 01, 10, 11) points to the data pages. Bucket B: 1* 5* 21*; Bucket C: 10*; Bucket D: 15* 7* 19*]
[Figure: splitting Bucket A, shown with a directory of global depth 2 (elements 00 to 11) and with the doubled directory of global depth 3 (elements 000 to 111). Bucket B: 1* 5* 21* 13*; Bucket C: 10*; Bucket D: 15* 7* 19*; Bucket A2: 4* 12* 20* (the 'split image' of Bucket A)]
Extendible Hashing(cont.)
After doubling the directory, to redistribute entries
across the old bucket and its split image, we
consider the last three bits of h(r).
Global depth of directory: Max number of bits needed
to tell which bucket an entry belongs to.
Local depth of a bucket: Number of bits used to
determine if an entry belongs to this bucket.
A bucket split does not necessarily require a
directory doubling. If a bucket whose local depth
is equal to the global depth is split, the directory
must be doubled.
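The rule for when a split forces a directory doubling can be sketched as below. Assumptions: h is the identity on integer keys, a directory slot is chosen by the last global-depth bits of the key, each bucket holds four entries, and the keys 32 and 16 are assumed filler for Bucket A. The class and method names are hypothetical.

```python
class Bucket:
    def __init__(self, local_depth):
        self.local_depth = local_depth
        self.items = []

class ExtendibleHash:
    def __init__(self, capacity=4, global_depth=2):
        self.capacity = capacity
        self.global_depth = global_depth
        self.directory = [Bucket(global_depth) for _ in range(2 ** global_depth)]

    def _slot(self, key):
        # last global_depth bits of h(key); h is the identity here (assumption)
        return key & ((1 << self.global_depth) - 1)

    def insert(self, key):
        bucket = self.directory[self._slot(key)]
        if len(bucket.items) < self.capacity:
            bucket.items.append(key)
            return
        # Full bucket: double the directory only if local depth == global depth.
        if bucket.local_depth == self.global_depth:
            self.directory = self.directory + self.directory
            self.global_depth += 1
        self._split(bucket)
        self.insert(key)   # retry; splits again if every entry stayed on one side

    def _split(self, bucket):
        bucket.local_depth += 1
        image = Bucket(bucket.local_depth)
        bit = 1 << (bucket.local_depth - 1)    # the newly significant bit
        old, bucket.items = bucket.items, []
        for k in old:
            (image if k & bit else bucket).items.append(k)
        for i, b in enumerate(self.directory):
            if b is bucket and i & bit:
                self.directory[i] = image      # redirect half of the slots

t = ExtendibleHash()
for k in [4, 12, 32, 16, 1, 5, 21, 10, 15, 7, 19, 13]:
    t.insert(k)
t.insert(20)   # slot 00 is full with local depth 2 == global depth: directory doubles
print(t.global_depth, t.directory[0].items, t.directory[4].items)  # 3 [32, 16] [4, 12, 20]
t.insert(9)    # slot 001 is full with local depth 2 < global depth 3: split only
print(t.global_depth, t.directory[1].items, t.directory[5].items)  # 3 [1, 9] [5, 21, 13]
```

The second insert shows a split without doubling: the directory already has two slots for the overflowing bucket, so one of them is simply redirected to the split image.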
Extendible Hashing(cont.)
Delete:
To delete a data entry, the data entry is located and
removed.
If removal of data entry makes bucket empty, it can be
merged with its 'split image'. Merging buckets decreases
the local depth.
If each directory element points to same bucket as its split
image, we can halve the directory and reduce the global
depth.
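A minimal sketch of the merge and halving conditions, assuming buckets are plain dictionaries with a local depth of at least 1 and that directory slots are indexed by the last bits of the hash value. The helper names are hypothetical, and the contents of Bucket A are assumed.

```python
def split_image_slot(slot, local_depth):
    """Slot of the split image: flip the bit that the last split made significant."""
    return slot ^ (1 << (local_depth - 1))

def merge_empty(directory, slot):
    """Merge the empty bucket at `slot` into its split image; return True on success."""
    bucket = directory[slot]
    image = directory[split_image_slot(slot, bucket["depth"])]
    if bucket["items"] or image is bucket or image["depth"] != bucket["depth"]:
        return False
    image["depth"] -= 1                  # merging decreases the local depth
    for i, b in enumerate(directory):
        if b is bucket:
            directory[i] = image
    return True

def can_halve(directory):
    """True if every element points to the same bucket as its split image."""
    half = len(directory) // 2
    return half >= 1 and all(directory[i] is directory[i + half] for i in range(half))

A  = {"depth": 3, "items": [32, 16]}     # assumed contents
A2 = {"depth": 3, "items": []}           # emptied by deleting 4*, 12* and 20*
B  = {"depth": 2, "items": [1, 5, 21, 13]}
C  = {"depth": 2, "items": [10]}
D  = {"depth": 2, "items": [15, 7, 19]}
directory = [A, B, C, D, A2, B, C, D]    # global depth 3

print(can_halve(directory))              # False: slots 000 and 100 differ
print(merge_empty(directory, 4))         # True: A2 is merged into A
print(can_halve(directory))              # True
directory = directory[:len(directory) // 2]   # halve: global depth 3 -> 2
print(len(directory), A["depth"])        # 4 2
```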
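[Figure: linear hashing file with Next = 0. Columns h1 and h0 give the bucket address of each primary page; for example, the bucket 9* 25* 5* has h1 = 001 and h0 = 01, and 5* is the data entry r with h(r) = 5]
Fig: After inserting 37*, 29*, 22*, 66*, 34* and 50*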
Linear Hashing (Contd.)
When Next is equal to N_Level − 1 (where N_Level is the number of
buckets at the start of round Level) and a split is triggered, the
number of buckets after the split is twice the number at
the beginning of the round, and we start a new round
with Level incremented by 1 and Next reset to 0.
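The round structure can be sketched as follows. Assumptions: h is the identity on integer keys, h_Level(key) = key mod N_Level, the file starts with 4 buckets of capacity 4, a split is triggered whenever an insert lands in an already full bucket, and overflow pages are modelled simply as a list longer than the capacity. The class name and the initial keys are hypothetical.

```python
class LinearHash:
    def __init__(self, n0=4, capacity=4):
        self.n0 = n0                  # number of buckets in round 0
        self.capacity = capacity      # entries per primary page
        self.level = 0
        self.next = 0
        self.buckets = [[] for _ in range(n0)]

    def n_level(self):
        return self.n0 * 2 ** self.level

    def address(self, key):
        b = key % self.n_level()              # h_Level
        if b < self.next:                     # bucket already split in this round
            b = key % (2 * self.n_level())    # h_Level+1
        return b

    def insert(self, key):
        bucket = self.buckets[self.address(key)]
        overflow = len(bucket) >= self.capacity
        bucket.append(key)
        if overflow:
            self._split()             # always splits bucket Next, not the full one

    def _split(self):
        n = self.n_level()
        old, self.buckets[self.next] = self.buckets[self.next], []
        self.buckets.append([])       # the split image, at index next + n
        for k in old:
            self.buckets[k % (2 * n)].append(k)
        if self.next == n - 1:        # the last bucket of the round was split
            self.level += 1
            self.next = 0
        else:
            self.next += 1

f = LinearHash()
for k in [32, 44, 36, 9, 25, 5, 14, 18, 10, 30, 31, 35, 7, 11]:
    f.insert(k)
f.insert(43)
print(f.level, f.next, len(f.buckets))   # 0 1 5
for k in [37, 29, 22, 66, 34, 50]:
    f.insert(k)
print(f.level, f.next, len(f.buckets))   # 1 0 8
```

After the last insert the file has 8 buckets, twice the 4 it started the round with, and bucket 2 still holds five entries, so one overflow page survives into the new round.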
Delete:
It is essentially the inverse of insertion. If the last bucket in
the file is empty, it can be removed and Next can be
decremented.
If Next is 0 and the last bucket becomes empty, Next is made
to point to bucket (M/2) − 1, where M is the current number
of buckets, Level is decremented, and the empty bucket is
removed.
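The two deletion cases can be sketched as a single step that undoes the most recent split. This is a sketch under assumptions: the file is a Python list of buckets, `shrink` is a hypothetical helper, and it is called after a data entry has already been removed.

```python
def shrink(buckets, level, next_):
    """Drop an empty last bucket and step Next/Level back; return (level, next)."""
    if not buckets or buckets[-1]:
        return level, next_               # last bucket is not empty: nothing to do
    if next_ > 0:
        buckets.pop()                     # undo the split of bucket Next - 1
        return level, next_ - 1
    if level == 0:
        return level, next_               # round 0 with Next = 0: no split to undo
    m = len(buckets)                      # current number of buckets
    buckets.pop()
    return level - 1, m // 2 - 1          # back to the end of the previous round

# Next = 0 at the start of a round: fall back to the previous round.
buckets = [[32], [9, 25], [18, 10], [35, 11], [44, 36], [5, 37], [14, 30], []]
print(shrink(buckets, 1, 0), len(buckets))   # (0, 3) 7

# Next > 0: just remove the last bucket and decrement Next.
buckets = [[32], [9], [18], [35], [44], []]
print(shrink(buckets, 0, 2), len(buckets))   # (0, 1) 5
```

No entries need to move in either case, because the removed bucket is empty and its partner already holds everything that hashes to it under the coarser hash function.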
Linear Hashing (Contd.)
An equality selection costs just one disk I/O unless
the bucket has overflow pages.
Since buckets are split round-robin, long overflow
chains don’t develop. If the data distribution is very
skewed (non-uniform), however, overflow chains
could cause Linear Hashing performance to be worse
than that of Extendible Hashing.
Extendible Hashing Vs Linear Hashing
Extendible and Linear Hashing are closely related.
Extendible Hashing uses a directory to locate buckets and
splits the bucket when it is full. Avoids overflow pages.
Linear Hashing avoids a directory by splitting the buckets in
a round-robin fashion. Linear Hashing proceeds in rounds.
Uses overflow pages.
Moving from h_Level to h_(Level+1) in Linear Hashing corresponds to
doubling the directory in Extendible Hashing.
By always splitting the appropriate bucket, Extendible
Hashing may lead to a reduced number of splits and higher
bucket occupancy.
For uniform distributions, Linear Hashing has a lower
average cost for equality selections.
Extendible Hashing Vs Linear Hashing (Cont.)