Unit-5 B+Trees & Hashing

The document discusses B+ trees and hash-based indexes, detailing their structures, operations, and characteristics. B+ trees are dynamic, balanced structures that efficiently handle insertions, deletions, and queries, while hash-based indexes focus on equality selections and utilize buckets for data storage. It also covers static, extendible, and linear hashing techniques, highlighting their advantages and limitations in managing data entries.


Tree-Structured Indexing

B+ Tree
 The B+ tree is a dynamic structure that adjusts to
changes in the file gracefully.
 It is the most widely used index structure because it
adjusts well to changes and supports both equality
and range queries efficiently.
 It avoids overflow pages.
 The B+ tree is a balanced tree in which the internal
nodes direct the search and the leaf nodes contain
the data entries.
 In order to retrieve all leaf pages efficiently, leaf
pages are linked using a doubly linked list. Range
queries can be efficiently answered by just
retrieving the sequence of leaf pages.
B+ Tree (cont.)

Fig: Structure of a B+ tree: index entries in the non-leaf pages direct the search; data entries in the leaf pages form the "sequence set".
Characteristics of B+ tree:
 Operations (insert/delete) on the tree keep it balanced.
 A minimum occupancy of 50% is guaranteed for each node
except the root node: each node contains m entries, where
d <= m <= 2d. The parameter d is called the order of the tree.
 Searching for a record requires just a traversal from the root
to the appropriate leaf.
B+ Tree (cont.)

 Leaf pages contain data entries, sorted by search key, and are chained (prev & next pointers).
 Non-leaf pages contain index entries and are used only to direct searches.

Fig: Format of a node: <P0, K1, P1, K2, P2, ..., Km, Pm>, where K1 < K2 < ... < Km are search key values and each Pi is a page pointer.


B+ Tree (cont.)
Search:
 Search begins at the root, and key comparisons direct it to a leaf.
 To search for entry 5*, we follow the left-most pointer of the root,
since 5 < 13. To search for the entry 14*, we follow the
second pointer, since 13 < 14 < 17.
 Search cost = log_F N (F = fanout, N = # leaf pages).

Root: [13 | 17 | 24 | 30]
Leaves: [2* 3* 5* 7*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

Fig: Example B+ Tree, order d=2

B+ Tree (cont.)
To insert a Data Entry into a B+ Tree:
 Find the correct leaf L.
 Put the data entry onto L if L has enough space.
 Otherwise, split L into two nodes (L and a new node L2):
• Redistribute entries evenly, copy up the middle key.
• Insert an index entry pointing to L2 into the parent of L.
 This can happen recursively.
 To split an index node, redistribute entries evenly, but
push up the middle key. (Contrast with leaf splits.)
 Splits “grow” tree; root split increases height by
one level.
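The leaf-split step can be sketched as follows (a simplified in-memory sketch; modeling a leaf as a sorted Python list of capacity 2d is an assumption for illustration):

```python
import bisect

# Sketch of inserting into a leaf of capacity 2*d, splitting on overflow.
# The middle key is *copied* up: it stays in the new leaf L2 and is also
# handed to the parent as the separator key.
def insert_into_leaf(leaf, entry, d=2):
    bisect.insort(leaf, entry)
    if len(leaf) <= 2 * d:
        return None, None            # enough space: no split
    mid = len(leaf) // 2             # redistribute entries evenly
    new_leaf = leaf[mid:]            # right half moves to new node L2
    del leaf[mid:]
    return new_leaf, new_leaf[0]     # first key of L2 is copied up
```

Inserting 8* into the full left-most leaf [2*, 3*, 5*, 7*] yields [2*, 3*] and [5*, 7*, 8*], with 5 copied up, exactly as in the example that follows.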
B+ Tree (cont.)
Inserting 8* into Example B+ Tree :
If we insert entry 8*, it belongs to the left-most leaf, which is already full. This
insertion causes a split of the leaf page; the split pages are shown in the below
figure. The tree must now be adjusted to take the new leaf page into account.

Entry to be inserted in the parent node: 5. The full leaf [2* 3* 5* 7*]
splits into [2* 3*] and [5* 7* 8*], and the key 5 is copied up into
the parent. (Note that 5 is copied up and continues to appear in the
leaf.)

The parent [13 | 17 | 24 | 30], after receiving 5, must itself split:
the key 17 is pushed up and appears only once in the index, leaving
index nodes [5 | 13] and [24 | 30]. (Contrast this with a leaf split.)
B+ Tree (cont.)
Root: [17]
Children: [5 | 13] and [24 | 30]
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [19* 20* 22*] [24* 27* 29*] [33* 34* 38* 39*]

Fig: Example B+ Tree After Inserting 8*

 Notice that the root was split, increasing the tree's height by one.

An alternative to splitting is to redistribute entries of a node N
with a sibling before splitting the node; this improves average
occupancy. A sibling of a node N is a node that is immediately
to the left or right of N and has the same parent as N.
B+ Tree (cont.)
To delete a Data Entry from a B+ Tree:
 Start at the root and find the leaf L where the entry belongs.
 Remove the entry.
 If the deletion causes the leaf L to go below the minimum
occupancy threshold:
 We must either redistribute entries from an adjacent sibling
or merge the node with a sibling to maintain minimum
occupancy.
 If entries are redistributed between two nodes, their parent
node must be updated to reflect this.
 If two nodes are merged, their parent must be updated to
reflect this.
 If the last entry in the root node is deleted in this manner
because one of its children was deleted, the height of the
tree decreases by one.
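The redistribute-or-merge decision can be sketched like this (a simplified sketch assuming L has underflowed and S is its right sibling, with minimum occupancy d; the names are illustrative):

```python
# Sketch of handling underflow of leaf L with right sibling S.
# Minimum occupancy is d entries per node (except the root).
def handle_underflow(L, S, d=2):
    if len(S) > d:
        L.append(S.pop(0))               # borrow the smallest entry of S
        return 'redistribute', S[0]      # new separator key for the parent
    L.extend(S)                          # sibling also at minimum: merge
    return 'merge', None                 # parent loses one index entry
```

For example, deleting 20* leaves [22*] underflowed; borrowing 24* from [24*, 27*, 29*] updates the parent's separator key to 27, matching the example that follows.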
B+ Tree (cont.)
Deleting 19*, 20* and 24* from the Example B+ Tree:
 To delete entry 19*, we simply remove it from the
leaf page on which it appears.
 Deleting 20* is handled with redistribution: entry 24* moves
into the left leaf, and the key 27 is copied up into the parent,
replacing 24.
Root: [17]
Children: [5 | 13] and [27 | 30]
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 24*] [27* 29*] [33* 34* 38* 39*]

Fig: Example B+ Tree After Deleting 19* and 20*

B+ Tree (cont.)

Deletion of 24* causes merging of leaf pages and index pages:
the affected leaves merge into [22* 27* 29*], the index node [30]
is left with too few entries and is merged with its sibling, and the
root's only key 17 is pulled down into the merged node, decreasing
the tree's height by one.

Root: [5 | 13 | 17 | 30]
Leaves: [2* 3*] [5* 7* 8*] [14* 16*] [22* 27* 29*] [33* 34* 38* 39*]

Fig: Example B+ Tree After Deleting 24*

Hash-Based Indexes
Introduction

 Hash-based indexes are efficient for equality selections but
cannot support range searches.
 Hash-based indexes use a collection of buckets.
Buckets contain data entries.
 Hash-based indexes use a hash function to locate the
data entries; there is no need for "index entries" in this
scheme.
 Hash function: it maps values in a search field into a
range of bucket numbers to find the page on which
a desired data entry belongs.
Types of Hash-Based Indexes
Static Hashing: This scheme uses a fixed number of
buckets. Like ISAM, it suffers from the problem of
long overflow chains which can affect performance.
Dynamic Hashing: It avoids the problems of static
hashing. There are two schemes.
1) Extendible Hashing: This scheme uses a directory
to support inserts and deletes efficiently without any
overflow pages.
2) Linear Hashing : This scheme uses a clever policy
for creating new buckets and supports inserts and
deletes efficiently without the use of a directory.
Although overflow pages are used, the length of
overflow chains is rarely more than two.
Static Hashing
 Static hashing divides the pages into buckets.
 The buckets are fixed. Each bucket contains one
primary page and overflow pages are added if
needed.
 Buckets contain data entries.

key --h--> h(key) mod N --> one of the primary bucket pages 0, 1, ..., N-1,
each possibly followed by a chain of overflow pages.

Fig: Static Hashing
Static Hashing (Contd.)
 Search: To search for a data entry, hash function is
applied to identify the bucket to which it belongs
and then the bucket is searched.
 The hash function works on the search key field of the record
and distributes values in the search field uniformly over the
collection of buckets. The bucket is identified using
h(key) mod N.
 Insert: To insert a data entry, we use the hash
function to identify the correct bucket and then put
the data entry there. If there is no space for this data
entry, we allocate a new overflow page and put the data
entry on this page.
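The search and insert operations can be sketched as follows (the bucket count, page capacity, and identity hash function are assumptions for illustration):

```python
# Sketch of static hashing: N fixed buckets, each a primary page plus
# a chain of overflow pages, modeled here as a list of fixed-size lists.
N = 4            # fixed number of primary buckets (assumed)
CAPACITY = 2     # data entries per page (assumed)
buckets = [[[]] for _ in range(N)]   # one primary page per bucket

def h(key):
    return key   # identity hash, for illustration only

def insert(key):
    pages = buckets[h(key) % N]      # bucket = h(key) mod N
    for page in pages:
        if len(page) < CAPACITY:
            page.append(key)
            return
    pages.append([key])              # allocate a new overflow page

def search(key):
    return any(key in page for page in buckets[h(key) % N])
```

Inserting 5, 9 and 13 sends all three to bucket 1: the primary page fills with 5 and 9, and 13 lands on a newly allocated overflow page, illustrating how chains form.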
Static Hashing (Contd.)
 Delete: To delete a data entry, we use the hashing
function to identify the correct bucket, locate the
data entry by searching the bucket, and then
remove it. If this data entry is the last in an overflow
page, the overflow page is removed from the
overflow chain of the bucket.
 Search ideally requires just one disk I/O, and insert
and delete operations require two I/Os (read and
write the page), although the cost could be higher in
the presence of overflow pages.
 As the file grows, long overflow chains can develop
and degrade performance.
Extendible Hashing
 The extendible hashing technique avoids overflow pages by
using a directory.
 It uses a directory of pointers to buckets, and doubles the
number of buckets by doubling just the directory and splitting
only the bucket that overflowed.
 The basic technique used in Extendible Hashing is to treat
the result of applying a hash function h as a binary number
and to interpret the last d bits (where d depends on the size
of the directory) to locate the desired bucket.
 Search: To locate a data entry, we apply the hash function to
the search field and take the last two bits of its binary
representation to get a number between 0 and 3. The
pointer in this array position gives us the desired bucket.
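Extracting the last d bits is a simple bit mask; a one-line sketch (in the four-bucket example below, d, the global depth, is 2):

```python
# The directory slot is the last `global_depth` bits of the hash value.
def bucket_index(hash_value, global_depth):
    return hash_value & ((1 << global_depth) - 1)
```

For hash value 5 (binary 101) and global depth 2, the last two bits 01 give slot 1, i.e. bucket B in the figure that follows.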
Extendible Hashing(cont.)
Directory (global depth 2): 00 -> Bucket A, 01 -> Bucket B, 10 -> Bucket C, 11 -> Bucket D
Bucket A (local depth 2): 4* 12* 32* 16*
Bucket B (local depth 2): 1* 5* 21*
Bucket C (local depth 2): 10*
Bucket D (local depth 2): 15* 7* 19*

Fig: Example of Extendible Hashing (directory and data pages)

 To locate a data entry with hash value 5 (binary 101), we look at
directory element 01 and follow the pointer to the data page
(bucket B in the figure).
Extendible Hashing(cont.)
Insert:
 To insert a data entry, we search to find the
appropriate bucket.
 For example, to insert a data entry with hash value
13 (binary 1101), we would examine directory
element 01 and go to the page containing data
entries 1*, 5*, and 21*.
 Since the bucket has space for an additional data
entry, we are done after we insert the entry.
 If the bucket is full, split it (allocate a new page and
redistribute the entries). If necessary, double the directory;
this is decided by comparing the global depth with the local
depth of the split bucket.
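The split-and-maybe-double step can be sketched as follows (the list-based directory, the Bucket class, and integer hash values are assumptions for illustration):

```python
# Sketch: split the full bucket at directory slot `idx`, doubling the
# directory first if the bucket's local depth equals the global depth.
class Bucket:
    def __init__(self, local_depth, entries):
        self.local_depth = local_depth
        self.entries = entries

def split_bucket(directory, global_depth, idx):
    bucket = directory[idx]
    if bucket.local_depth == global_depth:
        directory = directory + directory   # double: both halves initially
        global_depth += 1                   # point to the same buckets
    bucket.local_depth += 1
    bit = bucket.local_depth - 1            # the newly significant bit
    image = Bucket(bucket.local_depth,      # `split image' of the bucket
                   [e for e in bucket.entries if (e >> bit) & 1])
    bucket.entries = [e for e in bucket.entries if not (e >> bit) & 1]
    # Redirect directory slots whose extra bit is 1 to the split image
    for i, b in enumerate(directory):
        if b is bucket and (i >> bit) & 1:
            directory[i] = image
    return directory, global_depth
```

With the figure's Bucket A = {4*, 12*, 32*, 16*} at slot 0 (local depth 2, global depth 2), splitting yields A = {32*, 16*} and its split image A2 = {4*, 12*}; the entry 20* (last three bits 100) then goes into A2, matching the example that follows.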
Extendible Hashing(cont.)
 Consider the insertion of data entry 20* (binary 10100).
Looking at directory element 00, we are led to bucket A,
which is already full. We must first split the bucket by
allocating a new bucket and redistributing the contents across
the old bucket and its split image. Directory is also doubled.
Fig: Insertion of 20* (binary 10100): Bucket A splits into A (32* 16*)
and its `split image' A2 (4* 12* 20*), both now with local depth 3.
The directory doubles to global depth 3 (slots 000-111): slot 000
points to A and slot 100 to A2, while slots 001/101 point to
Bucket B (1* 5* 21* 13*), 010/110 to Bucket C (10*), and 011/111
to Bucket D (15* 7* 19*).
Extendible Hashing(cont.)
 After doubling the directory, to redistribute entries
across the old bucket and its split image, we
consider the last three bits of h(r).
Global depth of directory: Max number of bits needed
to tell which bucket an entry belongs to.
Local depth of a bucket: Number of bits used to
determine if an entry belongs to this bucket.
A bucket split does not necessarily require a
directory doubling. If a bucket whose local depth
is equal to the global depth is split, the directory
must be doubled.
Extendible Hashing(cont.)
Delete:
 To delete a data entry, the data entry is located and
removed.
 If removal of the data entry makes the bucket empty, it can be
merged with its `split image'. Merging buckets decreases
the local depth.
 If each directory element points to the same bucket as its split
image, we can halve the directory and reduce the global
depth.
 If the directory fits in memory, an equality search can be
answered with one disk access; otherwise, two.
Linear Hashing
 This is another dynamic hashing scheme, an
alternative to Extendible Hashing.
 Linear Hashing(LH) handles the problem of long
overflow chains without using a directory, and
handles duplicates.
 The scheme utilizes a family of hash functions h0, h1,
h2, …, with the property that each function's range is
twice that of its predecessor. That is, if hi maps a data
entry into one of M buckets, hi+1 maps a data entry
into one of 2M buckets.
 Linear hashing proceeds in rounds. It splits the
buckets in a linear(round-robin) order.
Linear Hashing (Contd.)
 During round number Level, only hash functions hLevel and
hLevel+1 are in use. The buckets in the file at the beginning of
the round are split, one by one from the first to the last
bucket, thereby doubling the number of buckets.
Fig: Buckets during a Round in Linear Hashing: `Next' marks the
next bucket to be split. Buckets 0 to Next-1 have already been
split in this round, and their `split image' buckets (created
through splitting of other buckets in this round) lie past the
range of h_Level. If h_Level(search key value) falls in the
already-split range, we must use h_Level+1(search key value) to
decide whether the entry is in the `split image' bucket.
Linear Hashing (Contd.)
 Splitting proceeds in `rounds'. A round ends when
all N_R initial (for round R) buckets have been split. Buckets
0 to Next-1 have been split; buckets Next to N_R - 1 are yet to
be split.
Search:
 To search for a data entry with a given search key value, we
apply hash function h_Level; if this leads us to one of the
unsplit buckets, we simply look there.
 If it leads us to one of the split buckets, the entry may be
there or it may have been moved to the new bucket created
earlier in this round by splitting this bucket; to determine
which of these two buckets contains the entry, we apply
h_Level+1.
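The search logic can be sketched as follows (assuming h_i(key) = key mod (N * 2^i), which is consistent with N_Level = N * 2^Level used below):

```python
# Sketch of choosing the bucket in Linear Hashing.
def lh_bucket(key, level, next_split, n0):
    n_level = n0 * (2 ** level)       # buckets at the start of this round
    b = key % n_level                 # apply h_Level
    if b < next_split:                # bucket already split this round:
        b = key % (2 * n_level)       # apply h_Level+1 to disambiguate
    return b
```

With N=4, Level=0 and Next=1, key 44 maps to bucket 0 under h0; that bucket has already been split, so h1 sends 44 to bucket 100 (= 4), its split image, while key 9 maps to the still-unsplit bucket 1.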
Linear Hashing (Contd.)
Level=0, N=4, Next=0 (each bucket holds four data entries)
  h1   h0   Primary pages
  000  00   32* 44* 36*
  001  01   9* 25* 5*      <- a data entry r with h(r)=5
  010  10   14* 18* 10* 30*
  011  11   31* 35* 7* 11*

Fig: Example of a Linear Hashed File
Linear Hashing (Contd.)

 A counter Level is used to indicate the current round


number and is initialized to 0.
 The bucket to split is denoted by Next and is initially
bucket 0 (the first bucket).
 We denote the number of buckets in the file at the
beginning of round Level by N_Level, where N_Level = N * 2^Level.
 Let the number of buckets at the beginning of round 0,
denoted by N_0, be N.
 Each bucket can hold four data entries, and the file
initially contains four buckets, as shown in the figure in
previous slide.
Linear Hashing (Contd.)
Insert:
 Find the bucket by applying h_Level (or h_Level+1 for
already-split buckets).
 Insert the data entry into the bucket if it is not full.
 If the bucket is full:
• Add an overflow page and insert the data entry.
• Split the Next bucket and increment Next.
• If the Next pointer points to this very bucket, we do not need
a new overflow page; the new entry is inserted into the
split bucket.
 Insertion of data entry 43* triggers a split. Whenever a split is
triggered, the Next bucket is split, and hash function h_Level+1
redistributes entries between this bucket and its split image.
 After splitting a bucket, the value of Next is incremented by 1.
 The file after completing the insertion of 43* is shown on the
next slide.
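The insert-and-split cycle can be sketched as follows (a simplified in-memory model: overflow pages are folded into each bucket's list, and "an insert overflowed a bucket" is assumed as the split trigger, which is one common policy):

```python
# Sketch of Linear Hashing with round-robin splits.
class LinearHashFile:
    def __init__(self, n0=4, capacity=4):
        self.n0, self.capacity = n0, capacity
        self.level, self.next = 0, 0
        self.buckets = [[] for _ in range(n0)]

    def _bucket(self, key):
        n = self.n0 * 2 ** self.level
        b = key % n                                    # h_Level
        return key % (2 * n) if b < self.next else b   # h_Level+1 if split

    def insert(self, key):
        b = self._bucket(key)
        overflowing = len(self.buckets[b]) >= self.capacity
        self.buckets[b].append(key)   # extra entries model overflow pages
        if overflowing:
            self._split()             # split bucket `next`, not bucket b

    def _split(self):
        n = self.n0 * 2 ** self.level
        old, self.buckets[self.next] = self.buckets[self.next], []
        self.buckets.append([])       # split image at slot next + n
        for k in old:                 # h_Level+1 redistributes entries
            self.buckets[k % (2 * n)].append(k)
        self.next += 1
        if self.next == n:            # end of round: buckets have doubled
            self.level, self.next = self.level + 1, 0
```

Replaying the figure's example: after loading the four initial buckets, inserting 43* overflows bucket 11, so bucket Next=0 is split and h1 sends 44* and 36* to the new bucket 100 while 32* stays, exactly as on the next slide.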
Linear Hashing (Contd.)
Example: on a split, h_Level+1 is used to redistribute entries.
Inserting 43* (h0(43) = 11) finds bucket 11 full, so 43* goes on a
new overflow page of that bucket, and bucket Next=0 is split
using h1.

Before (Level=0, N=4, Next=0):      After inserting 43* (Next=1):
  000  00  32* 44* 36*                000  00  32*
  001  01  9* 25* 5*                  001  01  9* 25* 5*
  010  10  14* 18* 10* 30*            010  10  14* 18* 10* 30*
  011  11  31* 35* 7* 11*             011  11  31* 35* 7* 11*  (overflow: 43*)
                                      100  00  44* 36*  (split image of bucket 0)
Linear Hashing (Contd.)
Example: End of a Round. After inserting 37*, 29*, 22*, 66*, 34*
and 50*, the last bucket of round 0 is split, the round ends, and
Level becomes 1 with Next reset to 0.

Level=0 (Next=3):                    Level=1 (Next=0):
  h1  h0  Primary / Overflow           h2    h1   Primary / Overflow
  000 00  32*                          0000  000  32*
  001 01  9* 25*                       0001  001  9* 25*
  010 10  66* 18* 10* 34*              0010  010  66* 18* 10* 34* / 50*
  011 11  31* 35* 7* 11* / 43*         0011  011  43* 35* 11*
  100 00  44* 36*                      0100  100  44* 36*
  101 01  5* 37* 29*                   0101  101  5* 37* 29*
  110 10  14* 30* 22*                  0110  110  14* 30* 22*
                                       0111  111  31* 7*

Fig: After inserting 37*, 29*, 22*, 66*, 34* and 50*
Linear Hashing (Contd.)
 When Next is equal to N_Level - 1 and a split is triggered, the
number of buckets after the split is twice the number at
the beginning of the round, and we start a new round
with Level incremented by 1 and Next reset to 0.

Delete:
It is essentially the inverse of insertion. If the last bucket in
the file is empty, it can be removed and Next can be
decremented.
If Next is 0 and the last bucket becomes empty, Next is made
to point to bucket (M/2) − 1, where M is the current number
of buckets, Level is decremented, and the empty bucket is
removed.
Linear Hashing (Contd.)
 An equality selection costs just one disk I/O unless
the bucket has overflow pages.
 Since buckets are split round-robin, long overflow
chains do not usually develop. If the data distribution is very
skewed (non-uniform), however, overflow chains
could cause Linear Hashing performance to be worse
than that of Extendible Hashing.
Extendible Hashing Vs Linear Hashing
 Extendible and Linear Hashing are closely related.
 Extendible Hashing uses a directory to locate buckets and
splits the bucket when it is full. Avoids overflow pages.
 Linear Hashing avoids a directory by splitting the buckets in
a round-robin fashion. Linear Hashing proceeds in rounds.
Uses overflow pages.
 Moving from h_Level to h_Level+1 in Linear Hashing corresponds to
doubling the directory in Extendible Hashing.
 By always splitting the appropriate bucket, Extendible
Hashing may lead to a reduced number of splits and higher
bucket occupancy.
 For uniform distributions, Linear Hashing has a lower
average cost for equality selections.
Extendible Hashing Vs Linear Hashing (Cont.)

 The disadvantage of Linear Hashing is that space utilization
could be lower, especially for skewed (non-uniform)
distributions, because the bucket splits are not concentrated
where the data density is highest, as they are in Extendible
Hashing.
Summary

 Hash-based indexes: best for equality searches;
cannot support range searches.
 Static Hashing can lead to long overflow chains.
 Extendible Hashing avoids overflow pages by
splitting a full bucket when a new data entry is to be
added to it. (Duplicates may require overflow pages.)
 Directory to keep track of buckets, doubles periodically.
 Can get large with skewed (non-uniform) data; additional
I/Os are needed if the directory does not fit in main memory.
Summary (Contd.)

 Linear Hashing avoids a directory by splitting buckets
round-robin and using overflow pages.
 Overflow chains are not likely to be long.
 Duplicates are handled easily.
 Space utilization could be lower than Extendible Hashing,
since splits are not concentrated on `dense' data areas.
